Other projects.

Message boards : Cafe Rosetta : Other projects.

Sid Celery

Joined: 11 Feb 08
Posts: 2495
Credit: 46,551,772
RAC: 260
Message 113213 - Posted: 17 Nov 2025, 15:43:13 UTC - in response to Message 113211.  

all WCG credits have disappeared and not come back
As can be seen in my sig here

At the time of writing, WCG has returned to my Boincstat team stats, but the team total isn't updated quite yet.
I think various of the numbers get updated at different points in the day, so I'm sure it'll right itself before much longer
Sid Celery

Joined: 11 Feb 08
Posts: 2495
Credit: 46,551,772
RAC: 260
Message 113215 - Posted: 18 Nov 2025, 1:45:35 UTC - in response to Message 113213.  

I think various of the numbers get updated at different points in the day, so I'm sure it'll right itself before much longer

Which it did several hours ago tbf.
Still waiting for WCG to update now ~330k credits
Sid Celery

Joined: 11 Feb 08
Posts: 2495
Credit: 46,551,772
RAC: 260
Message 113220 - Posted: 23 Nov 2025, 22:14:56 UTC

Latest
November 21, 2025
We are testing required changes to the scheduler and feeder to resolve the corrupt/truncated "os_name" and "os_version" entries such as "W"/"W" for some hosts, as reported by users in the forums, and to resolve frequent "stuck" feeder states where "No tasks available for platform" is logically incorrect by hr_class, yet the tasks populating the feeder shared memory segment remain unassigned by the scheduler passes and manual intervention is required to get work flowing again.
Passes through uploaded results that have not been credited by the new system will begin next week, to backfill missing credits. We have been performing dry runs to establish correctness. As a precaution, we will be running the program in multiple passes starting with the oldest uploads, to the most recent.
Volunteers have reported that the API sometimes shows an invalid state for multiple results, where only one result is marked valid, which should be impossible. Preliminary investigation points to the new MCM1 assimilation procedure interacting with the transitioner. The new MCM1 assimilation procedure acts to validate and credit all in progress results for a workunit as soon as it has consumed any pair/quorum of files, whether original 0 and 1 results or resends 2 and up, that have passed validation. We will review this issue in full and report our findings, whether a bug in the assimilator, or poorly modeled interaction between assimilator transactions and the transitioner, which is where we expect to find an explanation.

No mention of "23/11/2025 18:16:37 | World Community Grid | Server error: feeder not running"
Echoes of what was happening here.
I was going to ask if anyone might try to sniff out whether a new server was usable, but a short while ago this problem got fixed and everything has now uploaded.
No new tasks available yet to come down, though
Sid Celery

Joined: 11 Feb 08
Posts: 2495
Credit: 46,551,772
RAC: 260
Message 113222 - Posted: 24 Nov 2025, 11:18:43 UTC - in response to Message 113220.  

No new tasks available yet to come down, though

8hrs after writing that, tasks started coming through again
Bill Swisher

Joined: 10 Jun 13
Posts: 88
Credit: 62,766,065
RAC: 17,795
Message 113225 - Posted: 30 Nov 2025, 1:08:47 UTC - in response to Message 113222.  

Not exactly "Other projects", but does anyone have a clue as to what's going on around here? I got some tasks and since then it's become vewy, vewy qwiet.
Grant (SSSF)

Joined: 28 Mar 20
Posts: 1901
Credit: 18,534,891
RAC: 0
Message 113226 - Posted: 30 Nov 2025, 5:07:04 UTC - in response to Message 113225.  

Not exactly "Other projects", but does anyone have a clue as to what's going on around here? I got some tasks and since then it's become vewy, vewy qwiet.
It's been that way for 18 months+ now.
Grant
Darwin NT
Sid Celery

Joined: 11 Feb 08
Posts: 2495
Credit: 46,551,772
RAC: 260
Message 113232 - Posted: 5 Dec 2025, 5:26:35 UTC

Latest update
December 4, 2025
BOINC feeder/scheduler reporting "tasks committed to other platforms" is resolved - details are further down about the resolution and future plans to keep this issue from coming back.
Validation Backlog has begun for workunits that were held over the break, and workunits that fell through our new validation logic unvalidated. We intend to ramp up these passes in the coming days, and will report on progress and project expected dates for fully backfilling all such cases and finally catching up validations to in flight work next week, now that we know our scripting works to backfill validations.
We will not restart the file_deleter or db_purge BOINC services until we have validated every file we possess that was uploaded before/after the break, including sending resends for some cases of "orphans".
What was the workaround for the feeder/scheduler blockage due to hr_class mismatch between results for the same workunit? The resolution we chose for now was to simply purge stale feeder entries, effectively resetting their hr_class (homogeneous redundancy) to 0 and allowing any host/platform to download the result, if the result sits in memory for too long. The feeder can be started with a CLI option and a specified time frame for occupancy of a result in a slot before it considers this course.
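To make the purge described above a bit more concrete, here is a minimal sketch of the idea in Python. It is not WCG's or BOINC's actual feeder code: the `Slot` class, the `relax_stale_slots` function and the `max_age_seconds` parameter are all hypothetical names made up for illustration, and the update does not say what the real CLI option is called.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Slot:
    """Hypothetical stand-in for one entry in the feeder's shared memory."""
    result_name: str
    hr_class: int      # 0 means any host/platform may take the result
    entered_at: float  # when the result was placed in the slot (seconds)


def relax_stale_slots(slots: List[Optional[Slot]], now: float,
                      max_age_seconds: float) -> List[str]:
    """Reset hr_class to 0 for results that have sat in a slot too long.

    Returns the names of the results that were relaxed.
    """
    relaxed = []
    for slot in slots:
        if slot is None:
            continue  # empty slot, nothing to do
        is_stale = now - slot.entered_at > max_age_seconds
        if slot.hr_class != 0 and is_stale:
            slot.hr_class = 0  # any platform may now claim this result
            relaxed.append(slot.result_name)
    return relaxed


# Illustrative data only: one stale resend, one fresh result, one recent resend.
slots = [
    Slot("wu1_2", hr_class=3, entered_at=0.0),
    Slot("wu2_0", hr_class=0, entered_at=0.0),
    Slot("wu3_2", hr_class=5, entered_at=90.0),
    None,
]
relaxed = relax_stale_slots(slots, now=100.0, max_age_seconds=60.0)
```

With these made-up numbers only the first resend is older than the limit, so it is the only one that gets opened up to all platforms; the recent resend keeps its hr_class.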
What does resetting hr_class=0 as a workaround accomplish? The hr_class=0 reset matches the value assigned to fresh workunit results being sent out for the first time, essentially dictating to the scheduler that any host/platform may claim and compute this result (i.e., _0 and _1 results have hr_class=0, while resends consult the hr_class of the host that has already reported results). There is some computational overhead, as a second tier of validation is then required to validate that the exact gene signatures and their scores are "the same" between these results computed on different platforms, in the case of purged resends that had their hr_class reset to 0. We intend to disable hr_class (homogeneous redundancy) completely for MCM1 at some point in the future, and instead rely directly on this currently secondary validation, together with a record of the delta between exact scores and verification of equivalent gene signatures found for these results sent to different platforms, to ensure they are within a reasonable error bound/tolerance as a rule.
Does this workaround affect the integrity of MCM1 results? No, but it does introduce new edge cases to account for. The score can vary within the upper and lower bound of possible floating point error between platforms for the same workunit. Ensuring that the floating point calculations are not different enough to invalidate the computational result is a vastly easier problem when using the hr_class mechanism. However, because MCM1 produces a list of genes as well as a score, the only additional validation criteria we incur by disabling hr_class are ostensibly the "score is just below the threshold on this system" exclusion and the "score is just above the threshold on this system" inclusion, for specific signatures very close to the configured threshold. In these cases, we can take the union of these additional results slightly above or below the threshold score, between all results for a workunit, provided the rest of the results above the threshold are equivalent.
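The near-threshold rule described above can be sketched roughly as follows. This is only a guess at the shape of such a check, not WCG's validator: the `merge_results` function, the dict-of-scores layout, and the `threshold` and `tol` values are all assumptions for illustration. Each result is treated as the set of signatures that scored above the threshold on that platform.

```python
from typing import Dict, Optional


def merge_results(a: Dict[str, float], b: Dict[str, float],
                  threshold: float, tol: float) -> Optional[Dict[str, float]]:
    """Compare two results for one workunit computed on different platforms.

    Returns the union of reported signatures if the results are equivalent
    within tol, or None if they disagree.
    """
    merged = {}
    for sig in set(a) | set(b):
        if sig in a and sig in b:
            # Reported by both platforms: scores must agree within tolerance.
            if abs(a[sig] - b[sig]) > tol:
                return None
            merged[sig] = (a[sig] + b[sig]) / 2
        else:
            # Reported by only one platform: acceptable only if the score
            # sits so close to the threshold that floating point error
            # could have pushed it below on the other platform.
            score = a[sig] if sig in a else b[sig]
            if score - threshold > tol:
                return None
            merged[sig] = score
    return merged


# Illustrative numbers only.
linux = {"g1|g2": 0.9, "g3|g4": 0.505}
windows = {"g1|g2": 0.9004}
merged = merge_results(linux, windows, threshold=0.5, tol=0.01)
```

Here "g3|g4" is only just above the made-up threshold on one platform, so it is kept via the union, while a signature that is clearly above the threshold on one platform and missing on the other would make the pair fail.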
Why have hr_class at all for MCM1 then? Indeed. We intend to track the above cases, and any other cases among validation failures where we can discern any unforeseen effect of allowing resends to potentially go to different platforms. We also intend to try this "disable hr_class if the feeder gets stuck" system for MAM1, which does have a numerical optimization routine to explore the signature search space that could change the actual signatures under test due to floating point error, and so may not be a good candidate for this (and yet the calculations are valid, so any reasonable overlap or a "canary" or "spike-in" validation system might be considered sufficient validation...). If we are satisfied with the outcome of post-processing results that came from different platforms, we can disable hr_class. This will accelerate throughput and discovery for MCM1 and possibly MAM1 while buying time to resolve this issue more permanently for applications such as ARP1, to which this thinking does not apply, where the floating point calculations must be byte-wise equivalent between results or the result is simply invalid. Once we can confirm that newer 8.x+ BOINC clients permitting WSL on Windows hosts are the only source of this hr_class confusion bug, and possibly the "W"/"W" os_name and os_version truncation bug, we can apply a targeted fix.


©2025 University of Washington
https://www.bakerlab.org