Other projects.

Sid Celery Send message Joined: 11 Feb 08 Posts: 2583 Credit: 47,220,881 RAC: 193	Message 113213 - Posted: 17 Nov 2025, 15:43:13 UTC - in response to Message 113211. all WCG credits have disappeared and not come back As can be seen in my sig here At the time of writing, WCG has returned to my Boincstat team stats, but the team total isn't updated quite yet. I think various of the numbers get updated at different points in the day, so I'm sure it'll right itself before much longer ID: 113213 · Rating: 0 · rate: / Reply Quote

Sid Celery Send message Joined: 11 Feb 08 Posts: 2583 Credit: 47,220,881 RAC: 193	Message 113215 - Posted: 18 Nov 2025, 1:45:35 UTC - in response to Message 113213. I think various of the numbers get updated at different points in the day, so I'm sure it'll right itself before much longer Which it did several hours ago tbf. Still waiting for WCG to update now ~330k credits ID: 113215 · Rating: 0 · rate: / Reply Quote

Sid Celery Send message Joined: 11 Feb 08 Posts: 2583 Credit: 47,220,881 RAC: 193	Message 113220 - Posted: 23 Nov 2025, 22:14:56 UTC Latest November 21, 2025 We are testing required changes to the scheduler and feeder to resolve the corrupt/truncated "os_name" and "os_version" entries such as "W"/"W" for some hosts, as reported by users in the forums, and to resolve frequent "stuck" feeder states where "No tasks available for platform" is logically incorrect by hr_class, yet the tasks populating the feeder shared memory segment remain unassigned by the scheduler passes and manual intervention is required to get work flowing again. Passes through uploaded results that have not been credited by the new system will begin next week, to backfill missing credits. We have been performing dry runs to establish correctness. As a precaution, we will be running the program in multiple passes starting with the oldest uploads, to the most recent. Volunteers have reported that the API sometimes shows an invalid state for multiple results, where only one result is marked valid, which should be impossible. Preliminary investigation points to the new MCM1 assimilation procedure interacting with the transitioner. The new MCM1 assimilation procedure acts to validate and credit all in progress results for a workunit as soon as it has consumed any pair/quorum of files, whether original 0 and 1 results or resends 2 and up, that have passed validation. We will review this issue in full and report our findings, whether a bug in the assimilator, or poorly modeled interaction between assimilator transactions and the transitioner, which is where we expect to find an explanation. No mention of "23/11/2025 18:16:37 \| World Community Grid \| Server error: feeder not running" Echoes of what was happening here. I was going to ask if anyone might try to sniff out whether a new server was usable, but a short while ago this problem got fixed and everything has now uploaded. 
No new tasks available yet to come down, though ID: 113220 · Rating: 0 · rate: / Reply Quote

Sid Celery Send message Joined: 11 Feb 08 Posts: 2583 Credit: 47,220,881 RAC: 193	Message 113222 - Posted: 24 Nov 2025, 11:18:43 UTC - in response to Message 113220. No new tasks available yet to come down, though 8hrs after writing that, tasks started coming through again ID: 113222 · Rating: 0 · rate: / Reply Quote

Bill Swisher Send message Joined: 10 Jun 13 Posts: 103 Credit: 67,322,882 RAC: 10,073	Message 113225 - Posted: 30 Nov 2025, 1:08:47 UTC - in response to Message 113222. Not exactly "Other projects", but does anyone have a clue as to what's going on around here? I got some tasks and since then it's become vewy, vewy qwiet. ID: 113225 · Rating: 0 · rate: / Reply Quote

Grant (SSSF) Send message Joined: 28 Mar 20 Posts: 1935 Credit: 18,534,891 RAC: 0	Message 113226 - Posted: 30 Nov 2025, 5:07:04 UTC - in response to Message 113225. Not exactly "Other projects", but does anyone have a clue as to what's going on around here? I got some tasks and since then it's become vewy, vewy qwiet. It's been that way for 18 months+ now. Grant Darwin NT ID: 113226 · Rating: 0 · rate: / Reply Quote

Sid Celery Send message Joined: 11 Feb 08 Posts: 2583 Credit: 47,220,881 RAC: 193	Message 113232 - Posted: 5 Dec 2025, 5:26:35 UTC Latest update December 4, 2025 BOINC feeder/scheduler reporting "tasks committed to other platforms" is resolved - details are further down about the resolution and future plans to keep this issue from coming back. Validation Backlog has begun for workunits that were held over the break, and workunits that fell through our new validation logic unvalidated. We intend to ramp up these passes in the coming days, and will report on progress and project expected dates for fully backfilling all such cases and finally catching up validations to in flight work next week, now that we know our scripting works to backfill validations. We will not restart the file_deleter or db_purge BOINC services until we have validated every file we possess that was uploaded before/after the break, including sending resends for some cases of "orphans". What was the workaround for the feeder/scheduler blockage due to hr_class mismatch between results for the same workunit? The resolution to the issue that we chose for now, was to simply purge stale feeder entries effectively resetting their hr_class (homogenous redundancy) to 0 and allowing any host/platform to download the result if the result sits in memory for too long. The feeder can be started with a CLI option and specified time frame for occupancy of a result in a slot before it considers this course. What does resetting hr_class=0 as a workaround accomplish? The hr_class=0 reset matches the value assigned to fresh workunit results being sent out for the first time, essentially dictating to the scheduler that any host/platform may claim and compute this result (i.e., _0 and _1 results have hr_class=0, resends consult the hr_class of the host that reported results already). 
There is some computational overhead, as a second tier of validation is then required to validate that the exact gene signatures and their scores are "the same" between these results computed on different platforms in the case of purged resends that had their hr_class reset to 0. We intend to disable hr_class (homogenous redundancy) completely for MCM1 at some point in the future, and instead rely directly on this currently secondary validation, and record of the delta between exact scores and verification of equivalent gene signatures found for these results sent to different platforms to ensure they are within a reasonable error bound/tolerance as a rule. Does this workaround affect the integrity of MCM1 results? No, but it does introduce new edge cases to account for. The score can vary within the upper and lower bound of possible floating point error between platforms for the same workunit. Ensuring that the floating point calculations are not different enough to invalidate the computational result is a vastly easier problem when using the hr_class mechanism. However, because MCM1 produces a list of genes as well as a score, the only additional validation criteria we incur by disabling hr_class is ostensibly "score is just below the threshold on this system" exclusion, and "score is just above the threshold on this system" inclusion, for specific signatures very close to the configured threshold. In these cases, we can take the union of these additional results slightly above or below the threshold score, between all results for a workunit, provided the rest of the results above the threshold are equivalent. Why have hr_class at all for MCM1 then? Indeed. 
We intend to track the above cases and any other cases among validation failures where we can discern any unforeseen effect of allowing resends to potentially go to different platforms, and to try this "disable hr_class if the feeder gets stuck" system for MAM1, which does have a numerical optimization routine to explore the signature search space that could change the actual signatures under test due to floating point error and so may not be a good candidate for this (and yet the calculations are valid, so any reasonable overlap or a "canary" or "spike-in" validation system might be considered sufficient validation...). If we are satisfied with the outcome of post-processing results that came from different platforms, we can disable it. This will accelerate throughput and discovery for MCM1 and possibly MAM1 while buying time to resolve this issue more permanently for applications such as ARP1 that this thinking does not apply to, where the floating point calculations must be byte-wise equivalent between results or the result is simply invalid. Once we can confirm that newer 8.x+ BOINC clients permitting WSL on Windows hosts is the only source of this hr_class confusion bug, and possibly the "W"/"W" os_name and os_version truncation bug, we can apply a targeted fix. ID: 113232 · Rating: 0 · rate: / Reply Quote
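For anyone trying to picture the hr_class workaround described in the update above, here is a rough sketch of the idea in Python. It is only a toy model, not WCG's or BOINC's actual feeder code (the real feeder is a C++ server daemon). The names `FeederSlot`, `purge_stale_slots` and `max_age_seconds` are made up for illustration, and the sample timings are arbitrary.

```python
from dataclasses import dataclass


@dataclass
class FeederSlot:
    """Hypothetical stand-in for one result sitting in the feeder's shared memory."""
    result_name: str
    hr_class: int      # 0 means any host/platform may claim the result
    entered_at: float  # time (in seconds) at which the result entered the slot


def purge_stale_slots(slots, now, max_age_seconds):
    """Reset hr_class to 0 for results that have occupied a slot for too long.

    Returns the names of the results that were reset.
    """
    reset = []
    for slot in slots:
        age = now - slot.entered_at
        if slot.hr_class != 0 and age > max_age_seconds:
            slot.hr_class = 0  # any platform may now download this result
            reset.append(slot.result_name)
    return reset


# Made-up example: one fresh _0 result and two resends tied to a platform class.
slots = [
    FeederSlot("wu1_0", hr_class=0, entered_at=0.0),
    FeederSlot("wu2_2", hr_class=3, entered_at=0.0),
    FeederSlot("wu3_2", hr_class=5, entered_at=900.0),
]
reset_names = purge_stale_slots(slots, now=1000.0, max_age_seconds=600.0)
print(reset_names)
```

Only the resend that has sat in its slot longer than the configured time is opened up to every platform; the recently added resend keeps its hr_class, so the extra second-tier validation cost is only paid for results that were actually stuck.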

Sid Celery Send message Joined: 11 Feb 08 Posts: 2583 Credit: 47,220,881 RAC: 193	Message 113249 - Posted: 16 Dec 2025, 0:05:05 UTC New update - not looking good tbh December 15, 2025 Forum service restored, after degraded service starting roughly 03:00 UTC, December 15th, 2025 led to a crash at roughly 12:30 UTC same day - service was restored at approximately 20:00 UTC Dec 15th, 2025. We have seen this before, due to database connections waiting indefinitely or so long as to eventually reach the thread pool maximum, causing OOM kill of the forum application under WAS, this is the meaning of the ForumUserServlet unable to initialize message displayed while the application was down. Previously, poor parameterization of the thread pool under WAS caused connections to the database to stay open instead of timing out under lock contention, we thought we had ameliorated the issue on the WAS side, but clearly this needs another look. At the very least, we will deploy alerts for the specific WARN and ERROR messages logged by WAS leading up to the crash which should provide a window of many hours within which we can fix this manually in the future, before the forum application crashes, pending a confirmed fix or workaround (e.g. better parameters and logic for managing thread pool). After announcing we would be accelerating validations "in the coming days", validations stalled again last week. Unfortunately, we continue to face issues with MCM1 validations. There are multiple categories of missed validations - orphaned "singles", mis-routed one or both results, incorrectly invalidated result pairs, missing resend condition, and now floating point tolerance too stringent for hr_class reset workunits which was the workaround to the impossible platform logjam issue, at the expense of having to validate workunits based on similarity of scientific results within a tolerance instead of strict equivalence of scientific results. 
The concept of validity for MCM1 became result pairs that have equivalent gene signature membership for signatures above the threshold score, and an equivalent list of gene signatures above the threshold score, and a similar score within a configurable error bound passed to the validator on startup, when the two workunits run with the same parameters and random seed value. While these cases should therefore be validated by the secondary validator we subscribed to the "validation failure" queue downstream of the primary, checksum based validator, our tolerance for floating point error was too stringent and we will be replaying the failure queue from an earlier offset to catch these cases for recently resent workunits. Regarding our approach to crediting workunits held during the downtime by scanning the filesystem and checking the database, the process began last week after indexing the locations of result files for all workunits across all filesystems on the backend, so that validations that involved file transfers could avoid the process of walking remote filesystems and simply fetch the required remote result from wherever it had been uploaded or archived. Initial testing suggested we would catch most if not all missed validations using this approach, though the scripts each running on each worker node would have to run for some time. Clearly, despite thinking perhaps the timestamp-based approach was simply getting through points in time with few missing validations of any case early on, we are not making the expected progress. We are reviewing logs and stats on what has been processed so far to figure out what we missed, and how to adjust. Some validations have occurred for each case, just nowhere near the expected throughput/hit rate we projected. So, we are tentatively hopeful we can fix this quickly and start finally making a dent. ID: 113249 · Rating: 0 · rate: / Reply Quote
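The tolerance-based notion of validity in the update above (same signatures above the threshold, similar scores within an error bound, plus the near-threshold union case from the 4 December update) might look something like the sketch below. This is a guess at the logic and not WCG's validator. It assumes each result is a dict mapping a gene signature (a tuple of gene names) to its score; the gene names, threshold and tolerance values are invented for the example.

```python
def signatures_above(result, threshold):
    """Return the set of signatures whose score is at or above the threshold."""
    return {sig for sig, score in result.items() if score >= threshold}


def pair_is_valid(a, b, threshold, tolerance):
    """Compare two results (dicts of signature -> score) for the same workunit."""
    above_a = signatures_above(a, threshold)
    above_b = signatures_above(b, threshold)

    # A signature reported above the threshold by only one result is accepted
    # only if both platforms scored it within `tolerance` of the threshold.
    for sig in above_a ^ above_b:
        if sig not in a or sig not in b:
            return False
        if abs(a[sig] - threshold) > tolerance or abs(b[sig] - threshold) > tolerance:
            return False

    # Signatures reported above the threshold by both must agree within tolerance.
    for sig in above_a & above_b:
        if abs(a[sig] - b[sig]) > tolerance:
            return False
    return True


# Invented data: the second signature straddles the threshold on the two platforms.
threshold, tolerance = 0.80, 1e-6
res_a = {("g1", "g2"): 0.91, ("g3", "g4"): 0.8000004}
res_b = {("g1", "g2"): 0.9100002, ("g3", "g4"): 0.7999997}
res_c = {("g1", "g2"): 0.95, ("g3", "g4"): 0.8000004}

ok_pair = pair_is_valid(res_a, res_b, threshold, tolerance)
bad_pair = pair_is_valid(res_a, res_c, threshold, tolerance)
# Union of signatures either platform placed above the threshold.
accepted = signatures_above(res_a, threshold) | signatures_above(res_b, threshold)
print(ok_pair, bad_pair, sorted(accepted))
```

If `tolerance` is set too small, pairs like `res_a`/`res_b` get rejected even though they agree scientifically, which seems to be the "too stringent" failure the update describes.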

Butch Kemper Send message Joined: 18 Apr 20 Posts: 1 Credit: 2,398,843 RAC: 27	Message 113255 - Posted: 17 Dec 2025, 18:36:20 UTC Last modified: 17 Dec 2025, 18:58:00 UTC In the last few days, Rosetta has stopped running and this screen appears. The BOINC screen saver starts normally and when Rosetta starts, this window is displayed I had to take a picture because if I touch the keyboard or mouse, the screen disappears; Here is the picture: I can not get the image to display. This is the url https://ibb.co/Rpmqs2j7 copy and paste in your browser. Butch ID: 113255 · Rating: 0 · rate: / Reply Quote

Sid Celery Send message Joined: 11 Feb 08 Posts: 2583 Credit: 47,220,881 RAC: 193	Message 113257 - Posted: 17 Dec 2025, 19:02:42 UTC - in response to Message 113255. In the last few days, Rosetta has stopped running and this screen appears. The BOINC screen saver starts normally and when Rosetta starts, this window is displayed I had to take a picture because if I touch the keyboard or mouse, the screen disappears; Here is the picture: I can not get the image to display. This is the url https://ibb.co/Rpmqs2j7 copy and paste in your browser. I don't use the Boinc Screensaver and everything's running fine, but if I click on a running task and select "Show Graphics" it comes up with the same thing as your image, so it seems like a fault in their coding. Turn off the Boinc screensaver and I expect your tasks will run fine too. I don't expect there's anything any of us can do about it ID: 113257 · Rating: 0 · rate: / Reply Quote

Sid Celery Send message Joined: 11 Feb 08 Posts: 2583 Credit: 47,220,881 RAC: 193	Message 113320 - Posted: 25 Dec 2025, 21:47:49 UTC Latest update December 24, 2025 Pushed changes to BOINC transitioner to fix bug with upload bucket calculation for resends last night. This should drastically increase validations from resends going forward. We have also rebuilt the combined validation and assimilation pipeline for MCM1 and MAM1, which will finally enable us to start going through the validation backlog and clearing it out. We will try to leave a period of roughly 1-2 weeks for results to remain visible in a validated state (undeleted from the result table) as we validate and purge PV jail, which means removing the files from the in-memory cache and the result records from the database. We expect the process to take about 3 weeks. Working to resolve reported issue with profile changes again not propagating from website to BOINC clients the issue has a different presentation this time, we are working to resolve the issue. MAM1 beta workunits, and soon small smoke tests of the production pipeline, are being released - last week, we began smoke testing the MAM1 beta project (beta30) again, and we are working on the Windows and GPU-enabled builds in preparation for beginning the production run, keeping in mind the application we are using to run MAM1 will be backported to MCM1 as well so that we can take advantage of the modern features of the PyTorch/LibTorch backend. Thank you to volunteers for reporting outcomes in the forums across multiple threads. In the new year as we begin daily beta testing of MAM1 to roll out the initial production runs for the project and add platforms and GPU compat, we will have a dedicated thread for reporting issues, outcomes, and asking questions about the beta30 application and results on the forums. Thank You for supporting open science through WCG. Happy Holidays and Happy New Year to all volunteers! 
ID: 113320 · Rating: 0 · rate: / Reply Quote

Sid Celery Send message Joined: 11 Feb 08 Posts: 2583 Credit: 47,220,881 RAC: 193	Message 113339 - Posted: 28 Dec 2025, 3:12:37 UTC - in response to Message 113320. Pushed changes to BOINC transitioner to fix bug with upload bucket calculation for resends last night. This should drastically increase validations from resends going forward. We have also rebuilt the combined validation and assimilation pipeline for MCM1 and MAM1, which will finally enable us to start going through the validation backlog and clearing it out. We will try to leave a period of roughly 1-2 weeks for results to remain visible in a validated state (undeleted from the result table) as we validate and purge PV jail, which means removing the files from the in-memory cache and the result records from the database. We expect the process to take about 3 weeks. I wasn't quite paying attention, but I did a manual WCG update and I <think> I saw a couple of 100k credits added since my previous update maybe 15hrs previously. I haven't been running any WCG tasks for several days, so this would appear to be catching up on the backlog of tasks awaiting validation. I can't view my Results status page - presumably because there are so many to wade through - but it looks like whatever is being done above is working. Let's see where we are in 3 weeks. Still no credit update to Boincstats. I'm about 1.2m light atm ID: 113339 · Rating: 0 · rate: / Reply Quote

Bill Swisher Send message Joined: 10 Jun 13 Posts: 103 Credit: 67,322,882 RAC: 10,073	Message 113340 - Posted: 28 Dec 2025, 3:37:16 UTC - in response to Message 113339. I can't view my Results status page - presumably because there are so many to wade through ... Still no credit update to Boincstats. I'm about 1.2m light atm Yep, I try and it goes into oblivion. I've let it sit there for an hour and it never finishes whatever it's doing. The difference between what stats.free-dc.org shows and what a look at the Projects tab in BOINC Manager shows is almost 5.5m for me. ID: 113340 · Rating: 0 · rate: / Reply Quote

Garrulus glandarius Send message Joined: 25 Apr 25 Posts: 26 Credit: 3,390,940 RAC: 6,720	Message 113341 - Posted: 28 Dec 2025, 4:08:24 UTC Last modified: 28 Dec 2025, 4:10:21 UTC I managed to get the results page to load and can confirm that PV jail has fewer inmates. Used to have around 1400 tasks pending, now down to under 1200. Among them are still some that were reported back in August. ID: 113341 · Rating: 0 · rate: / Reply Quote

Garrulus glandarius Send message Joined: 25 Apr 25 Posts: 26 Credit: 3,390,940 RAC: 6,720	Message 113343 - Posted: 28 Dec 2025, 10:31:04 UTC - in response to Message 113341. Well, after 6 hours I only see a few newly reported tasks added to PV jail but none seem to have been validated. I guess it might be because of the hundreds of thousands of tasks waiting to be validated. ID: 113343 · Rating: 0 · rate: / Reply Quote

Sid Celery Send message Joined: 11 Feb 08 Posts: 2583 Credit: 47,220,881 RAC: 193	Message 113348 - Posted: 29 Dec 2025, 14:06:25 UTC - in response to Message 113340. I can't view my Results status page - presumably because there are so many to wade through ... Still no credit update to Boincstats. I'm about 1.2m light atm Yep, I try and it goes into oblivion. I've let it sit there for an hour and it never finishes whatever it's doing. The difference between what stats.free-dc.org shows and what a look at the Projects tab in BOINC Manager shows is almost 5.5m for me I left it overnight - I was no more successful. It is what it is. Further updates are only adding 1000 credits at a time. The mention of 3 weeks might even be optimistic, given I'm still not running any more WCG yet. I daren't think how many tasks I have pending validation - likely in the 1000s of tasks, both pre and post their shutdown ID: 113348 · Rating: 0 · rate: / Reply Quote

Sid Celery Send message Joined: 11 Feb 08 Posts: 2583 Credit: 47,220,881 RAC: 193	Message 113349 - Posted: 29 Dec 2025, 14:08:57 UTC - in response to Message 113341. I managed to get the results page to load and can confirm that PV jail has fewer inmates. Used to have around 1400 tasks pending, now down to under 1200. Among them are still some that were reported back in August. I'm glad you've got some information. At least it is doing something, from what you report. ID: 113349 · Rating: 0 · rate: / Reply Quote

Garrulus glandarius Send message Joined: 25 Apr 25 Posts: 26 Credit: 3,390,940 RAC: 6,720	Message 113350 - Posted: 29 Dec 2025, 14:12:59 UTC - in response to Message 113349. I managed to get the results page to load and can confirm that PV jail has fewer inmates. Used to have around 1400 tasks pending, now down to under 1200. Among them are still some that were reported back in August. I'm glad you've got some information. At least it is doing something, from what you report. It doesn't seem to be constant though. I'm only crunching WCG on an 8-threaded phone CPU and the rate at which tasks are usually validated is close to the rate at which that single phone is reporting new results. I checked my contribution graph and there was a big spike a few days ago. Guess it was my turn to have a batch validated and since then other users are the lucky ones. At least I hope that's the case, otherwise validation would be stagnating overall. ID: 113350 · Rating: 0 · rate: / Reply Quote

Sid Celery Send message Joined: 11 Feb 08 Posts: 2583 Credit: 47,220,881 RAC: 193	Message 113363 - Posted: 30 Dec 2025, 21:21:44 UTC - in response to Message 113350. I managed to get the results page to load and can confirm that PV jail has fewer inmates. Used to have around 1400 tasks pending, now down to under 1200. Among them are still some that were reported back in August. I'm glad you've got some information. At least it is doing something, from what you report. It doesn't seem to be constant though. I'm only crunching WCG on an 8-threaded phone CPU and the rate at which tasks are usually validated is close to the rate at which that single phone is reporting new results. I checked my contribution graph and there was a big spike a few days ago. Guess it was my turn to have a batch validated and since then other users are the lucky ones. At least I hope that's the case, otherwise validation would be stagnating overall. I was the same, though. An initial huge jump in validations followed by just 1-2000 credits per day. I think it's the latter of your suggestions. Doing something, but nothing like quickly enough. ID: 113363 · Rating: 0 · rate: / Reply Quote

Bill Swisher Send message Joined: 10 Jun 13 Posts: 103 Credit: 67,322,882 RAC: 10,073	Message 113393 - Posted: 7 Jan 2026, 15:21:55 UTC What a week, and it's not even half over. 1. Monday I go to the grocery store. When I get back a computer, 12 core/24 thread, has turned itself off. I hit the power button and it doesn't make it to the login screen before it turns itself off. I haul it into the local fixit joint and they confirm my suspicion. The liquid cooler isn't cooling, physically broken, less than a year old no less. 2. I spend my winters in Arizona, I live in Alaska (2,410 miles/3,878 kilometers away). All the computers in Alaska seemed to have gone off-line. Luckily(?) there was a power hit and it restarted them. I manage to login to one and 3 of them are running but essentially locked up. I manage to get into them. One is running Einstein tasks and I spot one trying to grab 4+GB of memory, another only wanted 2+GB. So I suspend Einstein and lock it from downloading any new tasks (on all the computers I can get to). One computer is locked up trying to run Rosetta tasks, not the beta ones, and they've sucked up all the memory. Same process with Rosetta as with Einstein across the computers (I'll let the local ones keep running those projects since I can physically reach the power button on them). 3. This computer is chugging along doing WCG stuff, then I notice that there's a job for Arthritis running that's grabbed all 16 threads for itself, but only using one of them. All the other tasks are suspended waiting for a processor. Meaning the computer was only running at 6.25% of its capacity. I aborted that task. I'll drop a note over there, maybe somebody will notice. ID: 113393 · Rating: 0 · rate: / Reply Quote