Fix race conditions in threadpool when dealing with dynamic/frequent n_threads changes #17748
Hi @max-krasnyansky, may I ask how to determine whether the updated test succeeded or failed? I've run the updated test on a Mac M4 Pro without the fix, but saw no hang.
For me it is crashing (segfault) because one or more threads would eventually get out of sync and start processing the graph while the cplan is getting updated. The race window is small, so you need more threads and more rounds (I used 10 threads with 1K rounds on M4-Pro). Here is an example on M4-Pro with the new test but without the fix:

The issue is kind of obvious when you consider that the worker thread can get preempted in between reading `n_graph` and reading `n_threads_cur`. It's possible to fix this by adding more logic (i.e. not bumping `n_graph` separately), but combining the two into a single atomic update is simpler.
Thanks for your clarification.
I'd say no need to reproduce the issue with the new test. Just see if this PR fixes your original issue.
Now I can reproduce the crash with your regression test.
Can you please confirm if this PR fixes the issue you were seeing before?
@ggerganov when you get the chance please ACK/NACK. |
Hi @max-krasnyansky, it makes little sense to just verify a particular model without a deep understanding of these corner cases. I'd prefer to get a full understanding of this issue.
Agreed! In our Hunyuan case, we got So we should make sure
@CISC @ggerganov @slaren |
@max-krasnyansky Slightly tangential (can't recall if we discussed this before), but on Mac with default Clang 17.0.0, the following command produces data race warnings when building with thread sanitizer enabled:

    # this is one of the CI tests
    make -j && bin/test-thread-safety "-hf" "ggml-org/models" "-hff" "tinyllamas/stories15M-q4_0.gguf" "-ngl" "99" "-p" "The meaning of life is" "-n" "128" "-c" "256" "-ub" "32" "-np" "4" "-t" "2"

I have been seeing these warnings for a long time and always ignored them because they are not reproducible on non-Apple machines with clang and gcc. Do you happen to have a clue what is happening here, and do you see this on your end too? Here is what I get:
Ah. This is kind of a funny case. This test uses lots of temporary threadpools that are allocated from different main/test threads. And this is the log I get:

It looks like the TSAN warnings come from that init path. We could perhaps wrap that init code. @ggerganov
Thanks for tracking it down. Btw, the metal error is already fixed on
I've seen that on and off before this change as well.
True, seems to be 100% consistent now though, failed here and all the runs on master so far. |
Ok. I'll give that a shot.
I would recommend running it on your own fork first, you'll get a swifter CI turnaround. |
…n_threads changes (ggml-org#17748)

* tests: update barrier test to check for race condition in active threads
* cpu: combine n_graph and n_threads into a single atomic update
* tests: add multi-graph test for test_barrier
The original discussion started in #17515
The short summary is that we have a race condition when the number of active threads changes rapidly while the worker threads are still in their hybrid polling loops.
I updated `test_barrier` to test for this scenario. There is an additional test in there now that flip-flops between doing `graph_compute` with 1 and N threads. Without the fix, this new test quickly and reliably fails on all platforms that I tested: Snapdragon Gen3/4/5 (Android), Mac M4-Pro, AMD Ryzen 9 (Linux).

See this comment for the original report and analysis of the end-to-end use cases that trigger this scenario:
#17515 (comment)
This PR combines `n_graph` and `n_threads_cur` (the number of active threads) into a single atomic update. I played with a bunch of ideas and this seems to be the cleanest/simplest way to ensure all threads see a consistent state without adding extra logic. Also worth noting that adding stricter memory ordering (i.e. instead of doing relaxed reads) is not sufficient, because the threads can get preempted in between the atomic reads and still end up with an inconsistent state.
Here is a quick test report from various systems:
AMD Ryzen 9 3950X (16-Cores) -- tested with and without OpenMP, with and without TSAN
Galaxy S24 Ultra (Gen3) -- no OpenMP, also tested Galaxy S25 (Gen4) and Gen5 device
Mac M4-Pro -- no OpenMP, with and without TSAN
Also tested all the usual stuff: `llama-cli` and `llama-bench` with various models and backends with partial offloads.

@DamonFool Please give this a shot on your setup.
@jeffbolznv @ggerganov