
@max-krasnyansky
Collaborator

The original discussion started in #17515

The short summary is that we have a race condition when the number of active threads changes rapidly while the worker threads are still in their hybrid polling loops.

I updated test_barrier to test for this scenario. There is an additional test in there now that flip-flops between doing graph_compute with 1 and N threads. Without the fix this new test quickly and reliably fails on all platforms that I tested: Snapdragon Gen3/4/5 (Android), Mac M4-Pro, AMD Ryzen 9 (Linux).
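
Roughly, the new test does something like this (a simplified sketch, not the actual test code; the graph and cplan setup are elided):

    // flip-flop between 1 and N threads so that worker threads are still in
    // their hybrid polling loops when the number of active threads changes
    for (int i = 0; i < n_rounds; i++) {
        cplan.n_threads = 1;
        ggml_graph_compute(graph, &cplan);   // single-threaded round
        cplan.n_threads = n_threads;
        ggml_graph_compute(graph, &cplan);   // back to N threads
    }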

See this comment for the original report and analysis of the end-to-end use-cases that trigger this scenario
#17515 (comment)

This PR combines n_graph and n_threads_cur (number of active threads) into a single atomic update.
I played with a bunch of ideas and this seems to be the cleanest/simplest way to ensure all threads see a consistent state without adding extra logic. Also worth noting that adding additional memory ordering restrictions (i.e. instead of doing relaxed reads) is not sufficient because the threads can get preempted in between the atomic reads and still end up with an inconsistent state.
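
To illustrate the approach, here is a minimal sketch of the packing idea (names and layout are illustrative -- the actual change lives in the ggml-cpu threadpool internals):

    #include <stdatomic.h>
    #include <stdint.h>

    // pack the graph sequence number and the active thread count into a
    // single 32-bit atomic so a worker can never observe one without the other
    static atomic_uint_least32_t graph_and_threads;

    // main thread: publish the graph id and its thread count in one store
    static void publish(uint32_t n_graph, uint32_t n_threads) {
        atomic_store_explicit(&graph_and_threads,
                              (n_graph << 16) | (n_threads & 0xffff),
                              memory_order_seq_cst);
    }

    // worker: a single relaxed load always yields a matched pair -- there is
    // no window between two separate reads for the thread to get preempted in
    static void observe(uint32_t * n_graph, uint32_t * n_threads) {
        uint32_t v = atomic_load_explicit(&graph_and_threads, memory_order_relaxed);
        *n_graph   = v >> 16;
        *n_threads = v & 0xffff;
    }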

Here is a quick test report from various systems:

AMD Ryzen 9 3950X (16-Cores) -- tested with and without OpenMP, with and without TSAN

$ ./build-amd64-omp/bin/test-barrier 16 1000
graph-compute with
 n_threads: 16
   n_nodes: 2000
  n_rounds: 1000
graph-compute took 4176811 usec 
 4176.81 usec per-iter
 2088.41 nsec per-node

graph-compute with
 n_threads: 16
   n_nodes: 4
  n_rounds: 100000

$ ./build-amd64/bin/test-barrier 16 1000
graph-compute with
 n_threads: 16
   n_nodes: 2000
  n_rounds: 1000
graph-compute took 3982746 usec 
 3982.75 usec per-iter
 1991.37 nsec per-node
 
graph-compute with
 n_threads: 16
   n_nodes: 4
  n_rounds: 100000

Galaxy S24 Ultra (Gen3) -- no OpenMP; also tested Galaxy S25 (Gen4) and a Gen5 device

~/src/llama.cpp-hexagon$ ./scripts/snapdragon/adb/run-tool.sh test-barrier 6 1000
...
graph-compute with
 n_threads: 6
   n_nodes: 2000
  n_rounds: 1000
graph-compute took 1507086 usec 
 1507.09 usec per-iter
 753.543 nsec per-node

graph-compute with
 n_threads: 6
   n_nodes: 4
  n_rounds: 100000

Mac M4-Pro -- no OpenMP, with and without TSAN

$ ./build-macos/bin/test-barrier 10 1000
graph-compute with
 n_threads: 10
   n_nodes: 2000
  n_rounds: 1000
graph-compute took 3080797 usec 
 3080.8 usec per-iter
 1540.4 nsec per-node

graph-compute with
 n_threads: 10
   n_nodes: 4
  n_rounds: 100000

Also tested all the usual stuff: llama-cli and llama-bench with various models and backends, with partial offloads.

@DamonFool
Please give this a shot on your setup.

@jeffbolznv @ggerganov

@DamonFool
Contributor

Without the fix this new test quickly and reliably fails on all platforms that I tested: Snapdragon Gen3/4/5 (Android), Mac M4-Pro, AMD Ryzen 9 (Linux).

Hi @max-krasnyansky, may I ask how I can determine whether the updated test succeeded or failed?
Does a failure also mean a hang?

I've run the updated test on a Mac M4 Pro without the fix, but saw no hang.

@max-krasnyansky
Collaborator Author

Without the fix this new test quickly and reliably fails on all platforms that I tested: Snapdragon Gen3/4/5 (Android), Mac M4-Pro, AMD Ryzen 9 (Linux).

Hi @max-krasnyansky, may I ask how I can determine whether the updated test succeeded or failed? Does a failure also mean a hang?

I've run the updated test on a Mac M4 Pro without the fix, but saw no hang.

For me it is crashing (segfault) because one or more threads would eventually get out of sync and start processing the graph while the cplan is getting updated. The race window is small, so you need more threads and more rounds (I used 10 threads with 1K rounds on M4-Pro).

Here is an example on M4-Pro with the new test but without the fix:

$ git log --oneline
41a2a6cfa (HEAD -> cpu-n_threads-race) tests: update barrier test to check for race condition in active threads
dea9ba27c (tag: b7261, upstream/master, upstream/HEAD, qcom/master, qcom/HEAD, origin/master, origin/HEAD) ggml-cpu: remove duplicate conditional check 'iid' (#17650)
c6d1a00aa Add a couple of file types to the text section (#17670)
424c57945 convert : support latest mistral-common (fix conversion with --mistral-format) (#17712)
...

$ cmake -D GGML_OPENMP=OFF -D GGML_SANITIZE_THREAD=OFF -D CMAKE_BUILD_TYPE=RelWithDebInfo -G Ninja -B build-macos

$ ./build-macos/bin/test-barrier 10 1000
graph-compute with
 n_threads: 10
   n_nodes: 2000
  n_rounds: 1000
graph-compute took 2753814 usec 
 2753.81 usec per-iter
 1376.91 nsec per-node
graph-compute with
 n_threads: 10
   n_nodes: 4
  n_rounds: 100000
Segmentation fault: 11  <<<<<<< 

The issue is kind of obvious when you consider that the worker thread can get preempted in between reading n_graph and n_threads_cur, and start running again after we processed a new graph with a single thread and started prepping the next graph (i.e. already bumped n_threads_cur).

It's possible to fix this by adding more logic (don't bump n_graph for the single-threaded case, etc.) but the fix I added is simple and robust -- graph N must be processed with M threads -- no matter when the worker threads get preempted, they will always observe a consistent state.
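
In pseudocode, the problematic interleaving looks roughly like this (illustrative only, with simplified names):

    // worker thread:
    g = load(n_graph);          // observes graph G
    // ... preempted here ...
    // meanwhile the main thread runs graph G with 1 thread, then preps
    // graph G+1 and bumps n_threads_cur back up to N
    t = load(n_threads_cur);    // observes the count intended for graph G+1
    // the worker now pairs graph G with the thread count of graph G+1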

@DamonFool
Contributor

For me it is crashing (segfault) because one or more threads would eventually get out of sync and start processing the graph while the cplan is getting updated. The race window is small, so you need more threads and more rounds (I used 10 threads with 1K rounds on M4-Pro).

Thanks for your clarification.
I'll try to reproduce the issue with your updated tests.

@max-krasnyansky
Collaborator Author

For me it is crashing (segfault) because one or more threads would eventually get out of sync and start processing the graph while the cplan is getting updated. The race window is small, so you need more threads and more rounds (I used 10 threads with 1K rounds on M4-Pro).

Thanks for your clarification. I'll try to reproduce the issue with your updated tests.

I'd say no need to reproduce the issue with the new test. Just see if this PR fixes your original issue.
As I mentioned, I tested it on Gen3 (S24 Ultra) with the new test and with the original Hexagon backend use-case.

@DamonFool
Contributor

I'd say no need to reproduce the issue with the new test.

I can now reproduce the crash with your regression test.
The crash seems to be another story, and I'll spend some time understanding the case this weekend.

@max-krasnyansky
Collaborator Author

I'd say no need to reproduce the issue with the new test.

I can now reproduce the crash with your regression test. The crash seems to be another story, and I'll spend some time understanding the case this weekend.

Can you please confirm if this PR fixes the issue you were seeing before?
I'm not quite sure why you seem to be ignoring this question and trying to reproduce the issue by using the test without the fix. This PR includes both the test and the fix; it passes the new (and the old) tests and resolves the race condition.

@max-krasnyansky
Collaborator Author

@ggerganov when you get the chance please ACK/NACK.
I got another PR that I want to layer on top with improvements to graph_plan.

@DamonFool
Contributor

I'm not quite sure why you seem to be ignoring this question and trying to reproduce the issue by using the test without the fix.

Hi @max-krasnyansky, it makes little sense to just verify a particular model without a deep understanding of these corner cases.

I'd prefer to get a full understanding of this issue.
The regression test is a good tool for debugging and analysis.

@DamonFool
Contributor

Also worth noting that adding additional memory ordering restrictions (i.e. instead of doing relaxed reads) is not sufficient

Agreed!

In our Hunyuan case, we got n_graph paired with an older n_threads_cur. ==> may hang
In your regression test, I also saw n_graph paired with a newer n_threads_cur. ==> may crash

So we should make sure n_graph and n_threads_cur are always matched.
And your solution is excellent! @max-krasnyansky
Thanks.

@max-krasnyansky
Collaborator Author

@CISC @ggerganov @slaren
Can one of you guys please approve so that we can merge?

@ggerganov
Member

ggerganov commented Dec 10, 2025

@max-krasnyansky Slightly tangential (can't recall if we discussed this before), but on Mac with default Clang 17.0.0, the following command produces data race warnings when building with thread sanitizer enabled:

# this is one of the CI tests
make -j && bin/test-thread-safety "-hf" "ggml-org/models" "-hff" "tinyllamas/stories15M-q4_0.gguf" "-ngl" "99" "-p" "The meaning of life is" "-n" "128" "-c" "256" "-ub" "32" "-np" "4" "-t" "2"

I have been seeing these warnings for a long time and always ignored them because these are not reproducible on non-Apple machines with clang and gcc. Do you happen to have a clue what is happening here and do you see this also on your end?

Here is what I get:

0.02.065.838 I llama_context:        CPU compute buffer size =     0.07 MiB
0.02.065.841 I llama_context: graph nodes  = 193
0.02.065.850 I llama_context: graph splits = 2
==================
WARNING: ThreadSanitizer: data race (pid=85440)
  Write of size 8 at 0x00012fce8070 by thread T17:
    #0 ggml_threadpool_new_impl ggml-cpu.c:3089 (libggml-cpu.0.9.4.dylib:arm64+0x2254c)
    #1 ggml_graph_compute ggml-cpu.c:3175 (libggml-cpu.0.9.4.dylib:arm64+0x24394)
    #2 ggml_backend_cpu_graph_compute(ggml_backend*, ggml_cgraph*) ggml-cpu.cpp:186 (libggml-cpu.0.9.4.dylib:arm64+0x39b00)
    #3 ggml_backend_graph_compute_async ggml-backend.cpp:359 (libggml-base.0.9.4.dylib:arm64+0xaaa30)
    #4 ggml_backend_sched_compute_splits(ggml_backend_sched*) ggml-backend.cpp:1575 (libggml-base.0.9.4.dylib:arm64+0xd152c)
    #5 ggml_backend_sched_compute_splits(ggml_backend_sched*) ggml-backend.cpp:1438 (libggml-base.0.9.4.dylib:arm64+0xccda0)
    #6 llama_context::graph_compute(ggml_cgraph*, bool) llama-context.cpp:1488 (libllama.0.0.7264.dylib:arm64+0x126aec)
    #7 llama_context::process_ubatch(llama_ubatch const&, llm_graph_type, llama_memory_context_i*, ggml_status&) llama-context.cpp:809 (libllama.0.0.7264.dylib:arm64+0x125ea8)
    #8 llama_context::process_ubatch(llama_ubatch const&, llm_graph_type, llama_memory_context_i*, ggml_status&) llama-context.cpp:757 (libllama.0.0.7264.dylib:arm64+0x125460)
    #9 llama_context::decode(llama_batch const&) llama-context.cpp:983 (libllama.0.0.7264.dylib:arm64+0x129eac)
    #10 main::$_0::operator()() const test-thread-safety.cpp:109 (test-thread-safety:arm64+0x100026894)
    #11 decltype(std::declval<main::$_0>()()) std::__1::__invoke[abi:ne200100]<main::$_0>(main::$_0&&) invoke.h:179 (test-thread-safety:arm64+0x100026134)
    #12 void std::__1::__thread_execute[abi:ne200100]<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct>>, main::$_0>(std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct>>, main::$_0>&, std::__1::__tuple_indices<>) thread.h:205 (test-thread-safety:arm64+0x100025fb0)
    #13 void* std::__1::__thread_proxy[abi:ne200100]<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct>>, main::$_0>>(void*) thread.h:214 (test-thread-safety:arm64+0x100025200)
  Previous write of size 8 at 0x00012fce8070 by thread T11:
    #0 ggml_threadpool_new_impl ggml-cpu.c:3089 (libggml-cpu.0.9.4.dylib:arm64+0x2254c)
    #1 ggml_graph_compute ggml-cpu.c:3175 (libggml-cpu.0.9.4.dylib:arm64+0x24394)
    #2 ggml_backend_cpu_graph_compute(ggml_backend*, ggml_cgraph*) ggml-cpu.cpp:186 (libggml-cpu.0.9.4.dylib:arm64+0x39b00)
    #3 ggml_backend_graph_compute_async ggml-backend.cpp:359 (libggml-base.0.9.4.dylib:arm64+0xaaa30)
    #4 ggml_backend_sched_compute_splits(ggml_backend_sched*) ggml-backend.cpp:1575 (libggml-base.0.9.4.dylib:arm64+0xd152c)
    #5 ggml_backend_sched_compute_splits(ggml_backend_sched*) ggml-backend.cpp:1438 (libggml-base.0.9.4.dylib:arm64+0xccda0)
    #6 llama_context::graph_compute(ggml_cgraph*, bool) llama-context.cpp:1488 (libllama.0.0.7264.dylib:arm64+0x126aec)
    #7 llama_context::process_ubatch(llama_ubatch const&, llm_graph_type, llama_memory_context_i*, ggml_status&) llama-context.cpp:809 (libllama.0.0.7264.dylib:arm64+0x125ea8)
    #8 llama_context::process_ubatch(llama_ubatch const&, llm_graph_type, llama_memory_context_i*, ggml_status&) llama-context.cpp:757 (libllama.0.0.7264.dylib:arm64+0x125460)
    #9 llama_context::decode(llama_batch const&) llama-context.cpp:983 (libllama.0.0.7264.dylib:arm64+0x129eac)
    #10 main::$_0::operator()() const test-thread-safety.cpp:109 (test-thread-safety:arm64+0x100026894)
    #11 decltype(std::declval<main::$_0>()()) std::__1::__invoke[abi:ne200100]<main::$_0>(main::$_0&&) invoke.h:179 (test-thread-safety:arm64+0x100026134)
    #12 void std::__1::__thread_execute[abi:ne200100]<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct>>, main::$_0>(std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct>>, main::$_0>&, std::__1::__tuple_indices<>) thread.h:205 (test-thread-safety:arm64+0x100025fb0)
    #13 void* std::__1::__thread_proxy[abi:ne200100]<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct>>, main::$_0>>(void*) thread.h:214 (test-thread-safety:arm64+0x100025200)
  Thread T17 (tid=13432593, running) created by main thread at:
    #0 pthread_create <null> (libclang_rt.tsan_osx_dynamic.dylib:arm64e+0x2f708)
    #1 std::__1::__libcpp_thread_create[abi:ne200100](_opaque_pthread_t**, void* (*)(void*), void*) pthread.h:182 (test-thread-safety:arm64+0x10002508c)
    #2 std::__1::thread::thread<main::$_0, 0>(main::$_0&&) thread.h:224 (test-thread-safety:arm64+0x100024d58)
    #3 std::__1::thread::thread<main::$_0, 0>(main::$_0&&) thread.h:219 (test-thread-safety:arm64+0x100024ba4)
    #4 void std::__1::allocator<std::__1::thread>::construct[abi:ne200100]<std::__1::thread, main::$_0>(std::__1::thread*, main::$_0&&) allocator.h:153 (test-thread-safety:arm64+0x100024b04)
    #5 void std::__1::allocator_traits<std::__1::allocator<std::__1::thread>>::construct[abi:ne200100]<std::__1::thread, main::$_0, 0>(std::__1::allocator<std::__1::thread>&, std::__1::thread*, main::$_0&&) allocator_traits.h:309 (test-thread-safety:arm64+0x10002471c)
    #6 void std::__1::vector<std::__1::thread, std::__1::allocator<std::__1::thread>>::__construct_one_at_end[abi:ne200100]<main::$_0>(main::$_0&&) vector.h:742 (test-thread-safety:arm64+0x100024264)
    #7 std::__1::thread& std::__1::vector<std::__1::thread, std::__1::allocator<std::__1::thread>>::emplace_back<main::$_0>(main::$_0&&) vector.h:1133 (test-thread-safety:arm64+0x100004754)
    #8 main test-thread-safety.cpp:83 (test-thread-safety:arm64+0x100003734)
  Thread T11 (tid=13432587, running) created by main thread at:
    #0 pthread_create <null> (libclang_rt.tsan_osx_dynamic.dylib:arm64e+0x2f708)
    #1 std::__1::__libcpp_thread_create[abi:ne200100](_opaque_pthread_t**, void* (*)(void*), void*) pthread.h:182 (test-thread-safety:arm64+0x10002508c)
    #2 std::__1::thread::thread<main::$_0, 0>(main::$_0&&) thread.h:224 (test-thread-safety:arm64+0x100024d58)
    #3 std::__1::thread::thread<main::$_0, 0>(main::$_0&&) thread.h:219 (test-thread-safety:arm64+0x100024ba4)
    #4 void std::__1::allocator<std::__1::thread>::construct[abi:ne200100]<std::__1::thread, main::$_0>(std::__1::thread*, main::$_0&&) allocator.h:153 (test-thread-safety:arm64+0x100024b04)
    #5 void std::__1::allocator_traits<std::__1::allocator<std::__1::thread>>::construct[abi:ne200100]<std::__1::thread, main::$_0, 0>(std::__1::allocator<std::__1::thread>&, std::__1::thread*, main::$_0&&) allocator_traits.h:309 (test-thread-safety:arm64+0x10002471c)
    #6 void std::__1::vector<std::__1::thread, std::__1::allocator<std::__1::thread>>::__construct_one_at_end[abi:ne200100]<main::$_0>(main::$_0&&) vector.h:742 (test-thread-safety:arm64+0x100024264)
    #7 std::__1::thread& std::__1::vector<std::__1::thread, std::__1::allocator<std::__1::thread>>::emplace_back<main::$_0>(main::$_0&&) vector.h:1133 (test-thread-safety:arm64+0x100004754)
    #8 main test-thread-safety.cpp:83 (test-thread-safety:arm64+0x100003734)
SUMMARY: ThreadSanitizer: data race ggml-cpu.c:3089 in ggml_threadpool_new_impl
==================
==================
WARNING: ThreadSanitizer: data race (pid=85440)
  Write of size 8 at 0x00012fce8078 by thread T17:
    #0 ggml_threadpool_new_impl ggml-cpu.c:3090 (libggml-cpu.0.9.4.dylib:arm64+0x225ec)
    #1 ggml_graph_compute ggml-cpu.c:3175 (libggml-cpu.0.9.4.dylib:arm64+0x24394)
    #2 ggml_backend_cpu_graph_compute(ggml_backend*, ggml_cgraph*) ggml-cpu.cpp:186 (libggml-cpu.0.9.4.dylib:arm64+0x39b00)
    #3 ggml_backend_graph_compute_async ggml-backend.cpp:359 (libggml-base.0.9.4.dylib:arm64+0xaaa30)
    #4 ggml_backend_sched_compute_splits(ggml_backend_sched*) ggml-backend.cpp:1575 (libggml-base.0.9.4.dylib:arm64+0xd152c)
    #5 ggml_backend_sched_compute_splits(ggml_backend_sched*) ggml-backend.cpp:1438 (libggml-base.0.9.4.dylib:arm64+0xccda0)
    #6 llama_context::graph_compute(ggml_cgraph*, bool) llama-context.cpp:1488 (libllama.0.0.7264.dylib:arm64+0x126aec)
    #7 llama_context::process_ubatch(llama_ubatch const&, llm_graph_type, llama_memory_context_i*, ggml_status&) llama-context.cpp:809 (libllama.0.0.7264.dylib:arm64+0x125ea8)
    #8 llama_context::process_ubatch(llama_ubatch const&, llm_graph_type, llama_memory_context_i*, ggml_status&) llama-context.cpp:757 (libllama.0.0.7264.dylib:arm64+0x125460)
    #9 llama_context::decode(llama_batch const&) llama-context.cpp:983 (libllama.0.0.7264.dylib:arm64+0x129eac)
    #10 main::$_0::operator()() const test-thread-safety.cpp:109 (test-thread-safety:arm64+0x100026894)
    #11 decltype(std::declval<main::$_0>()()) std::__1::__invoke[abi:ne200100]<main::$_0>(main::$_0&&) invoke.h:179 (test-thread-safety:arm64+0x100026134)
    #12 void std::__1::__thread_execute[abi:ne200100]<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct>>, main::$_0>(std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct>>, main::$_0>&, std::__1::__tuple_indices<>) thread.h:205 (test-thread-safety:arm64+0x100025fb0)
    #13 void* std::__1::__thread_proxy[abi:ne200100]<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct>>, main::$_0>>(void*) thread.h:214 (test-thread-safety:arm64+0x100025200)
  Previous write of size 8 at 0x00012fce8078 by thread T11:
    #0 ggml_threadpool_new_impl ggml-cpu.c:3090 (libggml-cpu.0.9.4.dylib:arm64+0x225ec)
    #1 ggml_graph_compute ggml-cpu.c:3175 (libggml-cpu.0.9.4.dylib:arm64+0x24394)
    #2 ggml_backend_cpu_graph_compute(ggml_backend*, ggml_cgraph*) ggml-cpu.cpp:186 (libggml-cpu.0.9.4.dylib:arm64+0x39b00)
    #3 ggml_backend_graph_compute_async ggml-backend.cpp:359 (libggml-base.0.9.4.dylib:arm64+0xaaa30)
    #4 ggml_backend_sched_compute_splits(ggml_backend_sched*) ggml-backend.cpp:1575 (libggml-base.0.9.4.dylib:arm64+0xd152c)
    #5 ggml_backend_sched_compute_splits(ggml_backend_sched*) ggml-backend.cpp:1438 (libggml-base.0.9.4.dylib:arm64+0xccda0)
    #6 llama_context::graph_compute(ggml_cgraph*, bool) llama-context.cpp:1488 (libllama.0.0.7264.dylib:arm64+0x126aec)
    #7 llama_context::process_ubatch(llama_ubatch const&, llm_graph_type, llama_memory_context_i*, ggml_status&) llama-context.cpp:809 (libllama.0.0.7264.dylib:arm64+0x125ea8)
    #8 llama_context::process_ubatch(llama_ubatch const&, llm_graph_type, llama_memory_context_i*, ggml_status&) llama-context.cpp:757 (libllama.0.0.7264.dylib:arm64+0x125460)
    #9 llama_context::decode(llama_batch const&) llama-context.cpp:983 (libllama.0.0.7264.dylib:arm64+0x129eac)
    #10 main::$_0::operator()() const test-thread-safety.cpp:109 (test-thread-safety:arm64+0x100026894)
    #11 decltype(std::declval<main::$_0>()()) std::__1::__invoke[abi:ne200100]<main::$_0>(main::$_0&&) invoke.h:179 (test-thread-safety:arm64+0x100026134)
    #12 void std::__1::__thread_execute[abi:ne200100]<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct>>, main::$_0>(std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct>>, main::$_0>&, std::__1::__tuple_indices<>) thread.h:205 (test-thread-safety:arm64+0x100025fb0)
    #13 void* std::__1::__thread_proxy[abi:ne200100]<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct>>, main::$_0>>(void*) thread.h:214 (test-thread-safety:arm64+0x100025200)
  Thread T17 (tid=13432593, running) created by main thread at:
    #0 pthread_create <null> (libclang_rt.tsan_osx_dynamic.dylib:arm64e+0x2f708)
    #1 std::__1::__libcpp_thread_create[abi:ne200100](_opaque_pthread_t**, void* (*)(void*), void*) pthread.h:182 (test-thread-safety:arm64+0x10002508c)
    #2 std::__1::thread::thread<main::$_0, 0>(main::$_0&&) thread.h:224 (test-thread-safety:arm64+0x100024d58)
    #3 std::__1::thread::thread<main::$_0, 0>(main::$_0&&) thread.h:219 (test-thread-safety:arm64+0x100024ba4)
    #4 void std::__1::allocator<std::__1::thread>::construct[abi:ne200100]<std::__1::thread, main::$_0>(std::__1::thread*, main::$_0&&) allocator.h:153 (test-thread-safety:arm64+0x100024b04)
    #5 void std::__1::allocator_traits<std::__1::allocator<std::__1::thread>>::construct[abi:ne200100]<std::__1::thread, main::$_0, 0>(std::__1::allocator<std::__1::thread>&, std::__1::thread*, main::$_0&&) allocator_traits.h:309 (test-thread-safety:arm64+0x10002471c)
    #6 void std::__1::vector<std::__1::thread, std::__1::allocator<std::__1::thread>>::__construct_one_at_end[abi:ne200100]<main::$_0>(main::$_0&&) vector.h:742 (test-thread-safety:arm64+0x100024264)
    #7 std::__1::thread& std::__1::vector<std::__1::thread, std::__1::allocator<std::__1::thread>>::emplace_back<main::$_0>(main::$_0&&) vector.h:1133 (test-thread-safety:arm64+0x100004754)
    #8 main test-thread-safety.cpp:83 (test-thread-safety:arm64+0x100003734)
  Thread T11 (tid=13432587, running) created by main thread at:
    #0 pthread_create <null> (libclang_rt.tsan_osx_dynamic.dylib:arm64e+0x2f708)
    #1 std::__1::__libcpp_thread_create[abi:ne200100](_opaque_pthread_t**, void* (*)(void*), void*) pthread.h:182 (test-thread-safety:arm64+0x10002508c)
    #2 std::__1::thread::thread<main::$_0, 0>(main::$_0&&) thread.h:224 (test-thread-safety:arm64+0x100024d58)
    #3 std::__1::thread::thread<main::$_0, 0>(main::$_0&&) thread.h:219 (test-thread-safety:arm64+0x100024ba4)
    #4 void std::__1::allocator<std::__1::thread>::construct[abi:ne200100]<std::__1::thread, main::$_0>(std::__1::thread*, main::$_0&&) allocator.h:153 (test-thread-safety:arm64+0x100024b04)
    #5 void std::__1::allocator_traits<std::__1::allocator<std::__1::thread>>::construct[abi:ne200100]<std::__1::thread, main::$_0, 0>(std::__1::allocator<std::__1::thread>&, std::__1::thread*, main::$_0&&) allocator_traits.h:309 (test-thread-safety:arm64+0x10002471c)
    #6 void std::__1::vector<std::__1::thread, std::__1::allocator<std::__1::thread>>::__construct_one_at_end[abi:ne200100]<main::$_0>(main::$_0&&) vector.h:742 (test-thread-safety:arm64+0x100024264)
    #7 std::__1::thread& std::__1::vector<std::__1::thread, std::__1::allocator<std::__1::thread>>::emplace_back<main::$_0>(main::$_0&&) vector.h:1133 (test-thread-safety:arm64+0x100004754)
    #8 main test-thread-safety.cpp:83 (test-thread-safety:arm64+0x100003734)
SUMMARY: ThreadSanitizer: data race ggml-cpu.c:3090 in ggml_threadpool_new_impl
==================
==================
WARNING: ThreadSanitizer: data race (pid=85440)
  Write of size 8 at 0x00012fce8150 by thread T17:
    #0 ggml_threadpool_new_impl ggml-cpu.c:3098 (libggml-cpu.0.9.4.dylib:arm64+0x22adc)
    #1 ggml_graph_compute ggml-cpu.c:3175 (libggml-cpu.0.9.4.dylib:arm64+0x24394)
    #2 ggml_backend_cpu_graph_compute(ggml_backend*, ggml_cgraph*) ggml-cpu.cpp:186 (libggml-cpu.0.9.4.dylib:arm64+0x39b00)
    #3 ggml_backend_graph_compute_async ggml-backend.cpp:359 (libggml-base.0.9.4.dylib:arm64+0xaaa30)
    #4 ggml_backend_sched_compute_splits(ggml_backend_sched*) ggml-backend.cpp:1575 (libggml-base.0.9.4.dylib:arm64+0xd152c)
    #5 ggml_backend_sched_compute_splits(ggml_backend_sched*) ggml-backend.cpp:1438 (libggml-base.0.9.4.dylib:arm64+0xccda0)
    #6 llama_context::graph_compute(ggml_cgraph*, bool) llama-context.cpp:1488 (libllama.0.0.7264.dylib:arm64+0x126aec)
    #7 llama_context::process_ubatch(llama_ubatch const&, llm_graph_type, llama_memory_context_i*, ggml_status&) llama-context.cpp:809 (libllama.0.0.7264.dylib:arm64+0x125ea8)
    #8 llama_context::process_ubatch(llama_ubatch const&, llm_graph_type, llama_memory_context_i*, ggml_status&) llama-context.cpp:757 (libllama.0.0.7264.dylib:arm64+0x125460)
    #9 llama_context::decode(llama_batch const&) llama-context.cpp:983 (libllama.0.0.7264.dylib:arm64+0x129eac)
    #10 main::$_0::operator()() const test-thread-safety.cpp:109 (test-thread-safety:arm64+0x100026894)
    #11 decltype(std::declval<main::$_0>()()) std::__1::__invoke[abi:ne200100]<main::$_0>(main::$_0&&) invoke.h:179 (test-thread-safety:arm64+0x100026134)
    #12 void std::__1::__thread_execute[abi:ne200100]<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct>>, main::$_0>(std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct>>, main::$_0>&, std::__1::__tuple_indices<>) thread.h:205 (test-thread-safety:arm64+0x100025fb0)
    #13 void* std::__1::__thread_proxy[abi:ne200100]<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct>>, main::$_0>>(void*) thread.h:214 (test-thread-safety:arm64+0x100025200)
  Previous write of size 8 at 0x00012fce8150 by thread T11:
    #0 ggml_threadpool_new_impl ggml-cpu.c:3098 (libggml-cpu.0.9.4.dylib:arm64+0x22adc)
    #1 ggml_graph_compute ggml-cpu.c:3175 (libggml-cpu.0.9.4.dylib:arm64+0x24394)
    #2 ggml_backend_cpu_graph_compute(ggml_backend*, ggml_cgraph*) ggml-cpu.cpp:186 (libggml-cpu.0.9.4.dylib:arm64+0x39b00)
    #3 ggml_backend_graph_compute_async ggml-backend.cpp:359 (libggml-base.0.9.4.dylib:arm64+0xaaa30)
    #4 ggml_backend_sched_compute_splits(ggml_backend_sched*) ggml-backend.cpp:1575 (libggml-base.0.9.4.dylib:arm64+0xd152c)
    #5 ggml_backend_sched_compute_splits(ggml_backend_sched*) ggml-backend.cpp:1438 (libggml-base.0.9.4.dylib:arm64+0xccda0)
    #6 llama_context::graph_compute(ggml_cgraph*, bool) llama-context.cpp:1488 (libllama.0.0.7264.dylib:arm64+0x126aec)
    #7 llama_context::process_ubatch(llama_ubatch const&, llm_graph_type, llama_memory_context_i*, ggml_status&) llama-context.cpp:809 (libllama.0.0.7264.dylib:arm64+0x125ea8)
    #8 llama_context::process_ubatch(llama_ubatch const&, llm_graph_type, llama_memory_context_i*, ggml_status&) llama-context.cpp:757 (libllama.0.0.7264.dylib:arm64+0x125460)
    #9 llama_context::decode(llama_batch const&) llama-context.cpp:983 (libllama.0.0.7264.dylib:arm64+0x129eac)
    #10 main::$_0::operator()() const test-thread-safety.cpp:109 (test-thread-safety:arm64+0x100026894)
    #11 decltype(std::declval<main::$_0>()()) std::__1::__invoke[abi:ne200100]<main::$_0>(main::$_0&&) invoke.h:179 (test-thread-safety:arm64+0x100026134)
    #12 void std::__1::__thread_execute[abi:ne200100]<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct>>, main::$_0>(std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct>>, main::$_0>&, std::__1::__tuple_indices<>) thread.h:205 (test-thread-safety:arm64+0x100025fb0)
    #13 void* std::__1::__thread_proxy[abi:ne200100]<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct>>, main::$_0>>(void*) thread.h:214 (test-thread-safety:arm64+0x100025200)
  Thread T17 (tid=13432593, running) created by main thread at:
    #0 pthread_create <null> (libclang_rt.tsan_osx_dynamic.dylib:arm64e+0x2f708)
    #1 std::__1::__libcpp_thread_create[abi:ne200100](_opaque_pthread_t**, void* (*)(void*), void*) pthread.h:182 (test-thread-safety:arm64+0x10002508c)
    #2 std::__1::thread::thread<main::$_0, 0>(main::$_0&&) thread.h:224 (test-thread-safety:arm64+0x100024d58)
    #3 std::__1::thread::thread<main::$_0, 0>(main::$_0&&) thread.h:219 (test-thread-safety:arm64+0x100024ba4)
    #4 void std::__1::allocator<std::__1::thread>::construct[abi:ne200100]<std::__1::thread, main::$_0>(std::__1::thread*, main::$_0&&) allocator.h:153 (test-thread-safety:arm64+0x100024b04)
    #5 void std::__1::allocator_traits<std::__1::allocator<std::__1::thread>>::construct[abi:ne200100]<std::__1::thread, main::$_0, 0>(std::__1::allocator<std::__1::thread>&, std::__1::thread*, main::$_0&&) allocator_traits.h:309 (test-thread-safety:arm64+0x10002471c)
    #6 void std::__1::vector<std::__1::thread, std::__1::allocator<std::__1::thread>>::__construct_one_at_end[abi:ne200100]<main::$_0>(main::$_0&&) vector.h:742 (test-thread-safety:arm64+0x100024264)
    #7 std::__1::thread& std::__1::vector<std::__1::thread, std::__1::allocator<std::__1::thread>>::emplace_back<main::$_0>(main::$_0&&) vector.h:1133 (test-thread-safety:arm64+0x100004754)
    #8 main test-thread-safety.cpp:83 (test-thread-safety:arm64+0x100003734)
  Thread T11 (tid=13432587, running) created by main thread at:
    #0 pthread_create <null> (libclang_rt.tsan_osx_dynamic.dylib:arm64e+0x2f708)
    #1 std::__1::__libcpp_thread_create[abi:ne200100](_opaque_pthread_t**, void* (*)(void*), void*) pthread.h:182 (test-thread-safety:arm64+0x10002508c)
    #2 std::__1::thread::thread<main::$_0, 0>(main::$_0&&) thread.h:224 (test-thread-safety:arm64+0x100024d58)
    #3 std::__1::thread::thread<main::$_0, 0>(main::$_0&&) thread.h:219 (test-thread-safety:arm64+0x100024ba4)
    #4 void std::__1::allocator<std::__1::thread>::construct[abi:ne200100]<std::__1::thread, main::$_0>(std::__1::thread*, main::$_0&&) allocator.h:153 (test-thread-safety:arm64+0x100024b04)
    #5 void std::__1::allocator_traits<std::__1::allocator<std::__1::thread>>::construct[abi:ne200100]<std::__1::thread, main::$_0, 0>(std::__1::allocator<std::__1::thread>&, std::__1::thread*, main::$_0&&) allocator_traits.h:309 (test-thread-safety:arm64+0x10002471c)
    #6 void std::__1::vector<std::__1::thread, std::__1::allocator<std::__1::thread>>::__construct_one_at_end[abi:ne200100]<main::$_0>(main::$_0&&) vector.h:742 (test-thread-safety:arm64+0x100024264)
    #7 std::__1::thread& std::__1::vector<std::__1::thread, std::__1::allocator<std::__1::thread>>::emplace_back<main::$_0>(main::$_0&&) vector.h:1133 (test-thread-safety:arm64+0x100004754)
    #8 main test-thread-safety.cpp:83 (test-thread-safety:arm64+0x100003734)
SUMMARY: ThreadSanitizer: data race ggml-cpu.c:3098 in ggml_threadpool_new_impl
==================

...

@max-krasnyansky
Collaborator Author

@max-krasnyansky Slightly tangential (can't recall if we discussed this before), but on Mac with default Clang 17.0.0, the following command produces data race warnings when building with thread sanitizer enabled:

# this is one of the CI tests
make -j && bin/test-thread-safety "-hf" "ggml-org/models" "-hff" "tinyllamas/stories15M-q4_0.gguf" "-ngl" "99" "-p" "The meaning of life is" "-n" "128" "-c" "256" "-ub" "32" "-np" "4" "-t" "2"

I have been seeing these warnings for a long time and always ignored them because these are not reproducible on non-Apple machines with clang and gcc. Do you happen to have a clue what is happening here and do you see this also on your end?

Ah. This is kind of a funny case.
The code that triggers this warning is:

    ggml/src/ggml-cpu/ggml-cpu.c:3094
    struct ggml_threadpool * threadpool =
        ggml_aligned_malloc(sizeof(struct ggml_threadpool));
    {
        threadpool->cgraph           = cgraph;     <<<<< this is 3094
        threadpool->cplan            = cplan;

This test uses lots of temporary threadpools that are allocated from different main/test threads.
Oftentimes the malloc returns the same address for the new pool.
I instrumented the code with

    uint64_t tid;
    pthread_threadid_np(NULL, &tid); // macOS-specific: id of the calling thread
    fprintf(stderr, "new threadpool %p (from %lu)\n", (void *) threadpool, (unsigned long) tid);

And this is the log I get

new threadpool 0x12f504000 (from 113878)
new threadpool 0x12f504000 (from 113885)  <<< same pointer reused
new threadpool 0x11cbf8000 (from 113887)
new threadpool 0x11cbf8000 (from 113886)  <<< same pointer reused
new threadpool 0x12f570000 (from 113881)
new threadpool 0x12f58c000 (from 113884)
new threadpool 0x12f598000 (from 113882)
new threadpool 0x12f5e4000 (from 113879)
new threadpool 0x12f5e4000 (from 113888)  <<< same pointer reused
new threadpool 0x12f5e4000 (from 113887)  <<< same pointer reused
==================
WARNING: ThreadSanitizer: data race (pid=7345)
  Write of size 8 at 0x00012f5e4070 by thread T17:                     << notice the address starting with 0x12f5e4000 
    #0 ggml_threadpool_new_impl ggml-cpu.c:3094 (libggml-cpu.0.9.4.dylib:arm64+0x62e0)
    ...
  Previous write of size 8 at 0x00012f5e4070 by thread T18:     << notice the address starting with 0x12f5e4000
    #0 ggml_threadpool_new_impl ggml-cpu.c:3094 (libggml-cpu.0.9.4.dylib:arm64+0x62e0)
    ...

It looks like the malloc is kind of transparent to TSAN (it's really a mempool internally anyway).
So from TSAN's perspective two different threads wrote to the same address without any synchronization.
I.e. those main/test threads are not sharing anything relevant and not synchronizing in our code; as soon as one thread releases the threadpool, another one could pick up that same pointer. One would think that the malloc itself is of course synchronizing between threads, but perhaps the TSAN instrumentation is disabled for it?

We could perhaps wrap that init code in __attribute__((no_sanitize("thread"))).
But it does seem like a TSAN issue, and we might end up screwing up other checks because n_graph and n_barrier are initialized in there as well (note that those do not trigger a warning in this case because they are explicitly marked as atomic, i.e. writable from different threads).
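
For reference, the suppression would look something like this (a sketch only; the macro and helper names are illustrative, not actual ggml code):

    // build the plain (non-atomic) field init without TSAN instrumentation
    #if defined(__has_feature)
    #  if __has_feature(thread_sanitizer)
    #    define NO_TSAN __attribute__((no_sanitize("thread")))
    #  endif
    #endif
    #ifndef NO_TSAN
    #  define NO_TSAN
    #endif

    NO_TSAN static void threadpool_init_plain(struct ggml_threadpool * tp,
                                              struct ggml_cgraph * cgraph,
                                              struct ggml_cplan * cplan) {
        tp->cgraph = cgraph;  // the non-atomic writes that TSAN flags above
        tp->cplan  = cplan;
    }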

@ggerganov
I'm going to merge this PR; let me know if you want to file an Issue to track this one.
Keeping TSAN output clean is useful, otherwise people start ignoring other stuff.
For instance, on my M4 Pro the very same test also produces this warning:

WARNING: ThreadSanitizer: data race (pid=7345)
  Write of size 8 at 0x00010b51a2b8 by thread T9:
    #0 ggml_metal_pipeline_set_smem ggml-metal-device.m:133 (libggml-metal.0.9.4.dylib:arm64+0x5368)
    #1 ggml_metal_library_get_pipeline_norm ggml-metal-device.cpp:1444 (libggml-metal.0.9.4.dylib:arm64+0xc0d4)
    ....
  Previous write of size 8 at 0x00010b51a2b8 by thread T7:
    #0 ggml_metal_pipeline_set_smem ggml-metal-device.m:133 (libggml-metal.0.9.4.dylib:arm64+0x5368)
    #1 ggml_metal_library_get_pipeline_norm ggml-metal-device.cpp:1444 (libggml-metal.0.9.4.dylib:arm64+0xc0d4)
   ...

@max-krasnyansky max-krasnyansky merged commit e1f4921 into ggml-org:master Dec 10, 2025
75 of 80 checks passed
@ggerganov
Member

Thanks for tracking it down. Btw, the metal error is already fixed on master.

@CISC
Collaborator

CISC commented Dec 10, 2025

test-barrier times out on macOS-latest-cmake-x64

@max-krasnyansky
Collaborator Author

test-barrier times out on macOS-latest-cmake-x64

I've seen that on and off before this change as well.
Not sure how to hunt that down (i.e. I don't have any older Macs here and doing that via CI is painful).

@max-krasnyansky max-krasnyansky deleted the cpu-n_threads-race branch December 10, 2025 21:56
@CISC
Collaborator

CISC commented Dec 10, 2025

test-barrier times out on macOS-latest-cmake-x64

I've seen that on and off before this change as well. Not sure how to hunt that down (i.e. I don't have any older Macs here and doing that via CI is painful).

True, seems to be 100% consistent now though, failed here and all the runs on master so far.

@max-krasnyansky
Collaborator Author

test-barrier times out on macOS-latest-cmake-x64

I've seen that on and off before this change as well. Not sure how to hunt that down (i.e. I don't have any older Macs here and doing that via CI is painful).

True, seems to be 100% consistent now though, failed here and all the runs on master so far.

Ok. I'll give that a shot.
I'll start a dummy PR, instrument the code a bit, and see if I can learn anything interesting from the CI output.
Will probably need to hack the CI to just run one job.

@CISC
Collaborator

CISC commented Dec 10, 2025

True, seems to be 100% consistent now though, failed here and all the runs on master so far.

Ok. I'll give that a shot. I'll start a dummy PR, instrument the code a bit, and see if I can learn anything interesting from the CI output. Will probably need to hack the CI to just run one job.

I would recommend running it on your own fork first, you'll get a swifter CI turnaround.

Ethan-a2 pushed a commit to Ethan-a2/llama.cpp that referenced this pull request Dec 12, 2025
…n_threads changes (ggml-org#17748)

* tests: update barrier test to check for race condition in active threads

* cpu: combine n_graph and n_threads into a single atomic update

* tests: add multi-graph test for test_barrier
wangqi pushed a commit to wangqi/llama.cpp that referenced this pull request Dec 14, 2025
Notable changes:
- Fix race conditions in threadpool (ggml-org#17748)
- New CLI experience (ggml-org#17824)
- Vision model improvements (clip refactor, new models)
- Performance fixes (CUDA MMA, Vulkan improvements)
- tools/main renamed to tools/completion

Conflict resolution:
- ggml-cpu.c: Use new threadpool->n_threads API (replaces n_threads_max),
  keep warning suppressed to reduce log noise

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>

Labels

ggml: changes relating to the ggml tensor library for machine learning
testing: Everything test related
