@ORippler (Collaborator) commented on Dec 18, 2025

Spun up a script to investigate differences between backend samplers and the samplers that run inside llama-cli, outside of ggml's opset. Main insights from the investigation:

1. `std::uniform_real_distribution<double>` returns different values than `std::uniform_real_distribution<float>` when seeded with the same rng (see the first sketch after this list).
2. Apart from top-k not having to return sorted outputs, the main differences come from `soft_max` and `cumsum`, which behave differently in ggml than in the top-p/dist samplers because they are parallelized across cores/SMs.
3. Tried changing the dist-sampler formulation (threshold based on `cumsum(exp(logits))` vs. threshold based on `cumsum(probs)`); it did not reduce the differences much, so I did not push that code (see the second sketch below for the two formulations).
4. We use the unstable `std::partial_sort` in the llama.cpp samplers for sorting (third sketch below).
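
A minimal, self-contained sketch of point 1 (not code from llama.cpp): seeding two identical `std::mt19937` generators and drawing from a `float` vs. a `double` `std::uniform_real_distribution` generally yields different values, since the two specializations need a different number of random bits per sample.

```cpp
#include <cstdio>
#include <random>

int main() {
    // Two generators with identical seeds.
    std::mt19937 rng_f(42);
    std::mt19937 rng_d(42);

    std::uniform_real_distribution<float>  dist_f(0.0f, 1.0f);
    std::uniform_real_distribution<double> dist_d(0.0,  1.0);

    // The float and double distributions typically produce different values
    // and may consume a different number of raw 32-bit draws per sample
    // (roughly 24 vs 53 random bits are needed for the mantissa).
    for (int i = 0; i < 4; ++i) {
        printf("float: %.9f   double: %.17f\n", (double) dist_f(rng_f), dist_d(rng_d));
    }
    return 0;
}
```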
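
For point 3, a sketch of what the two threshold formulations look like; the function names and structure are illustrative only and do not mirror the actual dist-sampler code. Both variants are mathematically equivalent, so any differences between them can only come from floating-point rounding, which matches the observation that switching formulations changed little.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

// Variant A: threshold r in [0,1), compared against cumsum(probs).
static int pick_cumsum_probs(const std::vector<double> & e, double sum, double r) {
    double cum = 0.0;
    for (size_t i = 0; i < e.size(); ++i) {
        cum += e[i] / sum;           // cumsum of normalized probabilities
        if (cum >= r) return (int) i;
    }
    return (int) e.size() - 1;
}

// Variant B: threshold r * sum, compared against cumsum(exp(logits)).
static int pick_cumsum_exp(const std::vector<double> & e, double sum, double r) {
    const double threshold = r * sum;
    double cum = 0.0;
    for (size_t i = 0; i < e.size(); ++i) {
        cum += e[i];                 // cumsum of un-normalized exp(logits)
        if (cum >= threshold) return (int) i;
    }
    return (int) e.size() - 1;
}

int main() {
    const std::vector<float> logits = { 1.0f, 2.5f, 0.3f, 2.4f };

    // exp(logits - max) and its sum, shared by both variants.
    float max_logit = logits[0];
    for (float l : logits) max_logit = std::max(max_logit, l);
    std::vector<double> e;
    double sum = 0.0;
    for (float l : logits) { e.push_back(std::exp((double) (l - max_logit))); sum += e.back(); }

    std::mt19937 rng(1234);
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    const double r = dist(rng);

    printf("cumsum(probs):       %d\n", pick_cumsum_probs(e, sum, r));
    printf("cumsum(exp(logits)): %d\n", pick_cumsum_exp (e, sum, r));
    return 0;
}
```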
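
For point 4, a toy example of why the unstable `std::partial_sort` matters: the relative order of candidates with tied logits is unspecified, so the CPU-side order can legitimately differ from the order produced by a backend sort. The `Cand` struct here is hypothetical, not the llama.cpp candidate type.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

struct Cand {
    int   id;
    float logit;
};

int main() {
    std::vector<Cand> cands = {
        {0, 1.0f}, {1, 2.0f}, {2, 2.0f}, {3, 2.0f}, {4, 0.5f},
    };

    // Keep only the top-k by logit, descending. std::partial_sort is not
    // stable, so the relative order of the tied candidates (ids 1, 2, 3)
    // is unspecified and may differ between implementations.
    const int k = 3;
    std::partial_sort(cands.begin(), cands.begin() + k, cands.end(),
                      [](const Cand & a, const Cand & b) { return a.logit > b.logit; });

    for (int i = 0; i < k; ++i) {
        printf("id=%d logit=%.2f\n", cands[i].id, cands[i].logit);
    }
    return 0;
}
```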

After this PR, plus a fix for warmup currently advancing the rng of backend samplers by one extra call, backend-based sampling outputs are much closer to llama.cpp-based sampling.

Filed this as a separate PR since I'm not sure we want the vibe-coded script in main llama.cpp.

@ORippler force-pushed the osimons/gpu-sampling-equivalence-checks branch from 7ee23c0 to da27c9b on December 19, 2025 10:46