Osimons/gpu sampling equivalence checks #2
Spun up a script to investigate differences between backend samplers and the samplers running inside llama-cli, i.e. outside of ggml's opset. Main investigation insights (short illustrative sketches for each point follow below):
- `std::uniform_real_distribution<double>` returns different values than `std::uniform_real_distribution<float>` when seeded with the same rng.
- With `top_k` not having to return sorted outputs, the main differences come from `soft_max` and `cumsum`, which behave differently in ggml than in the top-p/dist samplers because they are parallelized across cores/SMs.
- Thresholding on `cumsum(exp(logits))` vs. on `cumsum(probs)` did not yield that many differences, so I did not push that code.
- The llama.cpp samplers use `std::partial_sort` for sorting.

After this PR, plus fixing warmup (which currently advances the rng for backend samplers by one call), outputs from backend-based sampling are much closer to llama.cpp-based sampling.
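To illustrate the first point, here is a minimal standalone sketch (not from the PR's script) of how the two distribution types drain a same-seeded engine differently. On common standard-library implementations the `float` distribution consumes one 32-bit draw per sample while the `double` one consumes two, so the streams desynchronize on top of the precision difference:

```cpp
#include <cstdio>
#include <random>

int main() {
    std::mt19937 rng_f(42);
    std::mt19937 rng_d(42);
    std::uniform_real_distribution<float>  dist_f(0.0f, 1.0f);
    std::uniform_real_distribution<double> dist_d(0.0, 1.0);

    for (int i = 0; i < 4; ++i) {
        // Same seed, same distribution parameters, yet the sequences
        // diverge because each sample consumes a different number of
        // raw engine calls depending on the real type.
        printf("float: %.9f   double: %.17f\n",
               (double) dist_f(rng_f), dist_d(rng_d));
    }
    return 0;
}
```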
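On the `soft_max`/`cumsum` point: float addition is not associative, so a reduction split across cores/SMs can land on slightly different values than a sequential scan, which is enough to flip a decision right at a top-p cutoff. A minimal sketch of the effect (the chunking below is only a stand-in for how a GPU backend might partition the work, not ggml's actual schedule):

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    std::vector<float> probs(10000);
    for (size_t i = 0; i < probs.size(); ++i) {
        probs[i] = std::exp(-0.001f * (float) i); // decaying fake probability mass
    }

    // Sequential accumulation, as a CPU-side sampler would do it.
    float seq = 0.0f;
    for (float p : probs) seq += p;

    // Chunked accumulation: each chunk summed independently (as a GPU
    // block would), then partials combined. Different rounding order.
    float par = 0.0f;
    const size_t chunk = 128;
    for (size_t b = 0; b < probs.size(); b += chunk) {
        float partial = 0.0f;
        for (size_t i = b; i < b + chunk && i < probs.size(); ++i) {
            partial += probs[i];
        }
        par += partial;
    }

    printf("sequential: %.8f\nchunked:    %.8f\ndelta:      %.3g\n",
           seq, par, (double) (seq - par));
    return 0;
}
```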
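On the thresholding comparison: the two schemes are mathematically equivalent and differ only in where the normalization happens, which is presumably why the comparison did not surface many differences. A sketch of both (function names are hypothetical, not the PR's code); in exact arithmetic they always pick the same token, in float they can occasionally disagree near the threshold:

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Variant A: normalize first, then scan cumsum(probs) against r.
int sample_via_probs(const std::vector<float> & logits, float r) {
    float z = 0.0f;
    for (float l : logits) z += std::exp(l);
    float acc = 0.0f;
    for (size_t i = 0; i < logits.size(); ++i) {
        acc += std::exp(logits[i]) / z;   // cumsum of softmax probabilities
        if (acc >= r) return (int) i;
    }
    return (int) logits.size() - 1;
}

// Variant B: scan cumsum(exp(logits)) against a scaled threshold r * z.
int sample_via_expsum(const std::vector<float> & logits, float r) {
    float z = 0.0f;
    for (float l : logits) z += std::exp(l);
    float acc = 0.0f;
    for (size_t i = 0; i < logits.size(); ++i) {
        acc += std::exp(logits[i]);       // cumsum of unnormalized mass
        if (acc >= r * z) return (int) i; // scale the threshold instead
    }
    return (int) logits.size() - 1;
}

int main() {
    std::vector<float> logits = {1.0f, 0.5f, 0.25f, -1.0f};
    for (float r : {0.1f, 0.5f, 0.9f}) {
        printf("r=%.2f  probs:%d  expsum:%d\n",
               r, sample_via_probs(logits, r), sample_via_expsum(logits, r));
    }
    return 0;
}
```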
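For reference, the `std::partial_sort` pattern the CPU-side samplers rely on looks roughly like this (illustrative values, not llama.cpp code). It sorts only the first k positions and leaves the tail in unspecified order, which is cheaper than a full sort when k is much smaller than the vocabulary:

```cpp
#include <algorithm>
#include <cstdio>
#include <functional>
#include <vector>

int main() {
    std::vector<float> logits = {0.2f, 3.1f, -0.5f, 2.4f, 1.7f, 0.9f};
    const size_t k = 3;
    // Place the k largest values, sorted descending, at the front;
    // elements past logits.begin() + k end up in unspecified order.
    std::partial_sort(logits.begin(), logits.begin() + k, logits.end(),
                      std::greater<float>());
    for (size_t i = 0; i < k; ++i) {
        printf("top[%zu] = %.2f\n", i, logits[i]);
    }
    return 0;
}
```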
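And a sketch of why the warmup behavior matters: one stray draw on a same-seeded engine permanently offsets the stream, so every subsequent sample differs between the two paths even though everything else matches.

```cpp
#include <cstdio>
#include <random>

int main() {
    std::mt19937 cpu_rng(1234);
    std::mt19937 gpu_rng(1234);
    gpu_rng(); // the one extra engine call made during warmup

    std::uniform_real_distribution<float> dist(0.0f, 1.0f);
    for (int i = 0; i < 3; ++i) {
        // Every pair disagrees: the streams are offset by one forever.
        printf("cpu: %.6f   gpu: %.6f\n", dist(cpu_rng), dist(gpu_rng));
    }
    return 0;
}
```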
Filed this as a separate PR, as I'm not sure we want the vibe-coded script in the main llama.cpp repo.