
Question

I'm working with large language models like Llama2 and need to understand all sources of randomness during inference for reproducibility and testing.

I assume that the primary source of randomness is in token sampling from the output distribution. Specifically:

1. Softmax converts logits to probabilities (deterministic):

$$ p_i = \frac{\exp(z_i / \tau)}{\sum_{j=1}^{V} \exp(z_j / \tau)} $$

where $z_i$ are the logits, $\tau$ is temperature, and $V$ is vocabulary size.

2. Token sampling from the categorical distribution (random):

$$ \text{next\_token} \sim \text{Categorical}(p_1, p_2, \ldots, p_V) $$

This is typically implemented via inverse transform sampling:

  1. Generate $u \sim \text{Uniform}(0, 1)$ ← Randomness here
  2. Compute CDF: $F(k) = \sum_{i=1}^{k} p_i$
  3. Select: $\text{next\_token} = \min\{k : F(k) \geq u\}$
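For concreteness, here is a minimal PyTorch sketch of the three steps above (the function name `sample_next_token` and the toy vocabulary size are illustrative, not from any library); the uniform draw is the only stochastic step, and it is conceptually the same draw that `torch.multinomial(probs, 1)` would make:

```python
import torch

def sample_next_token(logits, tau=1.0, generator=None):
    # 1. Temperature-scaled softmax (deterministic given the logits).
    probs = torch.softmax(logits / tau, dim=-1)
    # 2. Inverse transform sampling: drawing u is the only random step.
    u = torch.rand(1, generator=generator)      # u ~ Uniform(0, 1)
    cdf = torch.cumsum(probs, dim=-1)           # F(k) = sum_{i<=k} p_i
    return int(torch.searchsorted(cdf, u))      # min{k : F(k) >= u}

# Seeding an explicit generator makes the draw reproducible.
g = torch.Generator().manual_seed(0)
logits = torch.randn(32000, generator=g)        # vocab-sized toy logits
print(sample_next_token(logits, tau=0.7, generator=g))
```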

My Questions

Is token sampling the ONLY source of randomness during inference?

  1. Other sources: Are there other hidden sources of randomness I should be aware of (e.g., in layer normalization, residual connections, KV cache, CUDA operations)?

  2. Framework-specific: I assume that controlling this randomness depends on the framework in use (PyTorch vs. TensorFlow vs. Ollama). Any pointers on what settings are needed for PyTorch specifically?


2 Answers


The primary source of randomness is indeed the token sampling (temperature, top-k, etc.).

However, when doing inference on GPUs, it is nearly impossible to stop all randomness. There is quite a bit of non-determinism in GPU operations (floating-point addition, for example) and kernel execution. Floating-point addition is not associative, and chunked reductions, as they are usually implemented, do not fix the accumulation order; since the work is spread over many threads, the outcomes are not always exactly the same. The only way to stop this completely is to switch to single-threaded CPU execution, which is just not feasible. The randomness that non-deterministic GPU ops introduce is minimal and usually just neglected, but it does matter if you try to compare the outputs of multiple runs to one another.
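As a tiny illustration of the non-associativity point (a sketch, not tied to any particular GPU kernel): summing the same float32 values all at once versus in chunks already gives slightly different results, even on the CPU.

```python
import torch

torch.manual_seed(0)
x = torch.rand(1_000_000)                 # float32 values

total_at_once = x.sum()                   # one accumulation order
total_chunked = torch.stack([c.sum() for c in x.split(1000)]).sum()  # another

print(total_at_once - total_chunked)      # typically small but nonzero
```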

So yes, there are other sources. Most frameworks provide ways of minimizing them: setting seeds, disabling non-deterministic algorithms in favor of deterministic ones, and so on. However, computing on GPUs will always leave small sources of randomness; the only way to be truly deterministic is to compute on CPUs.
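For the PyTorch part of the question, these are the commonly used switches (a sketch, not an exhaustive list; exactly which ops they cover depends on your PyTorch/CUDA version):

```python
import os
import torch

# Per the PyTorch reproducibility notes, cuBLAS needs this for deterministic
# GEMMs; set it before any CUDA work is done.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

torch.manual_seed(0)                    # seeds the CPU RNG and all CUDA RNGs
torch.cuda.manual_seed_all(0)           # explicit, in case of multiple GPUs

# Prefer deterministic kernels; warn (or raise, with warn_only=False)
# when an op has no deterministic implementation.
torch.use_deterministic_algorithms(True, warn_only=True)

# cuDNN-specific switches (mostly relevant for convolutions).
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
```

Even with all of these set, bit-exact results should only be expected on the same hardware, driver, and library versions.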


Robin already posted a great answer. I just want to emphasize that even upgrading a math library may change the "deterministic" output of the model. For example, torch.matmul returns a slightly different result than a naive matrix multiplication, even when both run on the CPU. Since these small deviations accumulate across model layers, they may lead to different generated tokens.
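A quick way to see this (a hypothetical sketch, not from the answer): compare an optimized matmul against a naive fixed-order multiplication on the CPU and look at the largest discrepancy.

```python
import torch

torch.manual_seed(0)
A = torch.randn(64, 64)
B = torch.randn(64, 64)

fast = A @ B                                      # optimized BLAS matmul
naive = torch.zeros(64, 64)
for i in range(64):
    for j in range(64):
        naive[i, j] = (A[i, :] * B[:, j]).sum()   # one fixed summation order

print((fast - naive).abs().max())                 # usually a tiny nonzero value
```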

