
Question

I'm working with large language models like Llama2 and need to understand all sources of randomness during inference for reproducibility and testing.

I assume that the primary source of randomness is in token sampling from the output distribution. Specifically:

1. Softmax converts logits to probabilities (deterministic):

$$ p_i = \frac{\exp(z_i / \tau)}{\sum_{j=1}^{V} \exp(z_j / \tau)} $$

where $z_i$ are the logits, $\tau$ is temperature, and $V$ is vocabulary size.

2. Token sampling from the categorical distribution (random):

$$ \text{next\_token} \sim \text{Categorical}(p_1, p_2, \ldots, p_V) $$

This is typically implemented via inverse transform sampling:

  1. Generate $u \sim \text{Uniform}(0, 1)$ ← Randomness here
  2. Compute CDF: $F(k) = \sum_{i=1}^{k} p_i$
  3. Select: $\text{next\_token} = \min\{k : F(k) \geq u\}$
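For concreteness, here is a minimal PyTorch sketch of the three steps above (the function name `sample_next_token` and the toy vocabulary size are illustrative, not from any library); the uniform draw is the only stochastic step, and it is conceptually the same draw that `torch.multinomial(probs, 1)` would make:

```python
import torch

def sample_next_token(logits, tau=1.0, generator=None):
    # 1. Temperature-scaled softmax (deterministic given the logits).
    probs = torch.softmax(logits / tau, dim=-1)
    # 2. Inverse transform sampling: drawing u is the only random step.
    u = torch.rand(1, generator=generator)      # u ~ Uniform(0, 1)
    cdf = torch.cumsum(probs, dim=-1)           # F(k) = sum_{i<=k} p_i
    return int(torch.searchsorted(cdf, u))      # min{k : F(k) >= u}

# Seeding an explicit generator makes the draw reproducible.
g = torch.Generator().manual_seed(0)
logits = torch.randn(32000, generator=g)        # vocab-sized toy logits
print(sample_next_token(logits, tau=0.7, generator=g))
```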

My Questions

Is token sampling the ONLY source of randomness during inference?

  1. Other sources: Are there other hidden sources of randomness I should be aware of (e.g., in layer normalization, residual connections, KV cache, CUDA operations)?

  2. Framework-specific: I assume that controlling this randomness depends on the framework in use (PyTorch vs. TensorFlow vs. Ollama). Any pointers on what settings are needed for PyTorch specifically?


2 Answers


The primary source of randomness is indeed the token sampling (temperature, top-k, etc.).

However, when doing inference on GPUs, it is nearly impossible to stop all randomness. There is quite a bit of non-determinism in GPU operations (floating-point addition, for example) and kernel execution. Floating-point addition is not associative, and chunked reductions, as they are usually implemented, do not fix the accumulation order; since the work is spread over many threads, the outcomes are not always exactly the same. The only way to stop this completely is to switch to single-threaded CPU execution, which is just not feasible. The randomness that non-deterministic GPU ops introduce is minimal and usually just neglected, but it does matter if you try to compare the outputs of multiple runs to one another.
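As a tiny illustration of the non-associativity point (a sketch, not tied to any particular GPU kernel): summing the same float32 values all at once versus in chunks already gives slightly different results, even on the CPU.

```python
import torch

torch.manual_seed(0)
x = torch.rand(1_000_000)                 # float32 values

total_at_once = x.sum()                   # one accumulation order
total_chunked = torch.stack([c.sum() for c in x.split(1000)]).sum()  # another

print(total_at_once - total_chunked)      # typically small but nonzero
```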

So yes, there are other sources. Most frameworks provide ways of minimizing them: setting seeds, disabling non-deterministic algorithms in favor of deterministic ones, and so on. However, computing on GPUs will always leave small sources of randomness; the only way to be truly deterministic is to compute on CPUs.
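For the PyTorch part of the question, these are the commonly used switches (a sketch, not an exhaustive list; exactly which ops they cover depends on your PyTorch/CUDA version):

```python
import os
import torch

# Per the PyTorch reproducibility notes, cuBLAS needs this for deterministic
# GEMMs; set it before any CUDA work is done.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

torch.manual_seed(0)                    # seeds the CPU RNG and all CUDA RNGs
torch.cuda.manual_seed_all(0)           # explicit, in case of multiple GPUs

# Prefer deterministic kernels; warn (or raise, with warn_only=False)
# when an op has no deterministic implementation.
torch.use_deterministic_algorithms(True, warn_only=True)

# cuDNN-specific switches (mostly relevant for convolutions).
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
```

Even with all of these set, bit-exact results should only be expected on the same hardware, driver, and library versions.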


Robin already posted a great answer. I just want to emphasize that even upgrading a math library may change the "deterministic" output of the model. For example, torch.matmul returns a slightly different result than a naive matrix multiplication, even when both run on the CPU. Since these small deviations accumulate across model layers, they may lead to different generated tokens.
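A quick way to see this (a hypothetical sketch, not from the answer): compare an optimized matmul against a naive fixed-order multiplication on the CPU and look at the largest discrepancy.

```python
import torch

torch.manual_seed(0)
A = torch.randn(64, 64)
B = torch.randn(64, 64)

fast = A @ B                                      # optimized BLAS matmul
naive = torch.zeros(64, 64)
for i in range(64):
    for j in range(64):
        naive[i, j] = (A[i, :] * B[:, j]).sum()   # one fixed summation order

print((fast - naive).abs().max())                 # usually a tiny nonzero value
```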

