Question
I'm working with large language models like Llama2 and need to understand all sources of randomness during inference for reproducibility and testing.
I assume that the primary source of randomness is in token sampling from the output distribution. Specifically:
1. Softmax converts logits to probabilities (deterministic):
$$ p_i = \frac{\exp(z_i / \tau)}{\sum_{j=1}^{V} \exp(z_j / \tau)} $$
where $z_i$ are the logits, $\tau$ is temperature, and $V$ is vocabulary size.
2. Token sampling from the categorical distribution (random):
$$ \text{next\_token} \sim \text{Categorical}(p_1, p_2, \ldots, p_V) $$
This is typically implemented via inverse transform sampling (see the sketch after this list):
- Generate $u \sim \text{Uniform}(0, 1)$ ← Randomness here
- Compute CDF: $F(k) = \sum_{i=1}^{k} p_i$
- Select: $\text{next\_token} = \min\{k : F(k) \geq u\}$
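In code, I picture these two steps roughly as follows. This is a minimal PyTorch sketch of my mental model; in practice one would just call `torch.multinomial`, and the helper name `sample_next_token` is mine for illustration:

```python
import torch

def sample_next_token(logits, temperature=1.0, generator=None):
    # Step 1: temperature-scaled softmax (deterministic given the logits).
    probs = torch.softmax(logits / temperature, dim=-1)

    # Step 2: inverse transform sampling from the categorical distribution.
    u = torch.rand(1, generator=generator)      # u ~ Uniform(0, 1)  <- the randomness
    cdf = torch.cumsum(probs, dim=-1)           # F(k) = sum_{i<=k} p_i
    next_token = torch.searchsorted(cdf, u)     # smallest k with F(k) >= u
    return int(next_token)

# Toy example: vocabulary of 5 tokens, with an explicit seeded generator for repeatability.
gen = torch.Generator().manual_seed(0)
logits = torch.tensor([2.0, 1.0, 0.5, 0.1, -1.0])
print(sample_next_token(logits, temperature=0.8, generator=gen))
```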
My Questions
1. Is token sampling the ONLY source of randomness during inference?
2. Other sources: Are there other hidden sources of randomness I should be aware of (e.g., in layer normalization, residual connections, the KV cache, or CUDA operations)?
3. Framework-specific: I assume controlling randomness depends on the framework in use (PyTorch vs. TensorFlow vs. Ollama). What settings are needed specifically for PyTorch? (My current attempt is sketched below.)
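For reference, this is roughly what I'm setting on the PyTorch side at the moment. The helper name `make_deterministic` is mine, and I'm not sure the list is complete or sufficient:

```python
import os
import random

import numpy as np
import torch

def make_deterministic(seed=0):
    random.seed(seed)                 # Python RNG (sometimes used by tokenizers/dataloaders)
    np.random.seed(seed)              # NumPy RNG
    torch.manual_seed(seed)           # seeds the CPU RNG (and current CUDA device)
    torch.cuda.manual_seed_all(seed)  # seeds all CUDA devices

    # Ask PyTorch to use deterministic kernels where it can.
    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

    # Needed by cuBLAS for deterministic GEMMs; ideally set before the first CUDA call,
    # shown here only for illustration.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
```

Even with these set, I'm not sure whether floating-point non-determinism in some CUDA kernels can still perturb the logits slightly between runs, which is part of what question 2 is asking.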