<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Ryan Pégoud on Medium]]></title>
        <description><![CDATA[Stories by Ryan Pégoud on Medium]]></description>
        <link>https://medium.com/@ryanpegoud?source=rss-27fba63b402e------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*ep3ebfAfE1csq42qzviNgg.jpeg</url>
            <title>Stories by Ryan Pégoud on Medium</title>
            <link>https://medium.com/@ryanpegoud?source=rss-27fba63b402e------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Mon, 06 Apr 2026 08:41:44 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@ryanpegoud/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Cutting LLM Memory by 84%, A Deep Dive into Fused Kernels]]></title>
            <link>https://medium.com/data-science-collective/cutting-llm-memory-by-84-a-deep-dive-into-fused-kernels-7028ca28bb75?source=rss-27fba63b402e------2</link>
            <guid isPermaLink="false">https://medium.com/p/7028ca28bb75</guid>
            <category><![CDATA[gpu-kernel]]></category>
            <category><![CDATA[llm]]></category>
            <category><![CDATA[triton]]></category>
            <category><![CDATA[deep-learning]]></category>
            <dc:creator><![CDATA[Ryan Pégoud]]></dc:creator>
            <pubDate>Thu, 19 Feb 2026 17:51:34 GMT</pubDate>
            <atom:updated>2026-02-19T17:51:34.405Z</atom:updated>
            <content:encoded><![CDATA[<h4>Why your final LLM layer is OOMing and how to fix it with a custom Triton kernel.</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*JpmyxyAZkCq56-pB" /><figcaption>Photo by <a href="https://unsplash.com/@zelebb?utm_source=medium&amp;utm_medium=referral">Andrey Matveev</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure><p>If you’ve ever trained or fine-tuned an LLM, you’ve likely hit a wall at the very last step: the <strong>Cross-Entropy Loss</strong>.</p><p>The culprit is the <strong>logit bottleneck</strong>. To predict the next token, we project a hidden state into a massive vocabulary space. For Llama 3 (128,256 tokens), the weight matrix alone is over <strong>525 million parameters</strong>. While that’s only ~1GB in bfloat16, the intermediate logit tensor is the real issue: for large batches, it can easily exceed <strong>80GB</strong> of VRAM just to compute a single scalar loss.</p><p>Optimising this layer is how libraries like Unsloth and Liger-Kernel achieve such massive memory reductions. In this article, we’ll build a fused Linear + Cross Entropy kernel from scratch in Triton. We will derive the math and implement a tiled forward and backward pass that slashes peak memory usage by <strong>84%</strong>.</p><blockquote><strong>Note on Performance:</strong> This implementation is primarily <strong>educational</strong>. We prioritise mathematical clarity and readable Triton code by using global atomic operations. While it solves the memory bottleneck, matching production-grade speeds would require significantly more complex implementations which are out of scope for this article.</blockquote><blockquote>This post is part of my Triton series. 
We’ll be using concepts like <a href="https://contributor.insightmediagroup.io/learning-triton-one-kernel-at-a-time-matrix-multiplication/">tiling</a> and <a href="https://contributor.insightmediagroup.io/learning-triton-one-kernel-at-a-time-softmax/">online softmax</a> that we’ve covered previously. If those sound unfamiliar, I recommend catching up there first!</blockquote><ul><li><a href="https://medium.com/data-science-collective/learning-triton-one-kernel-at-a-time-softmax-78e8ba73734d">Learning Triton One Kernel at a Time: Softmax</a></li><li><a href="https://medium.com/data-science-collective/learning-triton-one-kernel-at-a-time-matrix-multiplication-44851b4146dd">Learning Triton One Kernel at a Time: Matrix Multiplication</a></li><li><a href="https://medium.com/data-science-collective/learning-triton-one-kernel-at-a-time-vector-addition-5f57e9d2f3e1">Learning Triton One Kernel At a Time: Vector Addition</a></li></ul><h3>The Logit Bottleneck</h3><p>To get us started, let’s put some more numbers on the logit bottleneck. We consider an input matrix X with shape [NxD], a weight matrix W with shape [DxV] and a logit matrix Y=X@W with shape [NxV]. In the context of an LLM, N would be the sequence length multiplied by the batch size, D the size of the hidden state and V the vocabulary size. <br>For a Llama3 8B model, we would have a context window of 8192 tokens, a hidden state with 4096 dimensions and a vocabulary size of 128,256 tokens. Using a modest batch size of 8, we get N = 8192x8 = 65,536.<br>This results in the Y matrix having shape [NxV]=[65,536x128,256], or roughly <strong>8.4 billion</strong> elements. In bfloat16, this would take up <strong>16.8GB</strong> of memory. 
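</p><p>These figures are easy to sanity-check with a back-of-envelope calculation in plain Python (the shapes are the Llama3 8B numbers quoted above; the constant is simply the 2-byte width of bfloat16):</p>

```python
# Back-of-envelope footprint of the logit matrix Y = X @ W for Llama3 8B.
BATCH, SEQ_LEN, VOCAB = 8, 8192, 128_256
BYTES_BF16 = 2

n_rows = BATCH * SEQ_LEN                  # N = 65,536
elements = n_rows * VOCAB                 # ~8.4 billion logits
gigabytes = elements * BYTES_BF16 / 1e9   # footprint in bfloat16

print(f"{elements / 1e9:.1f}B elements -> {gigabytes:.1f} GB in bfloat16")
```

<p>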
However, if we follow best practices and use float32 for the loss calculation to ensure numerical stability, the requirements double to <strong>33.6GB</strong>.<br>To put this number in perspective, we would also need around 16GB of memory to hold the weights of Llama3 8B in bfloat16. On most GPUs, this leaves no space for the massive overhead of the optimiser states (e.g. Adam’s moments) and other activations, resulting in the infamous PyTorch OOM error.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*JvJQa8OlshjjkapnNkEmnw.png" /><figcaption>Representation of the input, weight and logit matrices along with their memory footprint. (All illustrations and animations in this article were made by the author unless specified otherwise)</figcaption></figure><p>Generally, this problem is dealt with by using:</p><ul><li><strong>Gradient accumulation:</strong> Use a smaller batch size and accumulate gradients over multiple batches between each optimiser step, emulating a larger batch size while holding less data in memory.</li><li><strong>Activation checkpointing:</strong> PyTorch stores all intermediate activations for reuse in the backward pass; checkpointing clears these activations and recomputes them on-the-fly during the backward pass. This leads to large memory savings but increases training time since the number of required forward passes is doubled.</li><li><strong>Micro-batching the loss:</strong> Instead of computing the loss over the N dimension at once, we can slice it and accumulate the loss over smaller chunks with size n &lt; N. 
Now, we only hold a slice of size [n, V] in memory at a time.</li><li><strong>Mixed precision training:</strong> Using half precision during training provides 2x memory reduction and significant speedups on Tensor Cores.</li></ul><p>While these solutions seem attractive, they all have significant drawbacks: gradient accumulation and activation checkpointing slow down training, mixed precision can be unstable, and micro-batching requires (slow) PyTorch-level iteration; even though n is chosen to be smaller than N, the vocabulary size remains huge in comparison.<br>More importantly, these solutions do not address the problem we have dealt with repeatedly throughout this series: <strong>data movement</strong>. Indeed, we are still wasting time by writing billions of logits to VRAM only to read them back milliseconds later.</p><h3>The Kernel Solution</h3><p>As we’ll see in a minute, the forward and backward passes of the cross-entropy loss involve dot products, matrix multiplication and a softmax. As we learned in this series, these are all operations that can be tiled efficiently. In other words, we can perform them iteratively while only holding a small piece of the inputs in memory at any time. <br>Furthermore, cross-entropy is generally preceded by a matrix multiplication: the linear projection from the hidden state into the vocabulary space. 
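</p><p>For reference, the memory-hungry baseline that all of these workarounds try to tame is just a couple of lines of PyTorch (toy shapes here; the problem only bites once N and V reach the tens of thousands):</p>

```python
import torch
import torch.nn.functional as F

N, D, V = 512, 256, 1024                   # toy sizes; Llama3 8B would be 65_536, 4096, 128_256
x = torch.randn(N, D, requires_grad=True)  # hidden states
w = torch.randn(D, V, requires_grad=True)  # vocabulary projection
y = torch.randint(0, V, (N,))              # target token ids

logits = x @ w                             # the full [N, V] tensor we want to avoid materialising
loss = F.cross_entropy(logits, y)
loss.backward()                            # autograd keeps logits alive until this step
```

<p>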
This is a great opportunity for <strong>operator fusion</strong>: fusing multiple operations within a single kernel, resulting in large speedups and potential memory gains.<br>In the following sections, we’ll take a look at how to derive and efficiently fuse the forward and backward passes through a kernel combining a linear layer with cross-entropy.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Xoa90UGcS06PWGchvYTlHw.png" /><figcaption>Illustration of the Llama3 architecture; the operations handled by the kernel are highlighted in purple.</figcaption></figure><p>As mentioned in the last article, Triton kernels do not natively register in PyTorch’s autograd. Therefore, we need to derive the gradient ourselves, a wonderful occasion to brush up on some calculus ;)</p><h3>The math behind Fused Linear Cross-Entropy</h3><h4>Definition and Forward Pass</h4><p>In this section, we derive the mathematical expression for our Fused Linear Cross-Entropy layer to see how it naturally lends itself to tiling.</p><p>For two discrete probability distributions p and q, cross-entropy is defined as:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*62uuCPhwHP70Dd-Bs-MkEg.png" /></figure><p>In our context, p is the <strong>one-hot vector</strong> representing the target token, while q is the <strong>model’s distribution</strong> over the vocabulary. We obtain q by applying a softmax to the logits l, themselves the outputs of the preceding linear layer.</p><p>Since p is positive for a single target token y, the summation collapses. We can then substitute the numerically stable softmax (as discussed in the <a href="https://contributor.insightmediagroup.io/learning-triton-one-kernel-at-a-time-softmax/">last article</a>) to derive the final expression:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*TIdSq20rxdqcbyEk-KnTFQ.png" /></figure><p>By substituting the logits l with the linear layer x . 
w, we see that the forward pass boils down to three primary quantities:</p><ol><li>The target logit x . w_y.</li><li>The log-sum-exp (LSE) of all dot products.</li><li>The global maximum logit used for numerical stability.</li></ol><p>Thanks to the online softmax algorithm, we can compute these quantities without ever materialising the full vocabulary in memory. Instead of an O(V) memory bottleneck, we iterate over the hidden dimension D and the vocabulary V in small tiles (D_block and V_block). This transforms the calculation into an O(1) register problem.</p><p>To parallelise this effectively, we launch one GPU program per row of the input matrix. Each program independently executes the following steps:</p><ol><li><strong>Pre-compute the target logit:</strong> Perform a tiled dot product between the current row of X and the column of W associated with token Y.</li><li><strong>Online reduction:</strong> Iterate through the hidden and vocabulary blocks to:<br> 1. Track the running maximum (m)<br> 2. Update the running sum of exponentials (d) using the online softmax formula:</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*W9Uyu5OE7tjWKKSK57LCKA.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*V7KDla0jVXhrPSC_XRINmw.png" /><figcaption>An example of tiled matrix multiplication for a single GPU program processing a row of <strong>X</strong>. The coloured squares represent elements loaded in memory and the coloured outline represent the complete tile that is iterated on. Tiling trades off speed for massive memory gains.</figcaption></figure><p>Now that we have a better understanding of the forward pass, let’s take a look at the derivation of the backward pass.</p><h3>Backward Pass</h3><h4>Notation</h4><p>To derive our gradients efficiently, we’ll use <strong>Einstein notation</strong> and the <strong>Kronecker delta</strong>.<br>In Einstein notation, repeated indices are implicitly summed over. 
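</p><p>As a side note, this summation convention maps directly onto <code>einsum</code>; a quick NumPy check (toy shapes of our choosing) is a handy way to verify the index manipulations that follow:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))    # [N, D]
W = rng.normal(size=(8, 16))   # [D, V]

# Y_ij = X_ik W_kj: the repeated index k is summed over implicitly.
Y = np.einsum("ik,kj->ij", X, W)
assert np.allclose(Y, X @ W)
```

<p>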
For example, a standard matrix multiplication Y = X@W simplifies from a verbose summation to a clean index pairing:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*aQLLEF9L7RjvaIzMtymm4w.png" /></figure><p>The <strong>Kronecker delta</strong> (δ_ij) is used alongside this notation to handle identity logic. It is equal to 1 if i=j and 0 otherwise. As we’ll see, this is particularly useful for collapsing indices during differentiation.</p><h4>Matrix Multiplication</h4><p>In this section, we derive the back-propagated gradients for matrix multiplication. We assume the existence of an upstream gradient <strong>ℓ</strong>.</p><p>To determine how it back-propagates through matrix multiplication, we apply the chain rule to the inputs x and the weight matrix w. Here y represents the multiplication’s outputs:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*SeY97JFDqjJQq66MG3Fn6Q.png" /></figure><p>We start by deriving the partial derivatives of y with respect to x, following these steps:</p><ol><li>Express y in terms of x and w.</li><li>Notice that w is constant with respect to x, so we can pull it out of the derivative.</li><li>Express the fact that the partial derivative of x_ik with respect to x_mn is 1 only when i=m and k=n using the Kronecker delta.</li><li>Notice that ẟ_kn enforces k=n; therefore, w_kj * ẟ_kn reduces to w_nj.</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*DhnD4S_t0KYkcYBdipovTA.png" /></figure><p>Then, we consider the full expression and obtain the gradient. 
We derive the last step by noticing once again that 1/y_ij * ẟ_im reduces to 1/y_mj.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*oBvJIlKxuYWqjxBIP1kCMQ.png" /></figure><p>However, matrix notation is conceptually closer to our Triton kernel; therefore, we rewrite this expression as a matrix multiplication by using the identity X_ij = [X^T]_ji:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*xJ9kdRBxWIqs_Vix8LZxow.png" /></figure><p>We follow the exact same steps to derive the gradient with respect to W:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ndxRutXIkQ2IFBm_OGmocA.png" /></figure><p>Then, the back-propagated gradient follows:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*fHVqr78r9tu2-7baHK_udg.png" /></figure><p>Which is equivalent to the matrix notation:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*F1sd2qM_0zz-dujMVpq6qQ.png" /></figure><h4>Cross-Entropy</h4><p>In this section, we’ll focus on cross-entropy applied to discrete probability distributions. Considering a tensor of j logits, with a label y, the cross-entropy is computed as follows:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*lEnTObrAEXOspl7BcHfpAg.png" /></figure><p>Where x_y corresponds to the logit associated with the label.<br>Once again, we are interested in the partial derivative of any output i with respect to any input k. 
Because of the normalising factor, every element i affects the value of every other element; therefore, the partial derivative is obtained by defining the function piecewise depending on the value of i:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*CmcW6cd1W513p4uuz4Zzyw.png" /></figure><p>Summing both cases, we obtain the gradient:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*iRDjwR2uxZMmo8llPQWbzw.png" /></figure><p>And in matrix notation:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*sORnvMDQzXDEktW438eOhg.png" /></figure><p>Where y_{one hot} is a vector of zeros with the entry corresponding to the label set to one. This result tells us that the gradient is simply <strong>the difference between the prediction and the ground truth</strong>.</p><h4>Fused Linear Cross-Entropy</h4><p>Combining the linear projection with cross-entropy in a single expression, we get:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Om6hbzQWTGKPN379SI1SVA.png" /></figure><p>Thanks to the chain rule, deriving the gradient of this expression boils down to multiplying the gradients we computed previously:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*XBjBJsM3tUQwKDqWKc--fw.png" /></figure><p>Where x and y refer to the inputs and outputs of the linear layer, respectively, and w to the associated weight matrix.</p><blockquote>Note: in a batched setting, we’ll need to reduce the W gradients over the batch dimension. Generally, we use a sum or mean reduction.</blockquote><h3>Kernel Implementation</h3><p>With the theory established, we can implement the fused kernel in Triton. Since cross-entropy is typically the final layer in a language model, we can combine the forward and backward passes into a <em>single kernel</em>. 
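</p><p>Before writing any Triton, it is worth sanity-checking the gradient expressions above numerically. A minimal NumPy sketch (toy shapes, mean-reduced loss; the helper and variable names are ours) compares the analytic gradients against finite differences:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, V = 4, 8, 16                               # toy shapes
X = rng.normal(size=(N, D))
W = rng.normal(size=(D, V))
y = rng.integers(0, V, size=N)                   # target token ids

def loss(X, W):
    """Mean cross-entropy of the fused linear layer via a stable log-sum-exp."""
    logits = X @ W
    m = logits.max(axis=1, keepdims=True)
    lse = m[:, 0] + np.log(np.exp(logits - m).sum(axis=1))
    return (lse - logits[np.arange(N), y]).mean()

# Analytic gradients from the derivation: dL/dlogits = (softmax(logits) - one_hot(y)) / N
logits = X @ W
P = np.exp(logits - logits.max(axis=1, keepdims=True))
P /= P.sum(axis=1, keepdims=True)
P[np.arange(N), y] -= 1.0
G = P / N
dX, dW = G @ W.T, X.T @ G

# Spot-check both against forward finite differences
eps = 1e-6
Wp = W.copy(); Wp[3, 5] += eps
assert abs((loss(X, Wp) - loss(X, W)) / eps - dW[3, 5]) < 1e-4
Xp = X.copy(); Xp[1, 2] += eps
assert abs((loss(Xp, W) - loss(X, W)) / eps - dX[1, 2]) < 1e-4
```

<p>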
This <strong>fusion</strong> offers two advantages: it minimises the overhead of multiple kernel launches and significantly improves data locality by keeping intermediate values on-chip.</p><p>We will analyse the kernel step-by-step from the perspective of a <strong>single program instance</strong>, which, in our parallelisation strategy, handles one specific row of the input matrix.</p><h4>1. Setup and Target Logit Pre-computation</h4><p>The initial phase involves standard Triton setup:</p><ul><li><strong>Program Identification:</strong> We use tl.program_id to determine which row of the input matrix the current program is responsible for.</li><li><strong>Parameter Initialisation:</strong> We define tiles using D_BLOCK and V_BLOCK and initialise the running maximum (m) and sum (d) required for the online softmax algorithm.</li><li><strong>Pointer Arithmetic:</strong> We calculate the base memory addresses for our tensors. Pointers for X (input) and dX (gradient) are offset using the <strong>row stride</strong> so each program accesses its unique token vector. Conversely, the W (weight) pointer remains at the base address because every program must eventually iterate through the entire vocabulary space.</li><li><strong>Masking and Early Exit:</strong> We define an ignore_index (defaulting to -100). If a program encounters this label (e.g. for padding tokens), it terminates early with a loss of 0 to save cycles.</li></ul><h4>2. Computing the Target Logit</h4><p>Before the main loop, we must isolate the <strong>target logit</strong> x . w_y. We iterate over the hidden dimension D in D_BLOCK chunks, performing a dot product between the input row X and the specific column of W corresponding to the ground-truth label Y.<br>Because W is a 2D matrix, calculating the pointers for these specific column tiles requires precise stride manipulation. 
The illustration below helps visualise how we “jump” through memory to extract only the necessary weights for the target token.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*RxiOsFSjCu5eaCONxY05Pg.png" /><figcaption>Representation of the pointer arithmetic executed to compute the target logit <strong>Y</strong>. Here, we consider that the label is <strong>4</strong>, meaning that the target logit is <strong>X</strong>’s dot product with <strong>W</strong>’s 5th column. Vectors of different colours represent different steps of the iteration along <strong>D</strong> (i.e. different values of <strong>d_idx</strong>). Numbers refer to the memory address of each element assuming a row-major layout.</figcaption></figure><p>Once the tiles are loaded, we cast them to float32 to ensure numerical stability and add their dot product to an accumulator variable before moving to the next iteration.</p><p>Here’s the code so far:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/1551a2685e31109843e60f87081c7533/href">https://medium.com/media/1551a2685e31109843e60f87081c7533/href</a></iframe><p>Next, we execute the forward pass, which processes the vocabulary space in two nested stages:</p><ol><li><strong>Tiled Logit Computation:</strong> We compute the logits for a V_BLOCK at a time. This is achieved by iterating over the vocabulary dimension V (outer loop) and the hidden dimension D (inner loop). Within the inner loop, we load a tile of X and a block of W, accumulating their partial dot products into a high-precision register.</li><li><strong>Online Softmax Update:</strong> Once the full dot product for a logit tile is finalised, we don’t store it to VRAM. Instead, we immediately update our running statistics: the maximum value m and the running sum of exponentials d using the online softmax formula. 
By doing this “on the fly”, we ensure that we only ever hold a small V_BLOCK of logits in the GPU’s registers at any given moment.</li></ol><p>Following these iterations, the final values of m and d are used to reconstruct the LSE. The final scalar loss for the row is then computed by subtracting the target logit (x . w_y) from this LSE value.</p><p>Here’s a visual representation of the forward pass:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*vX5pSTqezf8R-3UeDPl_0A.gif" /><figcaption>Visual representation of the tiled matrix multiplication with running statistics updates. At each step, we load elements coloured in green or dark blue and compute the dot products of vectors highlighted in green. Elements of <strong>Y</strong> are accumulated by iterating over the <strong>D</strong> dimension, when this is done (i.e. the cells are green), we update <strong>m</strong> and <strong>d</strong> based on the freshly computed tile.</figcaption></figure><p>Here’s the code for the forward pass:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/2d9c22ed7e470e2e8f9f1dc32078890d/href">https://medium.com/media/2d9c22ed7e470e2e8f9f1dc32078890d/href</a></iframe><p>We are now down to the last part of the kernel: the backward pass. Our goal is to compute the gradients with respect to X and W using the expression we derived earlier:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*M-jNpc6xu8L5des7YRM3Gw.png" /></figure><p>To remain memory-efficient, we once again process the vocabulary in tiles using a two-staged approach:</p><ol><li><strong>Recomputing Normalised Probabilities (</strong><strong>P):</strong> Because we didn’t store the full logit matrix during the forward pass, we must recompute the activations for each tile. By reusing the <strong>Log-Sum-Exp</strong> calculated in the forward pass, we can normalise these activations on-the-fly. 
Subtracting the one-hot ground-truth label Y from the normalised probabilities within this tile gives us a local chunk of the logit gradient, P.</li><li><strong>Gradient Accumulation:</strong> With a tile of P in hand, we calculate the partial gradients. For dX, we perform a dot product with blocks of W^T; for dW, we multiply by tiles of X^T. To safely aggregate these values across the entire batch, we use Triton’s <strong>tl.atomic_add</strong>.<br>This operation acts as a thread-safe +=, ensuring that different programs updating the same weight gradient do not overwrite one another.</li></ol><p>Here are some additional details on the implementation:</p><ul><li><strong>The Stride Swap:</strong> When computing P . W_T, we don’t actually need to physically transpose the massive W matrix in memory. Instead, we invert the shapes and strides in W’s block pointer to read the rows of W as columns of W^T. This results in a “free” transpose that saves both time and VRAM.</li><li><strong>Numerical Precision: </strong>It is worth noting that while X and W might be in bfloat16, the accumulation of dW and dX via atomic_add is usually performed in <strong>float32</strong> to prevent the accumulation of tiny rounding errors across thousands of rows.</li><li><strong>Contention Note:</strong> While atomic_add is necessary for dW (because every program updates the same weights), dX is private to each program, meaning there is zero contention between program IDs for that specific tensor.</li><li><strong>Atomic Add Masking:</strong> atomic_add doesn’t support block pointers. Therefore, we implement the pointer and mask logic for dW explicitly.</li></ul><p>The following figure is a representation of the backward pass for one iteration of the outer loop (i.e. 
one block along V and all blocks along D):</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*-Rs5JjNXG0iNTCtbSyVa_w.png" /><figcaption>Representation of the backward pass for a single step along the <strong>V</strong> dimension and a full iteration along the <strong>D</strong> dimension. In stage 4, we highlight how <strong>dX</strong> is accumulated over <strong><em>iterations</em></strong> (every program updates its private row once per step along <strong>V</strong>) whereas <strong>dW</strong> is accumulated over <strong>programs</strong> (<strong>N</strong> programs update the values of a single block in <strong>dW</strong> at every step along <strong>V</strong>).</figcaption></figure><p>Here’s the full code for the backward pass:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/2a6fc3bffebe0825d734234611aad19c/href">https://medium.com/media/2a6fc3bffebe0825d734234611aad19c/href</a></iframe><p>This concludes the implementation of our kernel! The full code including the kernel and benchmark script is available <a href="https://gist.github.com/RPegoud/76a2e9042b929889a158d7d17c81c9f7">here</a>.</p><h4>Memory Benchmark</h4><p>Finally, we compare our kernel with the PyTorch baseline using hyperparameters inspired by Llama3 and an A100 GPU. Specifically, we consider a sequence length of S=16,384, a batch size of B=1 and an embedding dimension of D=4096; the vocabulary size is set to V=128,256.<br>As expected, the PyTorch baseline allocates a massive intermediate tensor to store the activations, resulting in a peak memory usage of <strong>36.02GB</strong>. 
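</p><p>Peak figures like these can be read off with PyTorch’s CUDA memory statistics; a minimal probe (the <code>peak_gb</code> helper is ours; it assumes a CUDA device and skips the measurement otherwise) might look like this:</p>

```python
import torch
import torch.nn.functional as F

def peak_gb(fn):
    """Run fn and report the peak CUDA memory allocated, in GB (None on CPU-only hosts)."""
    if not torch.cuda.is_available():
        return None
    torch.cuda.reset_peak_memory_stats()
    fn()
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / 1e9

def baseline(S=16_384, D=4096, V=128_256):
    # Benchmark shapes from the text; the full run needs roughly 36GB of VRAM.
    x = torch.randn(S, D, device="cuda", dtype=torch.bfloat16, requires_grad=True)
    w = torch.randn(D, V, device="cuda", dtype=torch.bfloat16, requires_grad=True)
    y = torch.randint(0, V, (S,), device="cuda")
    F.cross_entropy((x @ w).float(), y).backward()

# A shorter sequence keeps the probe affordable on smaller GPUs.
print(peak_gb(lambda: baseline(S=1024)))
```

<p>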
In comparison, our Triton kernel reduces the peak memory usage by <strong>84%</strong>, allocating only <strong>5.04GB</strong> using D_BLOCK=64 and V_BLOCK=64!<br>Using even smaller block sizes would allow for further memory gains at the cost of efficiency.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*duVTP6fMbXquEZRFjjLkPQ.png" /></figure><h4>Atomic Limitations and Production Scaling</h4><p>In this article, we focused on the technical and mathematical intuition behind fused Linear Cross-Entropy kernels. We used atomic operations like tl.atomic_add to keep the code minimal and readable. However, while our kernel successfully slashed memory usage by a staggering <strong>84%</strong>, the Triton kernel is significantly slower than native PyTorch.<br>Unfortunately, the same atomic operations which make this kernel easier to write and comprehend come at the cost of a massive traffic jam since thousands of threads try to modify the same memory address at once. Generally, tl.atomic_add is performant when <em>contention is low</em>. In our current implementation, we have:</p><ol><li><strong>High Contention:</strong> For the weight gradient, every single program in the batch (up to 16,384 in our test) is trying to update the same memory tiles simultaneously. The hardware must serialise these updates, forcing thousands of threads to wait in line.</li><li><strong>Numerical Non-associativity:</strong> In computers, floating-point addition is <strong>non-associative</strong>. Rounding errors can accumulate differently depending on the order of operations, which is why correctness tests might pass on a T4 but fail on an A100: the latter has more streaming multiprocessors (SMs) performing more concurrent, non-deterministic additions.</li></ol><blockquote><strong>Note on Precision:</strong> On Ampere and newer architectures, the <strong>TF32</strong> format can further contribute to these discrepancies. 
For strict numerical parity, one should set allow_tf32=False or use higher precision types during the accumulation steps.</blockquote><h4>Path to Production</h4><p>To move beyond this educational implementation and toward a production-ready kernel (I recommend looking at the <a href="https://github.com/linkedin/Liger-Kernel/blob/main/src/liger_kernel/ops/fused_linear_cross_entropy.py">Liger-Kernel implementation</a>), one could implement several optimisations:</p><ul><li><strong>Replacing </strong><strong>dX Atomics:</strong> Since each program “owns” its row of X, we can use simple register accumulation followed by a tl.store, eliminating atomics for the input gradients entirely.</li><li><strong>A dedicated </strong><strong>dW Kernel:</strong> To optimise the computation of dW, production kernels generally use a different grid strategy where each program handles a block of W and iterates through the batch dimension, accumulating gradients locally before a single global write.</li><li><strong>Micro-batching:</strong> Advanced implementations, such as those in the <a href="https://github.com/linkedin/Liger-Kernel/blob/main/src/liger_kernel/ops/fused_linear_cross_entropy.py"><strong>Liger-Kernel</strong></a> library, process the sequence by blocks along the N dimension, making the memory scaling constant in the sequence length rather than linear. This enables the use of much larger batch sizes at a reduced memory cost.</li></ul><h3>Conclusion</h3><p>This concludes our deep dive into fused linear cross-entropy kernels. Thanks for reading all the way through; I hope this article gave you both the intuition and the practical understanding needed to build on these ideas and explore them further.<br>If you found this useful, consider sharing the article; it genuinely helps support the time and effort that goes into producing this work. 
And as always, feel free to <a href="https://www.linkedin.com/in/ryan-pegoud">contact me</a> if you have questions, thoughts, or ideas for follow-ups.</p><p>Until next time! 👋</p><h3>Sources</h3><ol><li><a href="https://ai.meta.com/blog/meta-llama-3/#:~:text=Llama%203%20uses%20a%20tokenizer,to%20substantially%20improved%20model%20performance">Introducing Meta Llama 3: The most capable openly available LLM to date</a></li><li><a href="https://www.youtube.com/watch?v=gWble4FreV4">LigerKernel (lecture)</a></li><li><a href="https://github.com/linkedin/Liger-Kernel/blob/main/src/liger_kernel/ops/fused_linear_cross_entropy.py">LigerKernel Linear Cross-Entropy Implementation</a></li><li><a href="https://github.com/unslothai/unsloth/blob/main/unsloth/kernels/cross_entropy_loss.py">Unsloth Implementation (cross-entropy only)</a></li></ol><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=7028ca28bb75" width="1" height="1" alt=""><hr><p><a href="https://medium.com/data-science-collective/cutting-llm-memory-by-84-a-deep-dive-into-fused-kernels-7028ca28bb75">Cutting LLM Memory by 84%, A Deep Dive into Fused Kernels</a> was originally published in <a href="https://medium.com/data-science-collective">Data Science Collective</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[AlpamayoR1: Large Causal Reasoning Models for Autonomous Driving]]></title>
            <link>https://medium.com/data-science-collective/alpamayor1-large-causal-reasoning-models-for-autonomous-driving-5b287216634c?source=rss-27fba63b402e------2</link>
            <guid isPermaLink="false">https://medium.com/p/5b287216634c</guid>
            <category><![CDATA[nvidia]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[autonomous-cars]]></category>
            <category><![CDATA[autonomous-driving]]></category>
            <category><![CDATA[autonomous-vehicles]]></category>
            <dc:creator><![CDATA[Ryan Pégoud]]></dc:creator>
            <pubDate>Thu, 19 Feb 2026 13:30:33 GMT</pubDate>
            <atom:updated>2026-04-05T11:13:34.814Z</atom:updated>
            <content:encoded><![CDATA[<h4>All you need to know about Chain of Causation reasoning and the current state of Autonomous Driving!</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*pM926G8reEMLhjp0.jpg" /><figcaption><a href="https://unsplash.com/photos/a-black-car-on-a-white-platform--FANRALn9WI">Photo</a> by Pramod Tiwari on Unsplash</figcaption></figure><p>Recently, Nvidia took the world of autonomous driving by storm with their new AlpamayoR1 architecture, integrating a large Vision-Language Model as a causally-grounded reasoning backbone. This release, accompanied by a new large-scale dataset and a photo-realistic driving simulator, already positions the company as one of the main players in the field in 2026.</p><p>In this article, we’ll break down the AlpamayoR1 architecture, chain of causation reasoning, as well as the elaborate training procedure used to train the model.</p><h3>The Current State of Autonomous Driving</h3><p>The release of AlpamayoR1 (AR1) finds context in the current paradigm of End-to-End (E2E) architectures. E2E models aim to map raw sensory inputs (cameras, LiDAR, radar, …) to trajectories in a fully differentiable architecture optimising a unified objective.</p><p>An emerging trend in E2E involves leveraging the extensive world knowledge of large Vision-Language Models (VLMs) to tackle complex driving situations. This generally involves using VLMs as reasoning backbones to inform future trajectories or as expert teachers to provide supervisory signal to smaller student models.</p><h3>The AR1 Architecture</h3><p>AR1 is a prime example of the reasoning-VLM-as-a-backbone approach. Despite its massive size, the architecture is optimised for real-world deployment and runs at a latency of <strong>99ms</strong> (i.e. <strong>10Hz</strong>) on a single Blackwell GPU, which is considered to be a general target for safety reasons.
In this section, we’ll break down the architecture and its numerous innovations.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*NRn8NA4bkFmbxVMk.png" /><figcaption>High-level overview of the AR1 architecture, source: [1]</figcaption></figure><h3>Vision Encoder</h3><p>AR1 uses both visual and textual inputs in the form of tokenised camera feeds and natural language instructions. For performance, it is crucial for the vision encoder to produce as few tokens as possible.</p><p>To this end, the authors used a Vision Transformer (ViT)[2] for single-image tokenisation. ViTs partition images into a sequence of patch tokens, which are then encoded by a regular transformer. Note that the integration of more efficient algorithms like Flex [3] for multi-video tokenisation is left for future work.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*uOufTuMslf_3ZdSA.png" /><figcaption>Vision Transformer architecture, source: [2]</figcaption></figure><h3>Reasoning Backbone</h3><p>The AR1 architecture is built around Cosmos-Reason, one of Nvidia’s VLMs trained specifically for embodied reasoning in Physical AI use cases. Its usual training set includes 3.7M general Visual Question-Answering (VQA) samples to improve the model’s physical common sense, complemented by 24.7K driving samples. These include video VQA annotated with DeepSeek-R1 reasoning traces to predict the next action.</p><p>Cosmos-Reason processes visual and text tokens along with the recent ego-history (past x-y positions and angle of the ego-vehicle) to output <strong>chain of causation</strong> reasoning traces to inform future trajectories.</p><h3>Chain of Causation</h3><p>A crucial limitation of language models lies in the inherent ambiguity of text labels in visual datasets. This includes vague descriptions lacking a causal structure.
Models trained on such data exhibit a low correlation between their reasoning traces and predicted actions as well as causal confusion.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*D_I34O4owPS3Gfxc.png" /><figcaption>Driving datasets tend to include vague annotations with weak causal grounding, source: [1]</figcaption></figure><p>For an embodied agent like an autonomous car, strong causal reasoning abilities are essential. To circumvent those problems, the Nvidia team deployed significant efforts to create a driving dataset with causally consistent annotations.</p><p>Specifically, the dataset contains 20-second clips extracted from real-world driving recordings in various environments and countries. Each clip contains 2 seconds of context leading to a driving decision (e.g. overtaking, yielding, passing an intersection, …) and its consequences. The causal structure of these scenarios is exposed by consistent textual annotations following a strict template.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*7R1Yxp_-vY2fAdTs.png" /><figcaption>Annotation pipeline for the Chain of Causation dataset, source: [1]</figcaption></figure><p>The first 10% of the dataset are annotated by humans, while the remainder are annotated by state-of-the-art VLMs like GPT5 to scale the labeling process. Once again, significant efforts are deployed to ensure the consistency, quality and correctness of these human and AI annotations.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*f3GZ_IkSLKSulkpH.png" /><figcaption>Examples of chain of causation reasoning produced by AR1, source: [1]</figcaption></figure><h3>Trajectory Decoder</h3><p>The last step of the forward pass consists in decoding the reasoning traces into a 64 point trajectory. While trajectories are usually decoded as a sequence of waypoints (x-y coordinates), the Nvidia team found that using unicycle dynamics (i.e. 
generating a sequence of acceleration values and steering angles) produced more consistent results. In particular, it facilitates the learning task by preventing the model from predicting physically impossible trajectories (e.g. point t being too far from point t+1).</p><p>Interestingly, the authors adopt a dual representation of the trajectory where the model auto-regressively generates discrete tokens during training and uses flow-matching to generate a continuous trajectory at inference time. The main reasons behind this design are as follows:</p><ol><li><strong>Joint Action-Reasoning Token Space:</strong> Using discrete action tokens allows for a tighter coupling between reasoning traces and actions. When the model generates a reasoning trace, the next tokens in the sequence (accelerations and curvatures) are mathematically linked to that explanation, preventing hallucinations.</li><li><strong>Facilitating RL Optimisation:</strong> Restricting the set of possible action tokens to a discrete set makes RL optimisation significantly easier. Indeed, sampling the correct token from a discrete vocabulary (e.g. ACCEL_NEG_2) is significantly easier than providing a gradient for a continuous value like -2.145 m/s^2. As we&#39;ll see in the next section, this enables RL post-training, which is crucial to improve the model&#39;s safety and consistency.</li><li><strong>Stronger Supervisory Signal: </strong>Using a cross-entropy loss on discrete tokens acts like a classification task and better captures the <em>multi-modality</em> (e.g. the distinct probability of turning left or right) than an MSE loss on coordinates.</li><li><strong>Flow Matching for Inference: </strong>While discrete tokens are great for learning, they typically result in jerky trajectories. Moreover, generating a sequence of 128 tokens auto-regressively is too slow for real-time inference.
To address those limitations, the authors introduce an action expert: a smaller variant of the main architecture using the KV cache (which contains visual tokens, historical motions and reasoning traces) to decode a continuous trajectory in one pass using flow-matching diffusion. This is one of the main reasons why AR1 can run at such low latency.</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*k5925Ql5Gwqx8Nyx.png" /><figcaption>Latency benchmark for several AR1 variants, generating trajectories via flow-matching saves close to 200ms at inference time. Source: [1]</figcaption></figure><h3>Supervised Fine-Tuning and RL Post-Training</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*OIg8S7kimizzUDdo.png" /><figcaption>Multi-stage training pipeline for the Cosmos-Reason backbone and the AR1 architecture, source: [1]</figcaption></figure><p>In order to transform the VLM backbone into a performant driving policy, it undergoes supervised fine-tuning (SFT) on the chain of causation dataset. Specifically, it learns to reproduce the reasoning traces and associated ground-truth actions by maximising the log-likelihood of the action-reasoning sequence:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*GmPhSn5mSArnzGYcsTd8hA.png" /><figcaption>Supervised Fine-Tuning loss, made by the author</figcaption></figure><p>However, SFT on its own is not enough. VLMs notoriously suffer from discrepancies between their reasoning and predicted actions. The static nature of open-loop datasets allows the model to mimic reasoning traces, but the lack of environmental feedback prevents it from truly internalising causal reactions.</p><p>Fortunately, RL post-training helps alleviate those limitations by providing inference feedback on the model’s rollouts. In this paper, the authors use RL for three main purposes:</p><ol><li><strong>Improving reasoning quality:</strong> a large reasoning model (e.g.
DeepSeek-R1) evaluates AR1’s reasoning traces to ensure there are no inconsistencies or hallucinations and assigns a discrete reward on a scale of 0 to 5 accordingly. While DeepSeek is not expected to be able to generate high-quality reasoning traces for driving, it is significantly easier for it to evaluate AR1’s reasoning; this is known as the <em>generation-verification gap.</em></li><li><strong>Enforcing reasoning-action consistency:</strong> the authors extract <em>meta-actions </em>(accelerate, steer, go straight, …) from the CoC dataset using rule-based systems. If those meta-actions correspond to those mentioned in the reasoning traces, the model receives an additional reward of 1, otherwise 0.</li><li><strong>Trajectory Quality:</strong> a trajectory reward measures the L2 distance between the predicted and expert trajectory and penalises trajectories leading to collisions and high-magnitude jerks.</li></ol><p>During post-training, AR1 generates multiple parallel rollouts and collects rewards <strong>r_i</strong> based on the three reward signals above. These rewards are then used to compute the GRPO loss [4]. GRPO computes the advantage of each rollout relative to the group average. This critic-free approach (as opposed to other RL algorithms like PPO) stabilises training by rewarding reasoning paths that outperform their counterparts for the same input, rather than relying on an arbitrary absolute score.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*RearKXfrUPvYGmNgY1ThFg.png" /><figcaption>GRPO loss, made by the author</figcaption></figure><p>All you need to understand about this objective is that it aims to maximise the probability of trajectories (the log term) with a high advantage (the softmax term) relative to others.
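To make the group-relative advantage concrete, here is a minimal sketch (my own illustration, not the paper’s code) of how an advantage can be computed from one group of rollout rewards:

```python
import numpy as np

def group_relative_advantage(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Advantage of each rollout relative to its group (GRPO-style).

    `rewards` holds one scalar reward per parallel rollout of the same input.
    """
    # Centre on the group mean: rollouts better than average get a positive advantage.
    centred = rewards - rewards.mean()
    # Normalise by the group standard deviation for scale invariance.
    return centred / (rewards.std() + eps)

# Example: 4 parallel rollouts scored by the reward signals above.
adv = group_relative_advantage(np.array([3.0, 5.0, 1.0, 3.0]))
```

Rollouts scoring above the group mean receive a positive advantage and are up-weighted by the objective; those below the mean are pushed down, without any learned value function providing an absolute baseline.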
To avoid losing vision-language priors from the VLM and the driving knowledge obtained during SFT, the objective is regularised by a KL divergence between the current policy and the reference (the policy obtained at the end of SFT).</p><h3>Evaluation</h3><p>The evaluation protocol includes 4 sections: Open-loop trajectory prediction, closed-loop simulation, ablation studies and on-vehicle road tests. While the fact that AR1 was deployed in real-world scenarios is impressive, the open and closed-loop results are somewhat opaque <em>in my opinion</em>; the main reason being that they were obtained on Nvidia datasets (open-loop: PhysicalAI-AV dataset; closed-loop: AlpaSim) released at the same time as the model. This implies a lack of baselines to contextualise AR1’s performances.</p><p>For instance, the closed-loop results only feature AR1 and a non-reasoning baseline on 75 scenarios. While AR1 outperforms the baseline on all measured metrics, it often does so by a single percent on average and with a much larger variance than the baseline.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*PS5eCTcxGOhxl0v2.png" /><figcaption>Closed-loop results for AR1 and a non-reasoning baseline, source: [1]</figcaption></figure><p>For this reason, I would advise taking these results with a grain of salt before other frontier architectures are evaluated in AlpaSim.</p><h3>Conclusion</h3><p>Despite the lack of contextualised results, AR1 and the accompanying datasets remain an impressive engineering achievement and a good indication of where autonomous driving is headed: end-to-end models inheriting world knowledge from massive VLMs trained on embodied tasks.</p><p>However, the collection of causally-grounded datasets required to enable chain of causation reasoning requires significant investments and labeling efforts, which limits reproducibility <em>until these datasets are made public.
</em>In my next article, I’ll contrast the AR1 approach with another state-of-the-art model which entirely dispenses with textual labels and instead trains VLMs to act and reason in a latent space.</p><h3>Thank you for reading this far!</h3><p>If you found this article useful, please consider <strong>sharing it</strong>; it genuinely helps support the time and effort that goes into producing this work. As always, feel free to reach out if you have questions, thoughts, or ideas for follow-ups. If you’d like to support my independent research and writing, feel free to <a href="https://buymeacoffee.com/ryanpegoud"><strong>buy me a coffee</strong></a> 😉</p><p>Until next time! 👋</p><h3>Sources</h3><ul><li>[1] <a href="http://arxiv.org/abs/2511.00088">Alpamayo-R1: Bridging Reasoning and Action Prediction for Generalizable Autonomous Driving in the Long Tail</a></li><li>[2] <a href="https://arxiv.org/pdf/2010.11929">(Vision Transformer) An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale</a></li><li>[3] <a href="https://arxiv.org/pdf/2512.10947">(Flex) Towards Efficient and Effective Multi-Camera Encoding for End-to-End Driving</a></li><li>[4] <a href="https://arxiv.org/pdf/2402.03300">(GRPO loss) DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models</a></li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=5b287216634c" width="1" height="1" alt=""><hr><p><a href="https://medium.com/data-science-collective/alpamayor1-large-causal-reasoning-models-for-autonomous-driving-5b287216634c">AlpamayoR1: Large Causal Reasoning Models for Autonomous Driving</a> was originally published in <a href="https://medium.com/data-science-collective">Data Science Collective</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Learning Triton One Kernel at a Time: Softmax]]></title>
            <link>https://medium.com/data-science-collective/learning-triton-one-kernel-at-a-time-softmax-78e8ba73734d?source=rss-27fba63b402e------2</link>
            <guid isPermaLink="false">https://medium.com/p/78e8ba73734d</guid>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[pytorch]]></category>
            <category><![CDATA[gpu]]></category>
            <category><![CDATA[triton]]></category>
            <category><![CDATA[kernel]]></category>
            <dc:creator><![CDATA[Ryan Pégoud]]></dc:creator>
            <pubDate>Sat, 27 Dec 2025 09:42:06 GMT</pubDate>
            <atom:updated>2026-01-14T12:57:51.186Z</atom:updated>
            <content:encoded><![CDATA[<h4>All you need to know to write a fast, readable and PyTorch-ready softmax kernel!</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*CLgcyvQSMqdXVbGD" /><figcaption>Photo by <a href="https://unsplash.com/@nanadua96?utm_source=medium&amp;utm_medium=referral">Nana Dua</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure><p>In the <a href="https://medium.com/data-science-collective/learning-triton-one-kernel-at-a-time-matrix-multiplication-44851b4146dd">previous article</a> of this series, we covered a ubiquitous operation in all fields of computer science: matrix multiplication. It is heavily used in neural networks to compute the activation of linear layers. However, activations on their own are difficult to interpret, since their values and statistics (mean, variance, min-max amplitude) can vary wildly from layer to layer. This is one of the reasons why we use activation functions, for example the logistic function (a.k.a. sigmoid) which projects any real number into the [0, 1] range.</p><p>The softmax function, also known as the normalised exponential function, is a multi-dimensional generalisation of the sigmoid. It converts a vector of raw scores (logits) into a <strong>probability distribution</strong> over <strong>M</strong> classes. We can interpret it as a <strong>weighted average</strong> that behaves as a <strong>smooth function</strong> and can be conveniently <strong>differentiated</strong>. It is a crucial component of dot-product attention, language modeling and multinomial logistic regression.</p><p>In this article, we’ll cover:<br>1. Implementing an efficient softmax kernel in Triton.<br>2. Implementing the backward pass (autograd).<br>3.
Optimisation: cache modifiers and auto-tuning.</p><p>If you aren’t familiar with Triton yet, refer to the previous articles!</p><ul><li><a href="https://medium.com/data-science-collective/learning-triton-one-kernel-at-a-time-vector-addition-5f57e9d2f3e1">Learning Triton One Kernel At a Time: Vector Addition</a></li><li><a href="https://medium.com/data-science-collective/learning-triton-one-kernel-at-a-time-matrix-multiplication-44851b4146dd">Learning Triton One Kernel at a Time: Matrix Multiplication</a></li></ul><p><em>Disclaimer: all the illustrations and animations are made by the author unless specified otherwise.</em></p><h3>Definition</h3><p>The softmax is defined as follows:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Zo1Sz5GQlSri1VDl6eFbzA.png" /></figure><p>The normalisation ensures that the vector sums to <strong>1</strong>, so that it can be interpreted as a valid probability distribution.</p><p>Note that this formulation of the softmax is highly sensitive to <strong>numerical overflow</strong>. Recall that the maximum value a standard <strong>float16</strong> can represent is <strong>65,504</strong>, which is roughly <strong>exp(11)</strong>. This means that any input value greater than ~11 will result in exp(z_i) exceeding the representable range, leading to <strong>overflow</strong>.</p><p>A common trick to mitigate this issue is to subtract the maximum value of the input vector from every element, such that the new maximum is <strong>0</strong> before exponentiation and <strong>1</strong> after.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*g8Fcxuur00vKJ0uuBNK2uQ.png" /></figure><h3>Naive Implementation</h3><p>As you can see, computing the softmax involves <strong>two reduction operations</strong>, a <strong>max</strong> and a <strong>sum</strong>. A naive algorithm requires three separate passes over the input vector:
First to compute the maximum, then the sum, and finally the normalised outputs.</p><p>Here’s what a naive Numpy implementation looks like:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/dd6af75416fdde06105c1e6af4eda67e/href">https://medium.com/media/dd6af75416fdde06105c1e6af4eda67e/href</a></iframe><p>A recurrent theme in this Triton series is minimising high-latency <strong>global memory access</strong>. Our current Numpy implementation requires three separate memory reads of the full input vector, which is highly inefficient.</p><h3>Online Softmax</h3><p>Fortunately, we can use a clever trick, known as the <strong>online softmax</strong>, to fuse the max and sum steps, reducing the number of memory reads to <strong>2</strong>. <br>First, we define the sum of exponentials recursively. In the following set of equalities, m_i refers to the maximum over x up to the <strong><em>i</em></strong>-th index.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*RnIQU8TqusIWdE6SokP5gw.png" /></figure><p>This equality allows us to compute the sum of exponentials <strong>iteratively</strong> using the maximum value <strong>so far</strong>. We can leverage it to fuse the first and second loops in the naive implementation and compute the maximum and sum of exponentials iteratively.</p><p>Our algorithm becomes:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*qilMIPTUuRTW-V90PPLHvw.png" /></figure><p>This is easily translated to Numpy:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/545c7470df17f08a4a50df16aeae48c1/href">https://medium.com/media/545c7470df17f08a4a50df16aeae48c1/href</a></iframe><p>Now that we understand the main principles behind the softmax, we’ll implement it in Triton, starting with the simple, single-block version and building up to the online, multi-block formulation.
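For reference, the online algorithm we are about to port can be sketched in NumPy (my own sketch, mirroring the update rule derived above):

```python
import numpy as np

def online_softmax(x: np.ndarray) -> np.ndarray:
    """Online softmax over a 1D vector: fuses the max and sum reductions."""
    m = -np.inf  # running maximum m_i
    d = 0.0      # running sum of exponentials, rescaled whenever m changes
    for xi in x:
        m_new = max(m, xi)
        # Rescale the previous sum by exp(m - m_new) before adding the new term.
        d = d * np.exp(m - m_new) + np.exp(xi - m_new)
        m = m_new
    # Second (and final) pass: normalise with the global max and fused sum.
    return np.exp(x - m) / d
```

The fused loop reads the input once for the max-and-sum pass and once for normalisation: two memory passes instead of three, which is exactly the saving the multi-block Triton kernel will exploit tile by tile.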
In the end, we want our kernel to behave like a PyTorch module and be compatible with autograd.</p><p>Unfortunately, from PyTorch’s point of view, Triton kernels behave like black boxes: the operations they perform are not traced by autograd. This requires us to implement the backward pass ourselves and explicitly specify how gradients should be computed. Let’s brush up on our beloved chain rule and derive the softmax gradient.</p><h3>Gradient</h3><p>Since the outputs of the softmax are strictly positive, we can use the <strong>logarithmic derivative</strong> to make the derivation of the gradient easier. Here, we take the derivative of the <strong>log</strong> of the output and apply the chain rule:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*PScWEwQ25s873DfP3ueBuw.png" /></figure><p>From there, we rearrange the terms and follow these steps:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*hQROp_vwcm_Ia-Ovr_fq3A.png" /></figure><p>Now assume that we have some upstream gradient, for example generated by a loss function <strong><em>L</em></strong> (e.g. a cross-entropy loss). We get the following expression of the gradient:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*3OjKQ_Y_pNG5AuQ6Cxj08g.png" /></figure><p>The simplification of the left term in <strong>(9)</strong> is due to the fact that δ_ij will only be equal to <strong>1</strong> for the <strong><em>i</em></strong>-th element, collapsing the sum over <strong>j</strong> to a single term.</p><h3>Triton Implementation</h3><h4>Single Block Softmax</h4><p>Now that we worked through the derivation of the gradient, we can write the forward and backward softmax kernels. First, let’s focus on the PyTorch wrapper to understand how the single block implementation works at a high level. Given a 2D input tensor, the forward and backward kernels are going to process all rows in parallel. 
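Before diving into the Triton code, here is a plain-PyTorch sketch of this wrapper pattern (my own illustration, with the kernel launches replaced by equivalent torch ops); it also doubles as a CPU check of the gradient formula derived in equation (10):

```python
import torch

class SoftmaxSketch(torch.autograd.Function):
    """Plain-PyTorch stand-in for the Triton-backed wrapper (illustration only)."""

    @staticmethod
    def forward(ctx, x: torch.Tensor) -> torch.Tensor:
        # Numerically stable row-wise softmax (max-subtraction trick).
        z = x - x.max(dim=1, keepdim=True).values
        e = torch.exp(z)
        y = e / e.sum(dim=1, keepdim=True)
        ctx.save_for_backward(y)  # cache the outputs for the backward pass
        return y

    @staticmethod
    def backward(ctx, grad_out: torch.Tensor) -> torch.Tensor:
        (y,) = ctx.saved_tensors
        # Equation (10): dL/dz_i = y_i * (g_i - sum_j g_j * y_j), row-wise.
        return y * (grad_out - (grad_out * y).sum(dim=1, keepdim=True))
```

Calling SoftmaxSketch.apply(x) on a (rows, cols) tensor matches torch.softmax(x, dim=1) in both the forward and backward pass; the Triton version keeps the same structure but swaps the torch ops for kernel launches.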
<br>For simplicity, we’ll define the BLOCK_SIZE to be large enough to handle all columns at once. Specifically, we’ll set it as the next power of 2 greater than or equal to the number of columns, as required by Triton. <br>Then, we’ll define our `grid` to be the number of rows (it could potentially also handle a batch dimension).</p><p>The PyTorch wrapper for our SoftmaxSingleBlock is a class inheriting from torch.autograd.Function that implements forward and backward. Both methods take a ctx argument, which we’ll use to cache the softmax outputs during the forward pass and reuse them during the backward pass.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/2bdeb4343db215d190ab46912a739ddf/href">https://medium.com/media/2bdeb4343db215d190ab46912a739ddf/href</a></iframe><p>Both kernels are pretty straightforward: we start by loading the row inputs using the same syntax as in my previous <a href="https://medium.com/data-science-collective/learning-triton-one-kernel-at-a-time-vector-addition-5f57e9d2f3e1"><strong>vector addition</strong></a><strong> </strong>article. Notice that BLOCK_SIZE and num_warps are computed using a <a href="https://github.com/unslothai/unsloth/blob/fd753fed99ed5f10ef8a9b7139588d9de9ddecfb/unsloth/kernels/utils.py#L43">calculate_settings</a> function.
This function comes from the <a href="https://unsloth.ai/"><strong>Unsloth</strong></a> library and was reused in other kernel libraries such as <a href="https://github.com/linkedin/Liger-Kernel"><strong>LigerKernel</strong></a> (which the kernels in this article are loosely based on); it provides heuristics to tune both variables:</p><pre>def calculate_settings(n: int) -&gt; tuple[int, int]:<br>    MAX_FUSED_SIZE = 65536 # maximum grid dimension on Nvidia GPUs<br>    BLOCK_SIZE = next_power_of_2(n)<br>    if BLOCK_SIZE &gt; MAX_FUSED_SIZE:<br>        # we remove this assertion in this article<br>        raise RuntimeError(<br>            f&quot;Cannot launch Triton kernel since n = {n} exceeds &quot;<br>            f&quot;the maximum CUDA blocksize = {MAX_FUSED_SIZE}.&quot;<br>        )<br>    num_warps = 4<br>    if BLOCK_SIZE &gt;= 32768:<br>        num_warps = 32<br>    elif BLOCK_SIZE &gt;= 8192:<br>        num_warps = 16<br>    elif BLOCK_SIZE &gt;= 2048:<br>        num_warps = 8<br>    return BLOCK_SIZE, num_warps</pre><p>Then, we implement the regular softmax for the forward pass and equation <strong>(10)</strong> for the backward pass. The only novelty here compared to previous articles is the use of cache modifiers, which tell the compiler how to cache and evict data. For now, we’ll only focus on three cache modifiers:</p><ul><li><strong>.ca</strong> (<strong>Cache at all levels</strong>): Tells the compiler to load the data in both L1 and L2 cache, suggesting that it might be reused soon.
This modifier should be used when the data is small enough to fit into L1 (~128–192KB per SM on an A100) and will likely be accessed repeatedly.</li><li><strong>.cs</strong> (<strong>Streaming</strong>): Treats the data as <strong>streaming</strong>: it will be used once and then discarded to free up space in L1.</li><li><strong>.wb</strong> (<strong>Write-back</strong>): Normal cached write; the data will remain in the cache hierarchy, which is useful if the output may be reused.</li></ul><p>In the following kernels, we’ll use the .ca modifier for loads since we perform multiple operations on the loaded data. For storing, we’ll use .cs in the forward pass, since the outputs won’t be immediately reused, and .wb in the backward pass, since in the context of autograd (i.e. the chain rule) gradient outputs will be consumed by downstream kernels.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/c1c5a1b9386d7b398a941def96172140/href">https://medium.com/media/c1c5a1b9386d7b398a941def96172140/href</a></iframe><h4>Multi-Block Softmax</h4><p>Now, let’s take a look at the online formulation of the softmax. In this section, we implement a multi-block variant of the previous kernel. This version will use BLOCK_SIZE &lt; n_cols; in other words, we’ll only load a tile with BLOCK_SIZE elements at a time, similar to how we handled tiled GEMM in the <a href="https://medium.com/data-science-collective/learning-triton-one-kernel-at-a-time-matrix-multiplication-44851b4146dd">last tutorial</a>. Now you might ask <strong>“how do we select the block size?”</strong>.</p><p>This is a great occasion to introduce Triton’s autotune utility. Provided with a list of configurations, autotune will perform a grid-search to determine and cache the best configuration for a specific input shape. This process is repeated every time a new input shape is passed to the kernel.
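As a sketch of how the decorator is wired up (a hypothetical configuration of mine, with the kernel body elided; the actual kernels follow below):

```python
import triton
import triton.language as tl

# Hypothetical illustration: autotune benchmarks each (BLOCK_SIZE, num_warps)
# pair on first launch, then caches the winner per distinct value of n_cols.
@triton.autotune(
    configs=[
        triton.Config({"BLOCK_SIZE": bs}, num_warps=nw)
        for bs in (256, 1024, 4096)
        for nw in (4, 8)
    ],
    key=["n_cols"],  # re-tune whenever the number of columns changes
)
@triton.jit
def softmax_fwd_mb(x_ptr, y_ptr, n_cols, BLOCK_SIZE: tl.constexpr):
    ...  # tiled online-softmax body goes here
```

Note that BLOCK_SIZE is not passed at the call site: autotune injects it from the winning config, which is why it appears as a constexpr parameter.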
<br>Here, we perform a grid search over the block size and number of warps using the following utility function:</p><pre>import triton<br>from itertools import product<br><br># --- Multi Block Tuning ---<br>BLOCK_SIZES = [256, 512, 1024, 2048, 4096, 8192]<br>NUM_WARPS = [2, 4, 8, 16]<br><br><br>def get_autotune_config(<br>    block_sizes: list[int], num_warps: list[int]<br>) -&gt; list[triton.Config]:<br>    return [<br>        triton.Config(kwargs={&quot;BLOCK_SIZE&quot;: bs}, num_warps=nw)<br>        for (bs, nw) in list(product(block_sizes, num_warps))<br>    ]</pre><p>We can now decorate our multi-block kernels with autotune and pass the list of configs; key=”n_cols” indicates that the optimal config is dependent on the number of columns of the input.<br>The implementation of these kernels is conceptually very close to the online softmax we covered before; the main difference is that we iterate over tiles (not over single elements like in Numpy), which requires some adjustments. For instance, we add a sum over the tile in the d update and the backward kernel now requires two iterations as well.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/afbb63ac1deb7ad9b28a617d941abcfd/href">https://medium.com/media/afbb63ac1deb7ad9b28a617d941abcfd/href</a></iframe><h3>Testing and Benchmarking</h3><p>We can now execute a forward and backward pass with both kernels and ensure they match the PyTorch baselines:</p><pre>import torch<br>from copy import deepcopy<br><br>def validate_kernel(kernel_fn: callable) -&gt; None:<br>    device = &quot;cuda:0&quot; if torch.cuda.is_available() else &quot;cpu&quot;<br>    torch.random.manual_seed(0)<br><br>    # Generate inputs<br>    x = torch.randn((256, 512), device=device) # triton input<br>    x.requires_grad = True<br>    xt = deepcopy(x) # torch input<br><br>    triton_output = kernel_fn(x)<br>    torch_output = torch.softmax(xt, dim=1)<br>    torch.testing.assert_close(triton_output, torch_output) # test fwd kernel<br><br>    # Setup fake 
labels<br>    y = torch.zeros_like(x)<br>    inds = (torch.arange(0, y.shape[0]), torch.randint(0, 3, (y.shape[0],)))<br>    y[inds] = 1<br><br>    # Define loss and run backward pass<br>    loss_fn = torch.nn.CrossEntropyLoss()<br>    loss = loss_fn(torch_output, y)<br>    loss.backward()<br><br>    # Save gradient tensor for later<br>    torch_xgrad = xt.grad.detach().clone()<br>    triton_loss = loss_fn(triton_output, y)<br>    triton_loss.backward()<br>    torch.testing.assert_close(x.grad, torch_xgrad) # test grad outputs<br><br>validate_kernel(softmax_sb)<br>validate_kernel(softmax_mb)</pre><p>Finally, we benchmark our implementation against the PyTorch baseline using the following snippet:</p><pre># --- Source: Triton softmax tutorial ---<br>@triton.testing.perf_report(<br>    triton.testing.Benchmark(<br>        x_names=[&quot;N&quot;],  # argument names to use as an x-axis for the plot<br>        x_vals=[<br>            128 * i for i in range(2, 100)<br>        ],  # different possible values for `x_name`<br>        line_arg=&quot;provider&quot;,  # argument name whose value corresponds to a different line in the plot<br>        line_vals=[<br>            &quot;triton_single_block&quot;,<br>            &quot;triton_multi_block&quot;,<br>            &quot;torch&quot;,<br>        ],  # possible values for `line_arg``<br>        line_names=[<br>            &quot;Triton_single_block&quot;,<br>            &quot;Triton_multi_block&quot;,<br>            &quot;Torch&quot;,<br>        ],  # label name for the lines<br>        styles=[(&quot;blue&quot;, &quot;-&quot;), (&quot;green&quot;, &quot;-&quot;), (&quot;red&quot;, &quot;-&quot;)],<br>        ylabel=&quot;GB/s&quot;,  # label name for the y-axis<br>        plot_name=&quot;softmax-performance&quot;,  # name for the plot. 
Used also as a file name for saving the plot.<br>        args={&quot;M&quot;: 4096},  # values for function arguments not in `x_names` and `y_name`<br>    )<br>)<br>def benchmark(M, N, provider):<br>    x = torch.randn(M, N, device=DEVICE, dtype=torch.float32)<br>    stream = getattr(torch, DEVICE.type).Stream()<br>    getattr(torch, DEVICE.type).set_stream(stream)<br>    if provider == &quot;torch&quot;:<br>        ms = triton.testing.do_bench(lambda: torch.softmax(x, axis=-1))<br>    if provider == &quot;triton_single_block&quot;:<br>        torch.cuda.synchronize()<br>        ms = triton.testing.do_bench(lambda: softmax_sb(x))<br>        torch.cuda.synchronize()<br>    if provider == &quot;triton_multi_block&quot;:<br>        torch.cuda.synchronize()<br>        ms = triton.testing.do_bench(lambda: softmax_mb(x))<br>        torch.cuda.synchronize()<br>    gbps = lambda ms: 2 * x.numel() * x.element_size() * 1e-9 / (ms * 1e-3)<br>    return gbps(ms)<br><br><br>benchmark.run(show_plots=True, print_data=True)</pre><p>Good news! Our single-block kernel consistently outperforms the PyTorch baseline while the multi-block variant falls off for inputs with more than 6k columns:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*M0kgF6TB7CUJY_Pcwy8JFw.png" /></figure><p>Considering larger inputs, we can make several observations:</p><ol><li>The multi-block kernel eventually stabilises around 900GB/s of throughput, surpassing the PyTorch baseline for inputs with more than 30k columns.</li><li>Interestingly, it seems like the multi-block variant will dominate for inputs with more than 60k columns.</li><li>Even though we exceed the maximum block size with the single-block variant, the kernel still runs smoothly. This is because Triton automatically manages the block size under the hood. <br>When n_cols is larger than the hardware limit, Triton will break down the input and iterate over it. 
However, this seems to be slower than the multi-block approach.</li></ol><p>To go further, we could combine both approaches behind a single wrapper that explicitly selects the optimal kernel based on the input size. This way, we would benefit from the high performance of the single-block kernel for small inputs and the higher throughput of the multi-block variant for inputs with more than 60k columns.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*lixMZTxFPwd_xZEubiSPJQ.png" /></figure><p>This concludes the third episode of this Triton series. Thanks again for your support!</p><p>In the next article, we’ll leverage the online softmax formulation in the context of <strong>Flash Attention</strong>.</p><p>Until next time! 👋</p><h3>Resources:</h3><ul><li><a href="https://github.com/linkedin/Liger-Kernel/blob/main/src/liger_kernel/ops/softmax.py"><strong>LigerKernel Softmax Implementation</strong></a></li><li><a href="https://medium.com/data-science/derivative-of-the-softmax-function-and-the-categorical-cross-entropy-loss-ffceefc081d1"><strong>Softmax Gradient derivation by Thomas Kurbiel</strong></a></li><li><a href="https://medium.com/@hugo.rosenkranz/gpu-kernel-optimization-softmax-part-2-43ce9f8019e8"><strong>GPU kernel optimization: Softmax — Part 2 by Hugo Rosenkranz-costa</strong></a> (Cuda &amp; Triton kernels with more emphasis on profiling and hardware optimisation)</li><li><a href="https://courses.cs.washington.edu/courses/cse599m/23sp/notes/flashattn.pdf"><strong>From online softmax to FlashAttention by Zihao Ye</strong></a></li></ul><hr><p><a href="https://medium.com/data-science-collective/learning-triton-one-kernel-at-a-time-softmax-78e8ba73734d">Learning Triton One Kernel at a Time: Softmax</a> was originally published in <a href="https://medium.com/data-science-collective">Data Science Collective</a> on 
Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Learning Triton One Kernel at a Time: Matrix Multiplication]]></title>
            <link>https://medium.com/data-science-collective/learning-triton-one-kernel-at-a-time-matrix-multiplication-44851b4146dd?source=rss-27fba63b402e------2</link>
            <guid isPermaLink="false">https://medium.com/p/44851b4146dd</guid>
            <category><![CDATA[pytorch]]></category>
            <category><![CDATA[matrix-multiplication]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[triton]]></category>
            <category><![CDATA[gpu]]></category>
            <dc:creator><![CDATA[Ryan Pégoud]]></dc:creator>
            <pubDate>Fri, 14 Nov 2025 11:24:36 GMT</pubDate>
            <atom:updated>2025-12-22T11:59:55.563Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*Abp-FrfqgqehMRUx" /><figcaption>Photo by <a href="https://unsplash.com/@lucaskphoto?utm_source=medium&amp;utm_medium=referral">Lucas Kepner</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure><p>Matrix multiplication is undoubtedly the most common operation performed by GPUs. It is the fundamental building block of linear algebra and shows up across a wide spectrum of fields such as graphics, physics simulations and scientific computing while being ubiquitous in machine learning.</p><p>In today’s article, we’ll break down the conceptual implementation of general matrix-matrix multiplication (GEMM) while introducing several optimisation concepts such as tiling and memory coalescing. Finally, we’ll implement GEMM in Triton!</p><p><em>This article is the second of a series on Triton and GPU kernels. If you are not familiar with Triton or need a refresher on GPU basics, check out the previous article! All the code showcased in this article is available on </em><a href="https://github.com/RPegoud/Triton-Kernels"><em>GitHub</em></a><em>.</em></p><p><a href="https://towardsdatascience.com/learning-triton-one-kernel-at-a-time-vector-addition/">Learning Triton One Kernel At a Time: Vector Addition | Towards Data Science</a></p><p><em>Disclaimer: all the following figures and animations were made by the author unless stated otherwise.</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Oho4e5iI5h_p7K6lOQjHSw.gif" /><figcaption>Parallel Tiled GEMM, as we’ll implement in this article!</figcaption></figure><h3>Naive GEMM</h3><p>Let’s start simple: we want to multiply two matrices X and Y with shapes (M,N) and (N,K) respectively. 
The output matrix Z=X@Y will therefore have shape (M,K).</p><p>This operation involves computing the dot products of all pairs of rows and columns in X and Y respectively. A straightforward NumPy implementation might look something like this:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/e77f6c75d006a84ded3aa31b3362bf02/href">https://medium.com/media/e77f6c75d006a84ded3aa31b3362bf02/href</a></iframe><p>While easy to write, read and understand, this implementation is highly inefficient in terms of memory access and caching. As mentioned in the first article of this series, a fundamental aspect of GPU optimisation is <strong>minimising data transfers</strong>.</p><p>However, our current implementation starts by loading a row from X, iteratively loads all K columns of Y, computes their dot product and repeats the process for every row in X. This results in a total of M(K+1) loading operations.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/800/1*G0i-4NVQFPNi6C7QVmSgiQ.gif" /><figcaption>Naive Matrix Multiplication, purple and blue tiles represent the vectors involved in dot products at every time step and green cells the computed output values.</figcaption></figure><p>As seen in the animation, the memory access pattern is wasteful, as every column of Y is loaded M times. As an analogy: this is like running to the grocery store (global memory) every time you need a new ingredient for a dish instead of preparing all the ingredients on your kitchen counter (shared memory). Ideally, we would like to minimise the number of times each chunk of data is loaded and maximise its reusability once loaded. 
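</p><p>To make the naive access pattern concrete, here is a minimal NumPy sketch of the triple loop described above (an illustrative stand-in for the embedded gist, not necessarily the exact code):</p>

```python
import numpy as np

def naive_matmul(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    """Naive GEMM: Z = X @ Y for X of shape (M, N) and Y of shape (N, K)."""
    M, N = X.shape
    N2, K = Y.shape
    assert N == N2, "inner dimensions must match"
    Z = np.zeros((M, K), dtype=X.dtype)
    for m in range(M):      # one row of X per outer iteration...
        for k in range(K):  # ...but all K columns of Y are reloaded for every row
            Z[m, k] = np.dot(X[m, :], Y[:, k])
    return Z
```

<p>Counting the loads makes the inefficiency explicit: each of the M rows triggers one row load plus K column loads, which is exactly the M(K+1) figure above.</p><p>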
This leaves us with two main axes of optimisation:</p><ol><li>How can we improve the access pattern to minimise redundant loads?</li><li>How much data can we load at once, and where should it be stored on the GPU?</li></ol><h3><strong>Tiled GEMM</strong></h3><p>As mentioned previously, the naive approach to GEMM results in many redundant loads, which induces unnecessary overhead. Ideally, we’d like to load each segment of data only once and perform all the operations in which they are used before dropping them from memory.</p><p>An elegant approach to this problem is <strong>tiling</strong>, which involves dividing large matrices into smaller <em>“tiles”</em> or sub-matrices. Consider two matrices X and Y with shapes (4,6) and (6,4) respectively; X@Y results in a matrix Z with shape (4,4).</p><p>In order to compute the first element of Z, Z[0,0], we need to compute the dot product between the first row of X and the first column of Y: Z[0,0] = dot(X[0, :], Y[:, 0]). We can also break down the dot product into smaller chunks, for instance in groups of 3 elements: Z[0,0] = dot(X[0,0:3], Y[0:3, 0]) + dot(X[0,3:6], Y[3:6, 0]).</p><p>Alternatively, we can expand this approach to two dimensions and compute an entire (2,2) block of Z at a time: Z[0:2, 0:2] = dot(X[0:2, 0:2], Y[0:2, 0:2]) + dot(X[0:2, 2:4], Y[2:4, 0:2]) + dot(X[0:2, 4:6], Y[4:6, 0:2]).</p><p>Here’s a visual representation of tiled matrix multiplication:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/800/0*4ZCyn04i9iKSnfB6.gif" /><figcaption>Tiled Matrix Multiplication. The computation is split into several “tiles” of X and Y (highlighted in pale blue and purple), each containing several blocks (dark blue and purple). In each block, we compute dot products (green cells in X and Y). 
These dot products are accumulated across the blocks of a tile to compute the output values in Z (the accumulation is represented by colors from orange to green).</figcaption></figure><p>The above animation illustrates how data is reused in tiled GEMM. For each 2x2 block in X and Y, we compute 4 dot products, which results in a (2,2) output matrix in Z. Since each tile contains 3 blocks, we need to accumulate 3 of these matrices to compute the final (2,2) output in Z. This accumulation is represented by colored cells in Z.</p><p>In the kitchen analogy, this is like fetching ingredients from the store and preparing them on the kitchen counter (i.e. small shared memory), reusing them several times before going back to the store.</p><p>Importantly, reusing loaded data over multiple steps allows this approach to drastically reduce the number of load operations. For (2,2) blocks, each X row and Y column is used in two dot products. Therefore, we’re performing <strong>twice as many operations </strong>with each block of loaded data, roughly <strong>halving</strong> the number of load operations! Note that this generalises to larger blocks as well, using a (32,32) block would reduce the number of loads by a factor of around 32.</p><p>Now you’re probably wondering “how large can these blocks be”? To answer this question, let’s recall how memory is managed in modern GPUs.</p><h3>GPU Memory Hierarchy</h3><p>We distinguish four main types of memory in Nvidia GPUs. Here, we take the example of an A100:</p><ul><li><strong>Registers: </strong>The fastest and smallest type of memory on the GPU, residing directly within each Streaming Multiprocessor (SM). On the A100, each SM provides <strong>256 KB of register file</strong> space (65,536 × 32-bit registers), distributed among its threads. Each thread gets its own private 32-bit registers for storing temporary variables and intermediate results, avoiding memory traffic altogether. 
However, register usage per thread directly affects occupancy, as using too many registers per thread limits how many threads can run concurrently.</li><li><strong>L1/Shared Memory</strong>: On an A100, each SM has 192KB of SRAM that can be <a href="https://docs.nvidia.com/cuda/ampere-tuning-guide/index.html?utm_source=chatgpt.com#unified-shared-memory-l1-texture-cache">flexibly configured</a> as either a hardware-managed <strong>L1 cache</strong> or a programmer-managed <strong>shared memory</strong>. For performance-critical kernels like matrix multiplication, we explicitly use this space as shared memory to stage data tiles close to the compute units, bypassing the L1 cache entirely. This gives us fine-grained control over data reuse.</li><li><strong>L2 cache</strong>: This cache is slower than L1 but much larger, with around <strong>40 MB shared across all SMs</strong> on the A100. It serves as a global cache for both data and instructions, reducing the number of accesses to high-latency HBM memory. The L2 cache is <strong>coherent across SMs</strong>, meaning that updates from one SM are visible to others, enabling synchronisation between thread blocks. Its bandwidth can reach several terabytes per second, acting as a buffer between the fast on-chip SRAM and the slower HBM.</li><li><strong>High Bandwidth Memory (HBM)</strong>: This is the device memory, it has a capacity of either 40GB or 80GB depending on the A100 model. It provides <strong>extremely high bandwidth</strong> (up to <strong>2 TB/s on the 80 GB variant) </strong>but with <strong>much</strong> <strong>higher latency</strong> than on-chip caches. HBM is where large tensors, model weights, and datasets reside during execution. Since accessing HBM is expensive, efficient kernels aim to <strong>minimise data movement</strong> and <strong>maximise on-chip data reuse</strong> via registers and shared memory.</li></ul><p>As you can see, the memory hierarchy generally trades off capacity with latency. 
Therefore, maximising performance boils down to loading data from HBM into shared memory efficiently and reusing it as much as possible.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*EGGxz6dEolx6F5IM_hbYkQ.png" /><figcaption>GPU Memory Hierarchy, from fastest/smallest (top) to slowest/largest (bottom).</figcaption></figure><p>Choosing our block size is critical. We want blocks to be large enough to create a lot of parallel work, but small enough that their data fits in the SM’s shared memory and registers. A BLOCK_SIZE of <strong>64</strong> is a common starting point because it&#39;s a multiple of the <strong>warp size</strong> (32 threads), ensuring full hardware utilisation.</p><h3>Parallel Tiled GEMM</h3><p>With these considerations in mind, a natural follow-up to our tiled GEMM is to parallelise the computation of each pair of tiles over several thread blocks, as depicted in the following animation.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/800/1*qqui6dSE8jWNbXR_OUIWFA.gif" /><figcaption>Parallel Tiled Matrix Multiplication. The iteration over tiles is replaced by a parallel operation over multiple thread blocks.</figcaption></figure><h3>Memory Coalescing</h3><p>Before writing tiled GEMM in Triton, we need to consider one last detail: <strong>memory coalescing</strong>, a technique that allows optimal use of global memory bandwidth. Memory coalescing is achieved when <strong>consecutive threads in a warp access consecutive memory addresses</strong>. Imagine a librarian needing to fetch books for a client: if all books are side-by-side on a shelf, they can grab them all at once. In contrast, if all books are lying on different shelves, they’ll have to grab them one by one, which takes significantly longer.</p><p>To understand how this applies to our case, note that matrices are stored linearly in memory; in other words, a (2,2) matrix is stored as a sequence of 4 consecutive elements. 
Frameworks like PyTorch adopt a <strong>row-major</strong> layout, meaning that elements of a matrix are <strong>per-row contiguous in memory</strong>. For instance, elements of our (2,2) matrix would be stored as follows: [(0,0), (0,1), (1,0), (1,1)]. Notice that elements of the same row are <em>contiguous </em>(touching) while consecutive elements of the same column are separated by a <em>stride</em> of 2 (one element lies between them).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ed2RfXNQMZ5_VbdC4_6GPA.png" /><figcaption>PyTorch stores matrices in row-major layout. Elements of a row are contiguous in memory while elements of a column are strided.</figcaption></figure><p>This implies that we can load rows using <strong>coalesced loads</strong>, but columns do <strong>not</strong> satisfy this condition. However, we need to access columns of Y to compute dot products. In order to maximise performance, a good practice is to transpose Y so that we iterate on its rows rather than its columns.</p><p>However, transposing Y isn’t enough to modify its layout in memory. As mentioned previously, PyTorch stores matrices in a flat array. Each matrix dimension is associated with a stride attribute, denoting the jump necessary to go from one element to the next one along this dimension. For instance, a (10,10) matrix would have strides=(10,1). Indeed, starting from element [0,0], element [1,0] is 10 memory slots (i.e. one row) away, whereas element [0,1] is adjacent.</p><p>When transposing a tensor, PyTorch doesn’t modify the layout in memory but simply recomputes the strides. 
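</p><p>We can check this behaviour directly in PyTorch; the following sketch (with an arbitrary (10,6) matrix) shows that .T only swaps the strides, while .contiguous() actually rewrites the data:</p>

```python
import torch

Y = torch.randn(10, 6)
print(Y.stride())          # (6, 1): row-major, rows are contiguous
Yt = Y.T                   # transpose: no data is moved...
print(Yt.stride())         # (1, 6): ...only the strides are swapped
print(Yt.is_contiguous())  # False: the underlying layout is unchanged
Yc = Yt.contiguous()       # materialises the transposed layout in memory
print(Yc.stride())         # (10, 1): rows of Y.T are now contiguous
```

<p>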
In order to make the transpose effective from a memory standpoint, we need to call Y.T.contiguous().</p><p>These are the steps required to load columns of Y efficiently; however, we’ll need to transpose the loaded blocks within the kernel to perform the dot product properly: z_block = tl.dot(X_block, Y_block.T).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*c-fP5mDPMOSWQdxLwjhXgw.png" /><figcaption>Representation of Y, Y.T and Y.T.contiguous() in their block representation and memory layout. The transpose operation changes the behaviour of the matrix but doesn’t modify its memory layout. This is why we need to add .contiguous() to enable coalesced reads on rows.</figcaption></figure><h3>Triton Implementation</h3><p>From here on, we first describe the kernel without memory coalescing to simplify the logic and pointer arithmetic before summarising the changes required to make the load operations coalesced on Y columns.</p><p>Let’s start by focusing on the PyTorch wrapper around the kernel. We need to read M, N, K from the input matrices and compute their strides since these constants will be useful later in the kernel. Then, we define the BLOCK_SIZE and declare the grid.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/2a750f12b2cc0a271e45fe730a684d25/href">https://medium.com/media/2a750f12b2cc0a271e45fe730a684d25/href</a></iframe><p>Now let’s dive into the actual kernel code. We’re going to make use of Triton’s make_block_ptr utility, which simplifies the pointer arithmetic. We create one block pointer per matrix and pass the matrix shape, its strides, and the size of the block as inputs. Additionally, we specify the offset, the coordinate of the top-left element in the current block. 
For X, this corresponds to (m_idx * BLOCK_SIZE, 0) where m_idx is the index of the current block along the M dimension.</p><p>From there, we define z_acc, a zero matrix that will receive the partial dot-products as we iterate through tiles. We now iterate through the shared dimension N, loading blocks of size (BLOCK_SIZE, BLOCK_SIZE), and accumulate their dot products in z_acc. We then move the block pointers along the shared dimension by using .advance.</p><p>You might have noticed that when loading data, we use boundary_check and padding_option instead of mask and other as in the previous article. These arguments are specific to the use of block pointers and specify which axes to check for out-of-bound operations (here (0,1) for x and y) and how to treat those invalid values. Here we set them to zero to be ignored in the dot product.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/15d95a3631676f7d2551882ad5d6ddb7/href">https://medium.com/media/15d95a3631676f7d2551882ad5d6ddb7/href</a></iframe><p>We can now take a look at the performance of this kernel by using the following function:</p><pre>def bench(fn: callable, x: torch.Tensor, y: torch.Tensor, repeat: int):<br>  flops = []<br>  med_latency = []<br><br>  for _ in tqdm(range(repeat), desc=f&quot;Benchmarking {fn.__name__}&quot;):<br>    latency_ms = triton.testing.do_bench(<br>      lambda: fn(x, y),<br>      quantiles=[0.5], # get the median latency<br>      return_mode=&quot;all&quot;,<br>      )<br>    n_flops = 2 * M * N * K # matmul roughly requires 2*M*N*K operations<br>    tflops = n_flops / (latency_ms / 1e3) / 1e12<br><br>    med_latency.append(latency_ms)<br>    flops.append(tflops)<br><br>  flops = np.array(flops)<br>  med_latency = np.array(med_latency)<br>  print(f&quot;Absolute Error: {torch.sum(torch.abs(X@Y - fn(x, y)))}&quot;)<br>  print(f&quot;Median Latency: {med_latency.mean():.4f} ± {med_latency.std():.3f} ms&quot;)<br>  
print(f&quot;Throughput: {flops.mean():.4f} ± {flops.std():.3f} TeraFLOPS&quot;)<br><br><br>M = 8192<br>N = 6144<br>K = 4096<br><br>X = torch.randn((M, N), device=&quot;cuda&quot;, dtype=torch.float32)<br>Y = torch.randn((N, K), device=&quot;cuda&quot;, dtype=torch.float32)<br><br>bench(block_matmul, X, Y, repeat=10)</pre><p>We get the following outputs (using a T4 GPU on Colab):</p><pre>Absolute Error: 0.0 # the kernel outputs the correct result!<br>Median Latency: 130.7831 ± 1.794 ms<br>Throughput: 3.1533 ± 0.043 TeraFLOPS</pre><p>Now let’s review the changes required for coalesced loads on Y: we mainly need to flip the shape, strides and offsets when defining the block pointer for Y. Additionally, we update the block pointer to move along the column dimension (previously row dimension). The full code for this implementation is available on <a href="https://github.com/RPegoud/Triton-Kernels">GitHub</a>.</p><pre>@triton.jit<br>def coalesced_block_matmul_kernel(<br>    X_ptr, X_m_stride, X_n_stride,<br>    Y_ptr, Y_k_stride, Y_n_stride,<br>    Z_ptr, Z_m_stride, Z_k_stride,<br>    M, N, K,<br>    BLOCK_SIZE: tl.constexpr,<br>):<br>    ... <br>    y_block_ptr = tl.make_block_ptr(<br>        base=Y_ptr,<br>        # flip the shape, strides and offsets to match Y.T<br>        shape=(K, N),<br>        strides=(Y_k_stride, Y_n_stride), <br>        offsets=(k_idx * BLOCK_SIZE, 0),<br>        block_shape=(BLOCK_SIZE, BLOCK_SIZE),<br>        order=(0, 1),<br>    )<br>    ...<br><br>    for _ in range(0, N, BLOCK_SIZE):<br>        ... 
# loads<br>        z_acc += tl.dot(x, y.T)  # transpose Y back for dot product<br>        x_block_ptr = tl.advance(x_block_ptr, offsets=(0, BLOCK_SIZE))<br>        # advance the block pointer along columns of Y.T (i.e rows of Y)<br>        y_block_ptr = tl.advance(y_block_ptr, offsets=(0, BLOCK_SIZE))<br><br>    tl.store(pointer=z_block_ptr, value=z_acc, boundary_check=(0, 1))<br><br>def coalesced_block_matmul(X, Y):<br>    Y = Y.T.contiguous()  # Y is now (K,N)<br>    M, N = X.shape<br>    K, _ = Y.shape<br>    Z = torch.empty((M, K), device=&quot;cuda&quot;)<br><br>    x_stride_m, x_stride_n = X.stride()<br>    y_stride_k, y_stride_n = Y.stride()<br>    z_stride_m, z_stride_k = Z.stride()<br><br>    ...  # define BLOCK_SIZE and grid<br><br>    coalesced_block_matmul_kernel[grid](<br>        X, x_stride_m, x_stride_n,<br>        Y, y_stride_k, y_stride_n,<br>        Z, z_stride_m, z_stride_k,<br>        M, N, K,<br>        BLOCK_SIZE,<br>    )<br><br>    return Z</pre><p>Here are the results of our benchmark for the kernel with coalesced loads for Y:</p><pre>Absolute Error: 0.0 # Again, the kernel is correct!<br>Median Latency: 261.9420 ± 0.858 ms<br>Throughput: 1.5741 ± 0.005 TeraFLOPS</pre><p>Surprisingly, the throughput of this second kernel is only half of what we obtained with the first one, despite improving the efficiency of load operations 🤔</p><p>A quick inspection using nsight (Nvidia’s kernel profiler, more on that in a future article) reveals that the transpose operation within the kernel creates a “traffic jam”. Specifically, the transpose creates <strong>bank conflicts</strong>, causing threads to remain idle most of the time. Notably, the warp scheduler has no eligible warp to dispatch 87.6% of the time as they are waiting for the bank conflict to resolve. 
Additionally, the report reads:</p><pre>----------------------- ----------- --------------<br>Metric Name             Metric Unit   Metric Value<br>----------------------- ----------- --------------<br>...<br>DRAM Throughput                   %           8.20<br>Compute (SM) Throughput           %          21.14<br>...</pre><p>This indicates that the kernel is <strong>latency bound</strong> (i.e. neither memory nor compute bound, refer to the previous article for more details). In contrast, the first kernel is <strong>compute bound </strong>(i.e. increasing compute will improve performance) since the compute throughput is high compared to the DRAM throughput.</p><pre>----------------------- ----------- --------------<br>Metric Name             Metric Unit   Metric Value<br>----------------------- ----------- --------------<br>...<br>DRAM Throughput                   %          29.35<br>Compute (SM) Throughput           %          74.39<br>...</pre><h3>Conclusion</h3><p>This experiment highlights the importance of profiling and empirical validation. Even well-intentioned optimisations like coalescing memory accesses can introduce new bottlenecks if not evaluated carefully. The first kernel, though simpler, was compute-bound and better matched the hardware characteristics.</p><p>In the next articles of this series, we’ll implement a softmax kernel, paying particular attention to integrating Triton with PyTorch&#39;s autograd and profiling kernels using Nsight.</p><p>Until next time! 
👋</p><h4>Useful Resources</h4><ul><li><a href="https://github.com/RPegoud/Triton-Kernels">Complete implementation</a></li><li><a href="https://www.cs.sfu.ca/~ashriram/Courses/CS7ARCH/hw/hw4.html">Introduction to GEMM and Assignment</a></li><li><a href="https://en.wikipedia.org/wiki/Ampere_(microarchitecture)#cite_note-15">Nvidia Ampere Architecture (A100 specs)</a></li></ul><hr><p><a href="https://medium.com/data-science-collective/learning-triton-one-kernel-at-a-time-matrix-multiplication-44851b4146dd">Learning Triton One Kernel at a Time: Matrix Multiplication</a> was originally published in <a href="https://medium.com/data-science-collective">Data Science Collective</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Learning Triton One Kernel At a Time: Vector Addition]]></title>
            <link>https://medium.com/data-science-collective/learning-triton-one-kernel-at-a-time-vector-addition-5f57e9d2f3e1?source=rss-27fba63b402e------2</link>
            <guid isPermaLink="false">https://medium.com/p/5f57e9d2f3e1</guid>
            <category><![CDATA[pytorch]]></category>
            <category><![CDATA[triton]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[kernel]]></category>
            <category><![CDATA[gpu]]></category>
            <dc:creator><![CDATA[Ryan Pégoud]]></dc:creator>
            <pubDate>Wed, 29 Oct 2025 14:20:54 GMT</pubDate>
            <atom:updated>2025-10-29T15:58:17.877Z</atom:updated>
            <content:encoded><![CDATA[<p>The basics of GPU programming, optimisation, and your first Triton kernel!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*WKrqisTspGEK5NHY" /><figcaption>Photo by <a href="https://unsplash.com/@omilaev?utm_source=medium&amp;utm_medium=referral">Igor Omilaev</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure><p>In the era of billion-parameter models, a little optimisation goes a long way. Models like <strong>GPT-4</strong> cost <strong>more than $100 million to train</strong>, which makes a <strong>1% efficiency gain</strong> <strong>worth<em> over a million dollars</em></strong>. A powerful way to optimise the efficiency of machine learning models is by writing some of their components <strong>directly on the GPU</strong>. Now if you’re anything like me, the simple mention of CUDA kernels is enough to send chills down your spine, as they are notoriously complex to write and debug. <br>Fortunately, <strong>OpenAI</strong> released <strong>Triton</strong> in 2021, a new language and compiler abstracting away much of CUDA’s complexity and allowing less experienced practitioners to write performant kernels. 
A notable example is <strong>Unsloth</strong>, an LLM-training service that promises <strong>30x faster training</strong> with <strong>60% less memory usage</strong>, all thanks to <strong>replacing layers written in PyTorch with Triton kernels</strong>.<br>In this tutorial series, we’ll learn the basics of GPU architecture and how to implement high-performance Triton kernels!</p><h3>GPU Architecture Basics</h3><p>In this section, we’ll go through the very basics of (<em>Nvidia</em>) GPUs to get us started and write our first Triton kernel by the end of this article.<br>Starting from the smallest software unit, we can describe the hierarchy of execution units as follows:</p><ul><li><strong>Threads</strong>: The smallest <strong>unit of work</strong>, they run the user-defined kernel code.</li><li><strong>Warps</strong>: The smallest <strong>scheduling unit</strong>, they are always composed of 32 parallel threads, each with their own instruction address counter and register state. Threads in a warp <strong>start together</strong> but are <strong>free to branch</strong> and <strong>execute independently</strong>.</li><li><strong>Thread Blocks</strong>: Group of warps, where all threads can <strong>cooperate via shared memory</strong> and sync barriers. It is required that thread blocks can execute <strong>independently</strong> and in any order, in parallel or sequentially. This independence allows thread blocks to be <strong>scheduled in any order across any number of cores</strong>, so that GPU programs scale efficiently with the number of cores. We can synchronise the threads within a block at specific points in the kernel if needed, for example to synchronise memory access.</li><li><strong>Streaming Multiprocessor (SM)</strong>: A unit in charge of <strong>executing many warps in parallel</strong>, it owns shared memory and an L1 cache (holds the most recent global-memory lines that the SM has accessed). 
An SM has a dedicated <strong>warp scheduler</strong> that pulls warps from the thread blocks that are ready to run.</li></ul><p>On the hardware side, the smallest unit of work is a <strong>CUDA core</strong>, the physical <strong>Arithmetic Logic Unit</strong> (ALU) which performs <strong>arithmetic operations for a thread</strong> (or parts of it).</p><p>To summarise this section with an analogy, we could see <strong>CUDA cores</strong> as <strong>individual workers</strong>, while a <strong>warp</strong> is a <strong>squad of 32 workers</strong> given the same instruction at once. They may or may not execute this task the same way (branching) and can potentially complete it at a different point in time (independence). A <strong>thread block</strong> is composed of <strong>several squads sharing a common workspace</strong> (i.e. they have shared memory); workers from all squads in the workspace can wait for each other to get lunch at the same time. A <strong>streaming multiprocessor </strong>is a <strong>factory floor with many squads working together and sharing tools and storage</strong>. Finally, the <strong>GPU</strong> is a <strong>whole plant</strong>, with many floors.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*kl-G_MdVLicCPZan12ciAA.png" /><figcaption>Hierarchy of an Nvidia GPU architecture.</figcaption></figure><h3>Optimisation Basics</h3><p>When optimising deep learning models, we are juggling with three main components:</p><ol><li><strong>Compute</strong>: Time spent by the GPU computing floating point operations (FLOPS).</li><li><strong>Memory</strong>: Time spent transferring tensors within a GPU.</li><li><strong>Overhead</strong>: All other operations (Python interpreter, PyTorch dispatch, …).</li></ol><p>Keeping those components in mind helps figure out the right way to resolve a bottleneck. For instance, increasing compute (e.g. using a more powerful GPU) doesn’t help if most of the time is spent doing memory transfers. 
Ideally though, most of the time should be spent on compute, more precisely on matrix multiplications, the operation GPUs are optimised for. <br>This implies minimising the cost paid to move data around, either from the CPU to the GPU (“<strong>data transfer cost</strong>”), from one node to the other (“<strong>network cost</strong>”) or from CUDA global memory (<strong>DRAM</strong>, cheap but slow) to CUDA shared memory (<strong>SRAM</strong>, expensive but fastest on-device memory). The latter is called the <strong>bandwidth cost</strong> and is going to be our main focus for now. Common strategies to reduce bandwidth costs include:</p><ol><li><strong>Reusing</strong> data loaded in shared memory for multiple steps. A prime example of this is tiled matrix multiplication, which we’ll cover in a future post.</li><li><strong>Fusing</strong> multiple operations in a single kernel (since every kernel launch implies moving data from DRAM to SRAM); for instance, we can fuse a matrix multiplication with an activation function. Generally, <strong>operator fusion</strong> can provide massive performance increases since it prevents a lot of global memory reads/writes, and any two operators present an opportunity for fusion.</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*LUnH4U23PfO-nBE9XNUCJA.png" /><figcaption>Matrix multiplication followed by a ReLU activation without operator fusion.</figcaption></figure><p>In this example, we perform a matrix multiplication x@W and store the result in an intermediate variable a. We then apply a ReLU to a and store the result in a variable y. This requires the GPU to read x and W from global memory, write the result to a, read a again and finally write to y.
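</p><p>To make the extra traffic concrete, here is a minimal NumPy sketch of the unfused version (shapes are illustrative assumptions; on a GPU, each of the two steps would be a separate kernel launch):</p>

```python
import numpy as np

def unfused_linear_relu(x, W):
    # Kernel 1: matmul -> the intermediate `a` is written to global memory.
    a = x @ W
    # Kernel 2: ReLU -> `a` is read back from global memory, `y` written out.
    y = np.maximum(a, 0.0)
    # Bytes of intermediate traffic that a fused kernel would avoid.
    return y, a.nbytes

x = np.random.randn(1024, 4096).astype(np.float32)
W = np.random.randn(4096, 4096).astype(np.float32)
y, intermediate_bytes = unfused_linear_relu(x, W)
print(intermediate_bytes)  # 16777216 bytes (~17 MB) written, then read again
```

<p>In this configuration, the intermediate tensor a alone costs roughly 17 MB of writes plus 17 MB of reads, only to be discarded afterwards.</p><p>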
Instead, operator fusion would allow us to halve the amount of reads and writes to global memory by performing the matrix multiplication and applying the ReLU in a single kernel.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ODwkR4DHvLH4rUNJPN1kog.png" /><figcaption>Fused matrix multiplication and ReLU activation.</figcaption></figure><h3>Triton</h3><p>We’ll now write our first Triton kernel, a simple vector addition. First, let’s walk through how this operation is broken down and executed on a GPU.</p><p>Suppose we want to sum the entries of two vectors X and Y, each with 7 elements (n_elements=7). <br>We’ll instruct the GPU to tackle this problem in chunks of 3 elements at a time (BLOCK_SIZE=3). Therefore, to cover all 7 elements of the input vectors, the GPU will launch 3 parallel “programs”, independent instances of our kernel, each with a unique program ID, pid:</p><ul><li>Program 0 is assigned elements 0, 1, 2.</li><li>Program 1 is assigned elements 3, 4, 5.</li><li>Program 2 is assigned element 6.</li></ul><p>These programs then write their results back to a vector Z stored in global memory.<br>An important detail is that a kernel doesn’t receive an entire vector X; instead, it receives a <strong>pointer to the memory address of the first element</strong>, X[0]. In order to access the actual values of X, we need to load them from global memory manually. <br>We can access the data for each block by using the program ID: block_start = pid * BLOCK_SIZE. From there, we can get the remaining element addresses for that block by computing offsets = block_start + range(0, BLOCK_SIZE) and load them into memory.<br>However, remember that program 2 is only assigned element 6, but its offsets are [6, 7, 8].
To avoid any indexing error, Triton lets us define a <strong>mask</strong> to identify valid target elements, here mask = offsets &lt; n_elements.<br>We can now safely load X and Y and add them together before writing the result back to an output variable Z in global memory in a similar way.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*PfCGXA_ZOkpTtjgkq5IGdg.png" /><figcaption>Per-block vector indexing. Slices of X, Y and Z are sent to independent thread blocks, each indexed by a unique ID.</figcaption></figure><p>Let’s take a closer look at the code, here’s the Triton kernel:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/fe9f4ffa10c444708170314dca784eff/href">https://medium.com/media/fe9f4ffa10c444708170314dca784eff/href</a></iframe><p>Let’s break down some of the Triton-specific syntax:</p><ul><li>First, a Triton kernel is always decorated by @triton.jit.</li><li>Second, some arguments need to be declared as static, meaning that they are known at compile-time. This is required for BLOCK_SIZE and is achieved by adding the tl.constexpr type annotation. The remaining arguments are ordinary runtime values and are left unannotated.</li><li>We use tl.program_id to access the ID of the current block; tl.arange behaves similarly to Numpy’s np.arange.</li><li>Loading and storing variables is achieved by calling tl.load and tl.store with arrays of pointers. Notice that there is no return statement; this role is delegated to tl.store.</li></ul><p>To use our kernel, we now need to write a <strong>PyTorch-level wrapper</strong> that provides memory pointers and defines a <strong>kernel grid</strong>. Generally, the kernel grid is a 1D, 2D or 3D tuple containing the <strong>number of thread blocks allocated to the kernel along each axis</strong>.
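</p><p>Since the embedded snippet may not render everywhere, here is a plain NumPy simulation of the same block/offset/mask logic (an illustration of the control flow only, not the actual Triton kernel; tl.program_id, tl.arange, tl.load and tl.store are replaced by array operations):</p>

```python
import numpy as np

def add_kernel_sim(x, y, z, pid, BLOCK_SIZE):
    # One "program": handles BLOCK_SIZE contiguous elements.
    block_start = pid * BLOCK_SIZE
    offsets = block_start + np.arange(BLOCK_SIZE)  # analogue of tl.arange
    mask = offsets < x.shape[0]                    # guard out-of-bounds offsets
    idx = offsets[mask]
    z[idx] = x[idx] + y[idx]                       # analogue of tl.load/tl.store

n_elements, BLOCK_SIZE = 7, 3
x = np.arange(n_elements, dtype=np.float32)
y = np.ones(n_elements, dtype=np.float32)
z = np.empty_like(x)
num_programs = -(-n_elements // BLOCK_SIZE)        # ceil(7 / 3) = 3 programs
for pid in range(num_programs):
    add_kernel_sim(x, y, z, pid, BLOCK_SIZE)
print(z)  # [1. 2. 3. 4. 5. 6. 7.]
```

<p>The Python loop runs the three programs sequentially; on the GPU, they execute in parallel on independent thread blocks.</p><p>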
In our previous example, we used a 1D grid of 3 thread blocks: grid = (3, ).<br>To handle varying array sizes, we default to grid = (ceil(n_elements / BLOCK_SIZE), ).</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/08d85e2be936f48da9feba1f04b26123/href">https://medium.com/media/08d85e2be936f48da9feba1f04b26123/href</a></iframe><p>Here are two final notes about the wrapper:</p><ol><li>You might have noticed that grid is defined as a lambda function. This allows Triton to compute the number of thread blocks to launch <strong>at launch time</strong>. Therefore, we compute the grid size based on the block size which is stored in meta, a dictionary of compile-time constants that are exposed to the kernel.</li><li>When calling the kernel, the value of output will be modified in-place, so we don’t need to reassign output = add_kernel[…].</li></ol><p>We can conclude this tutorial by verifying that our kernel works properly:</p><pre>x, y = torch.randn((2, 2048), device=&quot;cuda&quot;)<br><br>print(add(x, y))<br>&gt;&gt; tensor([ 1.8022, 0.6780, 2.8261, ..., 1.5445, 0.2563, -0.1846], device=&#39;cuda:0&#39;)<br><br>abs_difference = torch.abs((x + y) - add(x, y))<br>print(f&quot;Max absolute difference: {torch.max(abs_difference)}&quot;)<br>&gt;&gt; Max absolute difference: 0.0</pre><p>That’s it for this introduction, in following posts we’ll learn to implement more interesting kernels such as tiled matrix multiplication and see how to integrate Triton kernels in PyTorch models using autograd.</p><p>Until next time! 
👋</p><h3>References and Useful Resources</h3><ul><li><a href="https://en.wikipedia.org/wiki/GPT-4#:~:text=Sam%20Altman%20stated%20that%20the,51">Cost of training</a></li><li><a href="https://unsloth.ai/introducing">Unsloth kernels</a></li><li><a href="https://triton-lang.org/main/getting-started/tutorials/01-vector-add.html#sphx-glr-getting-started-tutorials-01-vector-add-py">Triton tutorial: Vector Addition</a></li><li><a href="https://horace.io/brrr_intro.html">Making Deep Learning Go Brrrr From First Principles</a></li></ul><hr><p><a href="https://medium.com/data-science-collective/learning-triton-one-kernel-at-a-time-vector-addition-5f57e9d2f3e1">Learning Triton One Kernel At a Time: Vector Addition</a> was originally published in <a href="https://medium.com/data-science-collective">Data Science Collective</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Rainbow: The Colorful Evolution of Deep Q-Networks]]></title>
            <link>https://medium.com/data-science/rainbow-the-colorful-evolution-of-deep-q-networks-37e662ab99b2?source=rss-27fba63b402e------2</link>
            <guid isPermaLink="false">https://medium.com/p/37e662ab99b2</guid>
            <category><![CDATA[reinforcement-learning]]></category>
            <category><![CDATA[dqn]]></category>
            <category><![CDATA[jax]]></category>
            <category><![CDATA[deep-dives]]></category>
            <category><![CDATA[deep-learning]]></category>
            <dc:creator><![CDATA[Ryan Pégoud]]></dc:creator>
            <pubDate>Fri, 12 Jul 2024 21:22:16 GMT</pubDate>
            <atom:updated>2024-07-13T18:04:34.566Z</atom:updated>
            <content:encoded><![CDATA[<h4>Everything you need to assemble the DQN Megazord in JAX.</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*X6V_SKO4X1mk7rKH8WaRQw.png" /><figcaption>“The Rainbow Megazord”, Dall-E 3</figcaption></figure><p>In 2013, the introduction of Deep Q-Networks (DQN) by <em>Mnih et al.</em>[1]<em> </em>marked the first breakthrough in Deep Reinforcement Learning, surpassing expert human players in three Atari games. Over the years, several variants of DQN were published, each improving on specific weaknesses of the original algorithm.</p><p>In 2017, <em>Hessel et al.</em>[2]<em> </em>made the best out of the DQN palette by combining 6 of its powerful variants, crafting what could be called the DQN Megazord: Rainbow.</p><p>In this article, we’ll break down the individual components that make up Rainbow, while reviewing their JAX implementations in the <a href="https://github.com/EdanToledo/Stoix"><strong>Stoix library.</strong></a></p><h3>DQN</h3><p>The fundamental building block of Rainbow is DQN, an extension of Q-learning using a neural network with parameters <strong>θ</strong> to approximate the Q-function (i.e. action-value function). In particular, DQN uses convolutional layers to extract features from images and a linear layer to produce a scalar estimate of the Q-value.</p><p>During training, the network parameterized by <strong>θ</strong>, referred to as the <em>“online network”</em> is used to select actions while the <em>“target network”</em> parameterized by <strong>θ-</strong> is a delayed copy of the online network used to provide stable targets. 
This way, the targets are not dependent on the parameters being updated.<br>Additionally, DQN uses a replay buffer <strong><em>D</em></strong> to sample past transitions (observation, action, reward, next observation, and done flag tuples) to train on at fixed intervals.</p><p>At each iteration <strong><em>i</em></strong>, DQN samples a transition <strong><em>j </em></strong>and takes a gradient step on the following loss:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*zgDgFjtgvGsGUb1kYyaSzA.png" /><figcaption>DQN loss function, all images are made by the author, unless specified otherwise</figcaption></figure><p>This loss aims at minimizing the expectation of the squared temporal-difference (TD) error.</p><p>Note that DQN is an <strong>off-policy</strong> algorithm because it learns the optimal policy defined by the <strong>maximum Q-value</strong> term while following a different behavior policy, such as an epsilon-greedy policy.</p><p>Here’s the DQN algorithm in detail:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*8uRBmpYM16wcZEvqgZCQFg.png" /><figcaption>DQN algorithm</figcaption></figure><h4>DQN in practice</h4><p>As mentioned above, we’ll reference code snippets from the Stoix library to illustrate the core parts of DQN and Rainbow <em>(some of the code was slightly edited or commented for pedagogical purposes)</em>.</p><p>Let’s start with the neural network: Stoix lets us break down our model architecture into a pre-processor and a post-processor, referred to as <strong>torso</strong> and <strong>head</strong> respectively.
In the case of DQN, the torso would be a multi-layer perceptron (MLP) or convolutional neural network (CNN) and the head an epsilon greedy policy, both implemented as <a href="https://flax.readthedocs.io/en/latest/index.html"><strong>Flax </strong></a>modules:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/7b6c514c0bc0bfd45d845f5527f78421/href">https://medium.com/media/7b6c514c0bc0bfd45d845f5527f78421/href</a></iframe><p>Additionally, DQN uses the following loss (<em>note that Stoix follows the </em><a href="https://github.com/google-deepmind/rlax"><strong><em>Rlax</em></strong></a><strong><em> </em></strong><em>naming conventions, therefore tm1 is equivalent to timestep t in the above equations, while t refers to timestep t+1</em>):</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/1be20d13144ef4888865ff6d20fa80e8/href">https://medium.com/media/1be20d13144ef4888865ff6d20fa80e8/href</a></iframe><h4>The Rainbow blueprint</h4><p>Now that we have laid the foundations for DQN, we’ll review each part of the algorithm in more detail, while identifying potential weaknesses and how they are addressed by Rainbow.<br>In particular, we’ll cover:</p><ul><li>Double DQN and the overestimation bias</li><li>Dueling DQN and the state-value / advantage prediction</li><li>Distributional DQN and the return distribution</li><li>Multi-step learning</li><li>Noisy DQN and flexible exploration strategies</li><li>Prioritized Experience Replay and learning potential</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/742/1*nP0nGY7dtgM0zKbr5HT-HA.png" /><figcaption>The Rainbow Blueprint, Dall-E 3</figcaption></figure><h3>Double DQN</h3><ul><li><strong>Source:</strong> <a href="http://arxiv.org/abs/1509.06461"><em>Deep Reinforcement Learning with Double Q-learning</em></a><em> </em>[3]</li><li><strong>Improvement:</strong> Reduced overestimation bias</li></ul><h4>The 
overestimation bias</h4><p>One issue with the loss function used in vanilla DQN arises from the Q-target. Remember that we define the target as:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*s1dH__rS9nRQaypJ4061kg.png" /><figcaption>Objective in the DQN loss</figcaption></figure><p>This objective may lead to an <strong>overestimation bias</strong>. Indeed, as DQN uses bootstrapping (learning estimates from estimates), the max term may select overestimated values to update the Q-function, leading to overestimated Q-values.</p><p>As an example, consider the following figure:</p><ul><li>The Q-values predicted by the network are represented in blue.</li><li>The true Q-values are represented in purple.</li><li>The gap between the predictions and true values is represented by red arrows.</li></ul><p>In this case, action 0 has the highest predicted Q-value because of a large prediction error. This value will therefore be used to construct the target. <br>However, the action with the highest true value is action 2. 
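</p><p>We can reproduce this effect numerically: with unbiased noise on every estimate, taking a max over noisy Q-values is biased upward, whereas selecting the action with one independent set of estimates and evaluating it with another (the decoupling described next) is not. A small NumPy sketch, with purely illustrative values:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
true_q = np.ones(4)          # four equally good actions, true value 1.0
n = 100_000                  # number of simulated updates
# Two independent, unbiased noisy estimators of the same Q-values.
online = true_q + rng.normal(0.0, 0.5, size=(n, 4))
target = true_q + rng.normal(0.0, 0.5, size=(n, 4))

vanilla = target.max(axis=1).mean()           # max over noisy estimates
a_star = online.argmax(axis=1)                # select with one network...
double = target[np.arange(n), a_star].mean()  # ...evaluate with the other

print(vanilla)  # ~1.51: overestimates the true value of 1.0
print(double)   # ~1.00: independent noise cancels out on average
```

<p>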
This illustration shows how the max term in the target favors <strong>large positive estimation errors</strong>, inducing an overestimation bias.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ifcuiTXwna1NYAY6owLx8A.png" /><figcaption>Illustration of the overestimation bias.</figcaption></figure><h4>Decoupling action selection and evaluation</h4><p>To solve this problem, <em>Hasselt et al.</em> (2015)[3] propose a new target where the action is selected by the online network, while its value is estimated by the target network:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*bbFmTqWiHEB6GUJYl-cccA.png" /><figcaption>The Double DQN target</figcaption></figure><p>By decoupling action selection and evaluation, the estimation bias is significantly reduced, leading to better value estimates and improved performance.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*bXIdPN_xA4oz6unCyVsS0Q.png" /><figcaption>Double DQN provides stable and accurate value estimates, leading to improved performance. Source: Hasselt et al. 
(2015), Figure 3</figcaption></figure><h4>Double DQN in practice</h4><p>As expected, implementing Double DQN only requires us to modify the loss function:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/f214383ec1cd89468af2ec6fc067ca39/href">https://medium.com/media/f214383ec1cd89468af2ec6fc067ca39/href</a></iframe><h3>Dueling DQN</h3><ul><li><strong>Source:</strong> <a href="http://arxiv.org/abs/1511.06581"><em>Dueling Network Architectures for Deep Reinforcement Learning</em></a></li><li><strong>Improvement:</strong> Separation of the value and advantage computation</li></ul><h4>State value, Q-value, and advantage</h4><p>In RL, we use several functions to estimate the value of a given state, action, or sequence of actions from a given state:</p><ul><li><strong>State-value V(s): </strong>The state value corresponds to the expected return when starting in a given state <strong>s </strong>and following a policy <strong>π </strong>thereafter.</li><li><strong>Q-value Q(s, a): </strong>Similarly, the Q-value corresponds to the expected return when starting in a given state <strong>s</strong>, taking action<strong> a, </strong>and following a policy <strong>π </strong>thereafter.</li><li><strong>Advantage A(s, a): </strong>The advantage is defined as the difference between the Q-value and the state-value in a given state <strong>s </strong>for an action <strong>a</strong>. 
It represents the inherent value of action <strong>a </strong>in the current state.</li></ul><p>The following figure attempts to represent the differences between these value functions on a backup diagram (<em>note that the state value is weighted by the probability of taking each action under policy </em><strong>π</strong>).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*eOAbQBykllUwckuo2FLa2g.png" /><figcaption>Visualization of the state value (in purple), state-action value (Q-function, in blue), and the advantage (in pink) on a backup diagram.</figcaption></figure><p>Usually, DQN estimates the Q-value directly, using a feed-forward neural network. This implies that DQN has to learn the Q-values for each action in each state independently.</p><h4>The dueling architecture</h4><p>Introduced by <em>Wang et al</em>.[4] in 2016, Dueling DQN uses a neural network with two separate streams of computation:</p><ul><li>The <strong>state value stream </strong>predicts the scalar value of a given state.</li><li>The <strong>advantage stream </strong>predicts the advantage of each action for a given state.</li></ul><p>This decoupling enables the <strong>independent estimation</strong> of the state value and advantages, which has several benefits. For instance, the network can learn state values without having to update the action values regularly. Additionally, it can better generalize to unseen actions in familiar states.<br>These improvements lead to more stable and faster convergence, especially in environments with many similar-valued actions.</p><p>In practice, a dueling network uses a <strong>common representation </strong>(i.e. a shared linear or convolutional layer) parameterized by parameters <strong>θ</strong> before splitting into two streams, consisting of linear layers with parameters <strong>α</strong> and <strong>β</strong> respectively.
The state value stream outputs a scalar value while the advantage stream returns a scalar value for each available action. <br>Adding the outputs of the two streams allows us to reconstruct the Q-value for each action as <strong>Q(s, a) = V(s) + A(s, a)</strong>.</p><p>An important detail is that the mean is usually subtracted from the advantages. Indeed, the advantages need to have<strong> zero mean</strong>; otherwise, the decomposition of Q into V and A would not be unique, making the problem ill-defined. With this constraint, <strong>V</strong> represents the value of the state while <strong>A</strong> represents how much better or worse each action is compared to the average action in that state.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*wqmeuL471NfDOQs3-BvaqQ.png" /><figcaption>Illustration of a dueling network</figcaption></figure><h4>Dueling Network in practice</h4><p>Here’s the Stoix implementation of a Q-network:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/91e7ff3ec86d5f10983adbe1061653ef/href">https://medium.com/media/91e7ff3ec86d5f10983adbe1061653ef/href</a></iframe><h3>Distributional DQN</h3><ul><li><strong>Source:</strong> <a href="http://arxiv.org/abs/1707.06887">A distributional perspective on Reinforcement Learning</a>[5]</li><li><strong>Improvement:</strong> Richer value estimates</li></ul><h4>The return distribution</h4><p>Most RL systems model the expectation of the return; however, a promising body of literature approaches RL from a distributional perspective.
In this setting, the goal becomes to model the <strong>return distribution</strong>, which allows us to consider other statistics than the mean.<br>In 2017, <em>Bellemare et al.</em>[5] published a distributional version of DQN called C51 predicting the return distribution for each action, reaching new state-of-the-art performances on the Atari benchmark.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/732/1*hequN6-JLnMm_mujmTQeBg.png" /><figcaption>Illustrated comparison between DQN and C51. Source [5&#39;]</figcaption></figure><p>Let’s take a step back and review the theory behind C51.<br>In traditional RL, we evaluate a policy using the <strong>Bellman Equation</strong>, which allows us to define the Q-function in a recursive form. Alternatively, we can use a distributional version of the Bellman equation, which accounts for randomness in the returns:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*8jPhPrsaWjGC6JO_V68E6A.png" /><figcaption>Standard and Distributional versions of the Bellman Equation</figcaption></figure><p>Here, <strong>ρ</strong> is the transition function.<br>The main difference between those functions is that <strong>Q</strong> <strong>is a numerical value</strong>, summing expectations over random variables. 
In contrast, <strong>Z is a random variable</strong>, summing the reward distribution and the discounted distribution of future returns.</p><p>The following illustration helps visualize how to derive <strong>Z </strong>from the distributional Bellman equation:</p><ul><li>Consider the distribution of returns <strong>Z</strong> at a given timestep and the transition operator <strong>Pπ.</strong> <strong>PπZ</strong> is the distribution of future returns <strong>Z(s’, a’)</strong>.</li><li>Multiplying this by the discount factor <strong>γ</strong> contracts the distribution towards 0 (as <strong>γ</strong> is less than 1).</li><li>Adding the reward distribution shifts the previous distribution by a set amount <em>(Note that the figure assumes a constant reward for simplicity. In practice, adding the reward distribution would shift but also modify the discounted return</em>).</li><li>Finally, the distribution is projected on a discrete support using an L2 projection operator <strong>Φ</strong>.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/519/1*wmgLoYfR6x28kLbFM6tc7g.png" /><figcaption>Illustration of the distributional Bellman equation. 
Source: [5]</figcaption></figure><p>This fixed support is a vector of <strong><em>N</em></strong> atoms separated by a constant gap within a set interval:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*-sJ8bVmI-rl4fEEew8vQPg.png" /><figcaption>Definition of the discrete support <strong>z</strong></figcaption></figure><p>At inference time, the Q-network returns an approximating distribution <strong>dt</strong> defined on this support with the probability mass <strong>pθ(st, at) </strong>on each atom <strong><em>i</em></strong> such that:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*lTM-SEKtX2Qzc5TN2y1FkQ.png" /><figcaption>Predicted return distribution</figcaption></figure><p>The goal is to update <strong>θ</strong> such that the distribution closely matches the true distribution of returns. To learn the probability masses, the target distribution is built using a <strong>distributional variant of Bellman’s optimality equation</strong>:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*4gLS1YpzdZpYieCR8r24DA.png" /><figcaption>Target return distribution</figcaption></figure><p>To be able to compare the distribution predicted by our neural network and the target distribution, we need to discretize the target distribution and project it on the same support <strong>z</strong>.</p><p>To this end, we use an L2 projection (<em>a projection onto </em><strong><em>z</em></strong><em> such that the difference between the original and projected distribution is minimized in terms of the L2 norm</em>):</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*mIMIRPJTZXgSsIlvbMUDqg.png" /><figcaption>L2 projection of the target distribution</figcaption></figure><p>Finally, we need to define a loss function that minimizes the difference between the two distributions. 
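</p><p>As an aside, the projection step <strong>Φ</strong> used to build that target can be sketched in a few lines of NumPy (a simplified illustration with a scalar reward and no terminal-state handling, not the Stoix implementation):</p>

```python
import numpy as np

def project_c51(reward, gamma, probs, atoms):
    # Project the shifted support (reward + gamma * z) back onto the fixed
    # atoms by splitting each probability mass between its two neighbours.
    v_min, v_max = atoms[0], atoms[-1]
    dz = atoms[1] - atoms[0]
    tz = np.clip(reward + gamma * atoms, v_min, v_max)
    b = (tz - v_min) / dz                         # fractional atom index
    lo, hi = np.floor(b).astype(int), np.ceil(b).astype(int)
    out = np.zeros_like(probs)
    np.add.at(out, lo, probs * (hi - b))          # mass to the lower neighbour
    np.add.at(out, hi, probs * (b - lo))          # mass to the upper neighbour
    same = lo == hi                               # b landed exactly on an atom
    np.add.at(out, lo[same], probs[same])
    return out

atoms = np.linspace(-10.0, 10.0, 51)              # the C51 support z
probs = np.full(51, 1 / 51)                       # a uniform predicted dist.
target = project_c51(reward=1.0, gamma=0.99, probs=probs, atoms=atoms)
print(target.sum())  # ~1.0: probability mass is conserved by the projection
```

<p>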
As we’re dealing with distributions, we can’t simply subtract the prediction from the target, as we did previously.</p><p>Instead, we minimize the Kullback-Leibler divergence between <strong>dt </strong>and <strong>d’t </strong>(in practice, this is implemented as a cross-entropy loss):</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*I9wouicVn0QBSf2l0o4Qlg.png" /><figcaption>KL divergence between the projected target and the predicted return distribution</figcaption></figure><p><em>For a more exhaustive description of Distributional DQN, you can refer to Massimiliano Tomassoli’s article[8] as well as Pascal Poupart’s video on the topic[11].</em></p><h4>C51 in practice</h4><p>The key components of C51 in Stoix are the Distributional head and the categorical loss, which uses double Q-learning by default as introduced previously. The choice of defining the C51 network as a head lets us use an MLP or a CNN torso interchangeably depending on the use case.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/743444ea903453114bdf41f2b6a6dfa3/href">https://medium.com/media/743444ea903453114bdf41f2b6a6dfa3/href</a></iframe><h3>Noisy DQN</h3><ul><li><strong>Source:</strong> <a href="http://arxiv.org/abs/1706.10295">Noisy Networks for Exploration</a>[6]</li><li><strong>Improvement:</strong> Learnable and state-dependent exploration mechanism</li></ul><h4>Noisy parameterization of Neural Networks</h4><p>Like many off-policy algorithms, DQN relies on an epsilon-greedy policy as its main exploration mechanism. Therefore, the algorithm will behave greedily with respect to the Q-values most of the time and select random actions with a predefined probability.</p><p><em>Fortunato et al.</em>[6] introduce NoisyNets as a more flexible alternative. NoisyNets are neural networks whose weights and biases are <strong>perturbed</strong> by a <strong>parametric function of Gaussian noise</strong>.
Similarly to an epsilon-greedy policy, such noise injects randomness in the agent’s action selection, thus encouraging exploration.</p><p>However, this noise is scaled and offset by <strong>learned parameters</strong>, allowing the level of noise to be adapted state-by-state. This way, the balance between exploration and exploitation is optimized <em>dynamically</em> during training. Eventually, the network may learn to ignore the noise, but will do so at <strong>different rates</strong> in <strong>different parts of the state space</strong>, leading to more flexible exploration.</p><p>A network parameterized by a vector of noisy parameters is defined as follows:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*t0NjLAh8LynCZEFBJsVp_A.png" /><figcaption>Neural Network parameterized by Noisy parameters</figcaption></figure><p>Therefore, a linear layer <strong>y = wx + b </strong>becomes:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*bv4upTlUK9X2Dke7UShW6w.png" /><figcaption>Noisy linear layer</figcaption></figure><p>For performance, the noise is generated at inference time using <strong>Factorized Gaussian Noise</strong>. For a linear layer with <strong>M </strong>inputs and <strong>N </strong>outputs, a noise matrix of shape (<strong>M x N</strong>) is generated as a combination of two noise vectors with size <strong>M</strong> and <strong>N</strong>. 
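</p><p>In code, the factorised scheme can be sketched as follows (a NumPy illustration; f(x) = sign(x)·√|x| is the scaling function used in the paper):</p>

```python
import numpy as np

def factorised_noise(m, n, rng):
    # f(x) = sign(x) * sqrt(|x|), applied element-wise to each factor vector.
    f = lambda x: np.sign(x) * np.sqrt(np.abs(x))
    eps_in = f(rng.normal(size=m))     # M samples for the input side
    eps_out = f(rng.normal(size=n))    # N samples for the output side
    # Outer product: an (M x N) weight-noise matrix from only M + N draws.
    # The output-side vector is reused as the bias noise.
    return np.outer(eps_in, eps_out), eps_out

rng = np.random.default_rng(0)
eps_w, eps_b = factorised_noise(512, 256, rng)
print(eps_w.shape, eps_b.shape)  # (512, 256) (256,)
```

<p>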
This method reduces the number of required random variables from <strong>M x N </strong>to <strong>M + N</strong>.<br>The noise matrix is defined as the outer product of the noise vectors, each scaled by a function <strong>f</strong>:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*zpNEw8WATncHMb9Iu_0CpA.png" /><figcaption>Noise generation using Factorised Gaussian Noise</figcaption></figure><h4>Improved exploration</h4><p>The improved exploration induced by noisy networks allows a wide range of algorithms, such as DQN, Dueling DQN and A3C, to benefit from improved performance with a reasonably small number of extra parameters.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*fd5h7tODJ4G6FUsJ6wQoGw.png" /><figcaption>NoisyNets improve the performance of several algorithms on the Atari benchmark. Source: [6]</figcaption></figure><h4>Noisy DQN in practice</h4><p>In Stoix, we implement a noisy layer as follows:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/abf0db362f8e8592d6bc71ab78ef1a61/href">https://medium.com/media/abf0db362f8e8592d6bc71ab78ef1a61/href</a></iframe><p><em>Note: All the linear layers in Rainbow are replaced with their noisy equivalent (see the </em><strong><em>“Assembling Rainbow”</em></strong><em> section for more details).</em></p><h3>Prioritized Experience Replay</h3><p><strong>Source:</strong> Prioritized Experience Replay[7]<br><strong>Improvement:</strong> Prioritization of experiences with higher learning potential</p><h4>Estimating the Learning Potential</h4><p>After taking an environment step, vanilla DQN uniformly samples a batch of experiences (also called <em>transitions</em>) from a replay buffer and performs a gradient descent step on this batch. Although this approach produces satisfying results, some specific experiences might be more valuable from a learning perspective than others.
Therefore, we could potentially speed up the training process by sampling such experiences more often.</p><p>This is precisely the idea explored in the Prioritized Experience Replay (PER) paper published by <em>Schaul et al.</em>[7] in 2016. However, the main question remains: how to approximate the <strong>expected learning potential</strong> of a transition?</p><blockquote>One idealized criterion would be the amount the RL agent can learn from a transition in its current state (expected learning progress). While this measure is not directly accessible, a reasonable proxy is the magnitude of a transition’s TD error δ, which indicates how ‘surprising’ or unexpected the transition is: specifically, how far the value is from its next-step bootstrap estimate (Andre et al., 1998).<br>Prioritized Experience Replay, Schaul et al. (2016)</blockquote><p>As a reminder, the TD error is defined as follows:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*w8IbPsTFpLgYfueR65Hnlw.png" /><figcaption>The temporal-difference error</figcaption></figure><p>This metric is a decent estimate of the learning potential of a specific transition, as a high TD error indicates a large difference between the predicted and actual outcomes, meaning that the agent would benefit from updating its beliefs.</p><p>However, it is worth noting that alternative prioritization metrics are still being studied. For instance, <em>Lahire et al.</em>[9] (2022) argue that the optimal sampling scheme is distributed according to the per-sample gradient norms:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*cymThHdJ312PnNgCBW8B4A.png" /><figcaption>Per-sample gradient norms</figcaption></figure><p>However, let’s continue with the TD error, as Rainbow uses this metric.</p><h4>Deriving Sampling Probabilities</h4><p>Once we have selected the prioritization criterion, we can derive the probabilities of sampling each transition from it. 
In Prioritized Experience Replay, two alternatives are showcased:</p><ul><li><strong>Proportional</strong>: Here, the priority of a transition is the absolute value of the associated TD error. A small positive constant is added to ensure that transitions are still revisited once their error reaches zero.</li><li><strong>Rank-based</strong>: In this mode, transitions are ranked in descending order according to their absolute TD error, and their priority is defined based on their rank. This option is expected to be more robust, as it is insensitive to outliers.</li></ul><p>The priorities are then raised to the power <strong>α</strong> and normalized to obtain sampling probabilities, with <strong>α</strong> a hyperparameter determining the degree of prioritization (<strong>α=0</strong> is the uniform case).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*YfjpCvW_6DCwY9U8OK1Xhg.png" /><figcaption>Prioritization modes and probability normalization</figcaption></figure><h4>Importance sampling and bias annealing</h4><p>In RL, the estimation of the expected return relies on the assumption that updates are sampled from the same distribution as the expectation (i.e., the uniform distribution). However, PER introduces bias, as we now sample experiences according to their TD error.</p><p>To rectify this bias, we use <strong>importance sampling</strong>, a statistical method used to <em>estimate the properties of a distribution while sampling from a different distribution</em>.
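To make the prioritization step concrete, here is a minimal NumPy sketch of the proportional variant (the function name and the alpha/eps defaults are illustrative, not taken from the Stoix code):

```python
import numpy as np

def sampling_probs(td_errors, alpha=0.6, eps=1e-6):
    # Proportional variant: priority p_i = |delta_i| + eps,
    # where eps keeps zero-error transitions visitable
    priorities = np.abs(td_errors) + eps
    # Raise to the power alpha (alpha=0 recovers uniform sampling)
    scaled = priorities ** alpha
    # Normalize to obtain sampling probabilities
    return scaled / scaled.sum()
```

In the actual implementation, this bookkeeping is handled by Flashbax inside the prioritised buffer (see the "Prioritized Experience Replay in practice" section).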
Importance sampling re-weights samples so that the estimates remain unbiased and accurate.<br>Typically, the correcting weights are defined as the ratio of the two probabilities:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*dTDmKcMX0F6SfZ5ubQHQ3A.png" /><figcaption>Importance sampling ratio</figcaption></figure><p>In this case, the target distribution is the uniform distribution, where every transition has a probability of being sampled equal to 1/<strong>N</strong>, with <strong>N </strong>being the size of the replay buffer. <br>Therefore, the importance sampling coefficient in the context of PER is defined by:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*OLuzHB3-WVErREoxCtXTaw.png" /><figcaption>Importance sampling weight used in PER</figcaption></figure><p>With <strong>β</strong> a coefficient adjusting the amount of bias correction (the bias is fully corrected for <strong>β=1</strong>). Finally, the weights are normalized for stability:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*oLNMy56p3ZuLlpXBzxwk0Q.png" /><figcaption>Normalization of the importance sampling weights</figcaption></figure><p>To summarize, here’s the full algorithm for Prioritized Experience Replay (the update and training steps are identical to DQN):</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*JK7-l89NqTEamkpVp_IeYA.png" /><figcaption>The Prioritized Experience Replay algorithm</figcaption></figure><h4>Increased convergence speed with PER</h4><p>The following plots highlight the performance benefits of PER. Indeed, the proportional and rank-based prioritization mechanisms enable DQN to reach the same baseline performances roughly twice as fast on the Atari benchmark.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*NEcTZgrj6Z91q86c7InGCw.png" /><figcaption>Normalized maximum and average scores (in terms of Double DQN performance) on 57 Atari games. 
Source:[7]</figcaption></figure><h4>Prioritized Experience Replay in practice</h4><p>Stoix seamlessly integrates the <a href="https://github.com/instadeepai/flashbax">Flashbax</a> library which provides a variety of replay buffers. Here are the relevant code snippets used to instantiate the replay buffer, compute the sampling probabilities from the TD error, and update the buffer’s priorities:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/06179b5ff16f3056c2599ecc457e1d8d/href">https://medium.com/media/06179b5ff16f3056c2599ecc457e1d8d/href</a></iframe><h3>Multi-step Learning</h3><ul><li><strong>Source:</strong> <a href="https://web.stanford.edu/class/psych209/Readings/SuttonBartoIPRLBook2ndEd.pdf">Reinforcement Learning: an Introduction, chapter 7</a></li><li><strong>Improvement:</strong> Enhanced reward signal and sample efficiency, reduced variance</li></ul><p>Multi-step learning is an improvement on traditional one-step temporal difference learning which allows us to consider the return over <strong>n</strong> steps when building our targets. For instance, instead of considering the reward at the next timestep, we’ll consider the n-step truncated rewards (see the below equation). This process has several advantages, among which:</p><ul><li><strong>Immediate feedback:</strong> considering a larger time horizon allows the agent to learn the value of state-action pairs much faster, especially in environments where rewards are delayed and specific actions might not pay out immediately.</li><li><strong>Sample efficiency:</strong> Each update in multi-step learning incorporates information from multiple time steps, making each sample more informative. This improves sample efficiency, meaning the agent can learn more from fewer experiences.</li><li><strong>Balancing Bias and Variance: </strong>Multi-step methods offer a trade-off between bias and variance. 
One-step methods have high bias but low variance, while multi-step methods have lower bias but higher variance. By tuning the number of steps, one can find a balance that works best for the given environment.</li></ul><p>The multi-step distributional loss used in Rainbow is defined as:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ko2ZaPBNHqQnbyyE0glidQ.png" /><figcaption>Multi-step target return distribution</figcaption></figure><p>In practice, using n-step returns implies a few adjustments to our code:</p><ul><li>We now sample trajectories of <strong>n</strong> experiences, instead of individual experiences</li><li>The reward is replaced with the n-step discounted returns</li><li>The done flag is set to True if any of the <strong>n </strong>done flags is True</li><li>The next state <strong>s(t+1)</strong> is replaced by the last observation of the trajectory <strong>s(t+n)</strong></li></ul><h4>Multi-Step learning in practice</h4><p>Finally, we can reuse the categorical loss function used in C51 with these updated inputs:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/f667c9dc3fbefd3824f58ccb43677608/href">https://medium.com/media/f667c9dc3fbefd3824f58ccb43677608/href</a></iframe><h3>Assembling Rainbow</h3><p>Congratulations on making it this far! We now have a better understanding of all the moving pieces that constitute Rainbow. Here’s a summary of the Rainbow agent:</p><ul><li><strong>Neural Network Architecture:</strong><br> —<strong> Torso:</strong> A convolutional neural network (CNN) or multi-layer perceptron (MLP) base that creates embeddings for the head network.<br> — <strong>Head:</strong> Combines Dueling DQN and C51. The value stream outputs the state value distribution over atoms, while the advantage stream outputs the advantage distribution over actions and atoms.
These streams are aggregated, and Q-values are computed as the weighted sum of atom values and their respective probabilities. An action is selected using an epsilon-greedy policy.<br> —<strong> Noisy Layers: </strong>All linear layers are replaced with their noisy equivalents to aid in exploration.</li><li><strong>Loss Function:</strong> Uses a distributional loss modeling the n-step returns, where targets are computed using Double Q-learning.</li><li><strong>Replay Buffer: </strong>Employs a prioritization mechanism based on the TD error to improve learning efficiency.</li></ul><p>Here’s the network used for the Rainbow head:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/548655defb33bdb46b76dccff1ec4c25/href">https://medium.com/media/548655defb33bdb46b76dccff1ec4c25/href</a></iframe><h4>Performances and ablations</h4><p>To conclude this article, let’s take a closer look at Rainbow’s performances on the Atari benchmark, as well as the ablation study.<br>The following figure compares Rainbow with the other DQN baselines we studied. The measured metric is the median human-normalized score. In other words, the median human performance on Atari games is set to 100%, which enables us to quickly spot algorithms achieving a human level.</p><p>Three of the DQN baselines reach this level after 200 million frames:</p><ul><li><strong>Distributional DQN</strong></li><li><strong>Dueling DQN</strong></li><li><strong>Prioritized Double DQN</strong></li></ul><p>Interestingly, Rainbow reaches the same level after only 44 million frames, making it <strong>roughly 5 times more sample efficient</strong> than the best baselines. At the end of training, it exceeds <strong>200%</strong> of the median human-normalized score.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/958/1*Whxt4H20QUrV_RBDRt7Qcw.png" /><figcaption>Median human-normalized performance across 57 Atari games. Each line represents a DQN baseline. 
Source: [2]</figcaption></figure><p>This second figure shows the ablation study, which measures the performance of Rainbow when each of its components is removed in turn. These results allow us to make several observations:</p><ul><li>The three most crucial components of Rainbow are the distributional head, the use of multi-step learning, and the prioritization of the replay buffer.</li><li>Noisy layers contribute significantly to the overall performance. Using standard layers with an epsilon-greedy policy doesn’t allow the agent to reach the 200% score in 200 million frames.</li><li>Despite achieving strong performances on their own, the dueling structure and double Q-learning only provide marginal improvements in the context of Rainbow.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/955/1*OhtiBMXnBj8T9Fibuntmuw.png" /><figcaption>Median human-normalized performance across 57 Atari games. Each line represents an ablation of Rainbow. Source: [2]</figcaption></figure><p>Thank you very much for reading this article. I hope it provided you with a comprehensive introduction to Rainbow and its components. I highly advise reading through the <a href="https://github.com/EdanToledo/Stoix/blob/main/stoix/systems/q_learning/ff_rainbow.py"><strong>Stoix implementation of Rainbow</strong></a> for a more detailed description of the training process and the Rainbow architecture.</p><p>Until next time 👋</p><h3>Bibliography</h3><p>[1] Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., &amp; Riedmiller, M. (2013). <a href="http://arxiv.org/abs/1312.5602"><strong><em>Playing Atari with Deep Reinforcement Learning</em></strong></a>, arXiv<br>[2] Hessel, M., Modayil, J., van Hasselt, H., Schaul, T., Ostrovski, G., Dabney, W., Horgan, D., Piot, B., Azar, M., &amp; Silver, D. (2017).
<a href="http://arxiv.org/abs/1710.02298"><strong><em>Rainbow: Combining Improvements in Deep Reinforcement Learning</em></strong></a>, arXiv.<br>[3] van Hasselt, H., Guez, A., &amp; Silver, D. (2015). <a href="http://arxiv.org/abs/1509.06461"><strong><em>Deep Reinforcement Learning with Double Q-learning</em></strong></a>, arXiv.<br>[4] Wang, Z., Schaul, T., Hessel, M., van Hasselt, H., Lanctot, M., &amp; de Freitas, N. (2016). <a href="http://arxiv.org/abs/1511.06581"><strong><em>Dueling Network Architectures for Deep Reinforcement Learning</em></strong></a>, arXiv.<br>[5] Bellemare, M. G., Dabney, W., &amp; Munos, R. (2017). <a href="http://arxiv.org/abs/1707.06887"><strong><em>A Distributional Perspective on Reinforcement Learning</em></strong></a>, arXiv.<br>[5&#39;] Dabney, W., Ostrovski, G., Silver, D., &amp; Munos, R. (2018). <a href="http://arxiv.org/abs/1806.06923"><strong><em>Implicit Quantile Networks for Distributional Reinforcement Learning</em></strong></a>, arXiv.<br>[6] Fortunato, M., Azar, M. G., Piot, B., Menick, J., Osband, I., Graves, A., Mnih, V., Munos, R., Hassabis, D., Pietquin, O., Blundell, C., &amp; Legg, S. (2019). <a href="http://arxiv.org/abs/1706.10295"><strong><em>Noisy Networks for Exploration</em></strong></a>, arXiv.<br>[7] Schaul, T., Quan, J., Antonoglou, I., &amp; Silver, D. (2016). <a href="http://arxiv.org/abs/1511.05952"><strong><em>Prioritized Experience Replay</em></strong></a>, arXiv.</p><h4>Additional resources</h4><p>[8] Massimiliano Tomassoli, <a href="https://mtomassoli.github.io/2017/12/08/distributional_rl/"><strong><em>Distributional RL: An intuitive explanation of Distributional RL</em></strong></a><br>[9] Lahire, T., Geist, M., &amp; Rachelson, E. (2022). <a href="http://arxiv.org/abs/2110.01528"><strong><em>Large Batch Experience Replay</em></strong></a>, arXiv.<br>[10] Sutton, R.
S., &amp; Barto, A. G. (1998). <strong><em>Reinforcement Learning: An Introduction</em></strong>.<br>[11] Pascal Poupart, <a href="https://youtu.be/r-Yk6-jagDU?si=9lqQHHNaQz8Uiclw"><strong><em>CS885 Module 5: Distributional RL</em></strong></a><strong><em>, </em></strong>YouTube</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=37e662ab99b2" width="1" height="1" alt=""><hr><p><a href="https://medium.com/data-science/rainbow-the-colorful-evolution-of-deep-q-networks-37e662ab99b2">Rainbow: The Colorful Evolution of Deep Q-Networks 🌈</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[A Practical Guide to Proximal Policy Optimization in JAX]]></title>
            <link>https://medium.com/data-science/breaking-down-state-of-the-art-ppo-implementations-in-jax-6f102c06c149?source=rss-27fba63b402e------2</link>
            <guid isPermaLink="false">https://medium.com/p/6f102c06c149</guid>
            <category><![CDATA[jax]]></category>
            <category><![CDATA[programming]]></category>
            <category><![CDATA[reinforcement-learning]]></category>
            <category><![CDATA[tips-and-tricks]]></category>
            <category><![CDATA[implementation]]></category>
            <dc:creator><![CDATA[Ryan Pégoud]]></dc:creator>
            <pubDate>Wed, 01 May 2024 05:32:45 GMT</pubDate>
            <atom:updated>2024-07-09T20:08:58.497Z</atom:updated>
            <content:encoded><![CDATA[<h4>All the tricks and details you wish you knew about PPO</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*9iiDaMnE92OHLdVX" /><figcaption>Photo by <a href="https://unsplash.com/@lorenzoherrera?utm_source=medium&amp;utm_medium=referral">Lorenzo Herrera</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure><p>Since its publication in a <a href="https://arxiv.org/pdf/1707.06347.pdf">2017 paper by OpenAI</a>, Proximal Policy Optimization (PPO) has been widely regarded as one of the state-of-the-art algorithms in Reinforcement Learning. Indeed, PPO has demonstrated remarkable performance across various tasks, from <a href="https://openai.com/research/openai-five">attaining superhuman performance in Dota 2</a> to solving a <a href="https://openai.com/research/solving-rubiks-cube">Rubik’s cube with a single robotic hand</a>, all while maintaining three main advantages: simplicity, stability, and sample efficiency.</p><p>However, implementing RL algorithms from scratch is notoriously difficult and error-prone, given the numerous error sources and implementation details to be aware of.</p><p>In this article, we’ll break down the clever tricks and programming concepts used in a popular implementation of PPO in JAX. Specifically, we’ll focus on the <a href="https://github.com/luchris429/purejaxrl/blob/main/purejaxrl/ppo.py">implementation featured in the PureJaxRL library</a>, developed by <a href="https://chrislu.page">Chris Lu</a>.</p><p><em>Disclaimer: Rather than diving too deep into theory, this article covers the practical implementation details and (numerous) tricks used in popular versions of PPO. Should you require any reminders about PPO’s theory, please refer to the “</em><strong><em>references</em></strong><em>” section at the end of this article. 
Additionally, all the code (minus the added comments) is copied directly from PureJaxRL for pedagogical purposes.</em></p><p><a href="https://github.com/luchris429/purejaxrl/tree/main">GitHub - luchris429/purejaxrl: Really Fast End-to-End Jax RL Implementations</a></p><h3><strong>Actor-Critic Architectures</strong></h3><p>Proximal Policy Optimization is categorized within the policy gradient family of algorithms, a subset of which includes actor-critic methods. The designation ‘actor-critic’ reflects the dual components of the model:</p><ul><li>The <strong>actor network</strong> creates a <strong>distribution over actions</strong> given the current state of the environment and returns an action sampled from this distribution. Here, the actor network comprises three dense layers separated by two activation layers (either ReLU or hyperbolic tangent) and a final categorical layer applying the <strong>softmax</strong> function to the computed distribution.</li><li>The <strong>critic network</strong> <strong>estimates the value function of the current state</strong>, in other words, how good it is to be in a given state. Its architecture is almost identical to the actor network, except for the final softmax layer. Indeed, the critic network doesn’t apply any activation function to the final dense layer outputs, as it performs a regression task.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*hzKJYLAu5ZeXaFE6YKU_Nw.jpeg" /><figcaption>Actor-critic architecture, as defined in PureJaxRL (illustration made by the author)</figcaption></figure><p>Additionally, this implementation pays particular attention to <strong>weight initialization</strong> in dense layers. Indeed, all dense layers are initialized by <strong>orthogonal matrices</strong> with specific coefficients. This initialization strategy has been shown to <strong>preserve the gradient norms</strong> (i.e.
scale) during forward passes and backpropagation, leading to <strong>smoother convergence</strong> and limiting the risks of vanishing or exploding gradients[1].</p><p>Orthogonal initialization is used in conjunction with specific scaling coefficients:</p><ul><li><strong>Square root of 2</strong>: Used for the first two dense layers of both networks, this factor aims to <strong>compensate for the variance reduction</strong> induced by ReLU activations (as inputs with negative values are set to 0). For the tanh activation, the Xavier initialization is a popular alternative[2].</li><li><strong>0.01: </strong>Used in the last dense layer of the actor network, this factor helps to <strong>minimize the initial differences in logit values</strong> before applying the softmax function. This will reduce the difference in action probabilities and thus <strong>encourage early exploration</strong>.</li><li><strong>1: </strong>As the critic network is performing a regression task, we do not scale the initial weights.</li></ul><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/6f187b7361f6450202f999af5aa17df0/href">https://medium.com/media/6f187b7361f6450202f999af5aa17df0/href</a></iframe><h3>Training Loop</h3><p>The training loop is divided into 3 main blocks that share similar coding patterns, taking advantage of Jax’s functionalities:</p><ol><li><strong>Trajectory collection:</strong> First, we’ll interact with the environment for a set number of steps and collect observations and rewards.</li><li><strong>Generalized Advantage Estimation (GAE):</strong> Then, we’ll approximate the expected return for each trajectory by computing the generalized advantage estimation.</li><li><strong>Update step: </strong>Finally, we’ll compute the gradient of the loss and update the network parameters via gradient descent.</li></ol><p>Before going through each block in detail, here’s a quick reminder about the jax.lax.scan<em> </em>function 
that will show up multiple times throughout the code:</p><h4>Jax.lax.scan</h4><p>A common programming pattern in JAX consists of defining a function that acts on a single sample and using jax.lax.scan<em> </em>to <strong>iteratively apply it to elements of a sequence</strong> or an array, while carrying along some state.<br>For instance, we’ll apply it to the step function to step our environment N consecutive times while carrying the new state of the environment through each iteration.</p><p>In pure Python, we could proceed as follows:</p><pre>trajectories = []<br><br>for step in range(n_steps):<br>  action = actor_network(obs)<br>  obs, state, reward, done, info = env.step(action, state)<br>  trajectories.append((obs, state, reward, done, info))</pre><p>However, we avoid writing such loops in JAX for performance reasons (as pure Python loops are incompatible with JIT compilation). The alternative is jax.lax.scan, which is equivalent to:</p><pre>def scan(f, init, xs, length=None):<br>  &quot;&quot;&quot;Example provided in the JAX documentation.&quot;&quot;&quot;<br>  if xs is None:<br>    xs = [None] * length<br><br>  carry = init<br>  ys = []<br>  for x in xs:<br>    # apply function f to current state<br>    # and element x<br>    carry, y = f(carry, x) <br>    ys.append(y)<br>  return carry, np.stack(ys)</pre><p>Using jax.lax.scan is more efficient than a Python loop because it allows the transformation to be optimized and executed as a single compiled operation rather than interpreting each loop iteration at runtime.</p><p>We can see that the scan function takes multiple arguments:</p><ul><li><strong>f:</strong> A function that is applied at each step.
It takes the current state and an element of xs (or a placeholder if xs is None) and returns the updated state and an output.</li><li><strong>init:</strong> The initial state that f will use in its first invocation.</li><li><strong>xs:</strong> A sequence of inputs that are iteratively processed by f. If xs is None, the function simulates a loop with length iterations using None as the input for each iteration.</li><li><strong>length:</strong> Specifies the number of iterations if xs is None, ensuring that the function can still operate without explicit inputs.</li></ul><p>Additionally, scan returns:</p><ul><li><strong>carry:</strong> The final state after all iterations.</li><li><strong>ys:</strong> An array of outputs corresponding to each step’s application of f, stacked for easy analysis or further processing.</li></ul><p>Finally, scan can be used in combination with vmap to scan a function over multiple dimensions in parallel. As we’ll see in the next section, this allows us to interact with several environments in parallel to collect trajectories rapidly.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*dUNkLv6GDm03xIM2J_HWPQ.jpeg" /><figcaption>Illustration of vmap, scan, and scan + vmap in the context of the step function (made by the author)</figcaption></figure><h3>1. Trajectory Collection</h3><p>As mentioned in the previous section, the trajectory collection block consists of a step function scanned across N iterations. This step function successively:</p><ul><li>Selects an action using the actor network</li><li>Steps the environment</li><li>Stores transition data in a transition tuple</li><li>Stores the model parameters, the environment state, the current observation, and rng keys in a runner_state tuple</li><li>Returns runner_state and transition</li></ul><p>Scanning this function returns the latest runner_state and traj_batch, an array of transition tuples. 
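To see the pattern end to end, here is a pure-Python miniature of trajectory collection. The scan helper mirrors the jax.lax.scan semantics shown earlier, while the one-dimensional environment and the hard-coded policy are hypothetical stand-ins for the actor network and env.step:

```python
import numpy as np

def scan(f, init, xs=None, length=None):
    # Pure-Python stand-in for jax.lax.scan
    if xs is None:
        xs = [None] * length
    carry, ys = init, []
    for x in xs:
        carry, y = f(carry, x)
        ys.append(y)
    return carry, np.stack(ys)

def step_fn(runner_state, _):
    # Hypothetical 1-D environment: the agent nudges its state toward 0
    env_state, t = runner_state
    action = -1.0 if env_state > 0 else 1.0      # stand-in "policy"
    new_state = env_state + action               # stand-in env.step
    reward = -abs(new_state)
    transition = np.array([new_state, reward])   # (obs, reward) record
    return (new_state, t + 1), transition

# Collect a 4-step trajectory while threading the environment state
runner_state, traj_batch = scan(step_fn, init=(2.0, 0), length=4)
```

The real implementation additionally vectorizes the step over a batch of environments with vmap, as described in the surrounding text.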
In practice, transitions are collected from multiple environments in parallel for efficiency as indicated by the use of jax.vmap(env.step, …)(for more details about vectorized environments and vmap, refer to my <a href="https://medium.com/towards-data-science/vectorize-and-parallelize-rl-environments-with-jax-q-learning-at-the-speed-of-light-49d07373adf5">previous article</a>).</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/ee2875a6bdb941f399155c6c0904c4c0/href">https://medium.com/media/ee2875a6bdb941f399155c6c0904c4c0/href</a></iframe><h3>2. Generalized Advantage Estimation</h3><p>After collecting trajectories, we need to compute the <strong>advantage function, </strong>a crucial component of PPO’s loss function. The advantage function measures how much better a specific action is compared to the average action in a given state:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/218/0*8JKsD0mv7SRmWbW3.png" /></figure><p>Where <strong>Gt </strong>is the return at time <strong><em>t</em></strong><em> </em>and <strong>V(St) is </strong>the value of state <strong><em>s</em></strong><em> </em>at time <strong><em>t</em></strong><em>.</em></p><p>As the return is generally unknown, we have to approximate the advantage function. A popular solution is <strong>generalized advantage estimation</strong>[3], defined as follows:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/744/0*vtoc2sT_x8Vg-bTR.png" /></figure><p>With <strong>γ</strong> the discount factor, <strong>λ</strong> a parameter that controls the trade-off between bias and variance in the estimate, and<strong> <em>δt</em></strong><em> </em>the temporal difference error at time <strong><em>t</em></strong>:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/522/0*VYM1lRnvvZ-8dWEO.png" /></figure><p>As we can see, the value of the GAE at time <em>t </em>depends on the GAE at future timesteps. 
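Concretely, the backward recursion can be sketched in plain NumPy (a readable stand-in for the reversed jax.lax.scan used in the actual implementation; the done-masking term follows common practice):

```python
import numpy as np

def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    # delta_t = r_t + gamma * V(s_{t+1}) * (1 - done_t) - V(s_t)
    # GAE_t   = delta_t + gamma * lam * (1 - done_t) * GAE_{t+1}
    advantages = np.zeros_like(rewards)
    gae, next_value = 0.0, last_value
    for t in reversed(range(len(rewards))):
        mask = 1.0 - dones[t]
        delta = rewards[t] + gamma * next_value * mask - values[t]
        gae = delta + gamma * lam * mask * gae
        advantages[t] = gae
        next_value = values[t]
    # advantages + values recovers the return targets
    return advantages, advantages + values
```

Adding the advantages back to the values recovers the returns, matching the second output discussed in this section.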
Therefore, we compute it backward, starting from the end of a trajectory. For example, for a trajectory of 3 transitions, we would have:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/410/0*5DxYwJmbHIqtkIfy.png" /></figure><p>Which is equivalent to the following recursive form:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/423/0*IDa--MnSBjNWuwDR.png" /></figure><p>Once again, we use jax.lax.scan on the trajectory batch (this time in reverse order) to iteratively compute the GAE.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/36dd1edacd3ecf53a1d203f46999f828/href">https://medium.com/media/36dd1edacd3ecf53a1d203f46999f828/href</a></iframe><p>Note that the function returns advantages + traj_batch.value as a second output, which is equivalent to the return according to the first equation of this section.</p><h3>3. Update step</h3><p>The final block of the training loop defines the loss function, computes its gradient, and performs gradient descent on minibatches. 
Similarly to previous sections, the update step is an arrangement of several functions in a hierarchical order:</p><pre>def _update_epoch(update_state, unused):<br>  &quot;&quot;&quot;<br>  Scans update_minibatch over shuffled and permuted <br>  mini batches created from the trajectory batch.<br>  &quot;&quot;&quot;<br><br>  def _update_minbatch(train_state, batch_info):<br>    &quot;&quot;&quot;<br>    Wraps loss_fn and computes its gradient over the <br>    trajectory batch before updating the network parameters.<br>    &quot;&quot;&quot;<br>    ...<br>    <br>    def _loss_fn(params, traj_batch, gae, targets):<br>      &quot;&quot;&quot;<br>      Defines the PPO loss and computes its value.<br>      &quot;&quot;&quot;<br>      ...</pre><p>Let’s break them down one by one, starting from the innermost function of the update step.</p><h4>3.1 Loss function</h4><p>This function aims to define and compute the PPO loss, originally defined as:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*zYp6PTEfKoMjU2RT.png" /></figure><p>Where:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*qaiCVqnYPO2nue09.png" /></figure><p>However, the PureJaxRL implementation features some tricks and differences compared to the original PPO paper[4]:</p><ul><li>The paper defines the PPO loss in the context of gradient ascent whereas the implementation performs gradient descent. Therefore, the sign of each loss component is reversed.</li><li>The value function term is modified to include an additional clipped term. 
This could be seen as a way to make the value function updates more conservative (as for the clipped surrogate objective):</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*YQZP3hq6jiAxK_Ta.png" /></figure><ul><li>The GAE is standardized.</li></ul><p>Here’s the complete loss function:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/46f2d043c070808a7da5d97342afe905/href">https://medium.com/media/46f2d043c070808a7da5d97342afe905/href</a></iframe><h4>3.2 Update Minibatch</h4><p>The update_minibatch function is essentially a wrapper around loss_fn used to compute its gradient over the trajectory batch and update the model parameters stored in train_state.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/a6798898fe3ef8800e8354098b03aaa8/href">https://medium.com/media/a6798898fe3ef8800e8354098b03aaa8/href</a></iframe><h4>3.3 Update Epoch</h4><p>Finally, update_epoch wraps update_minibatch and applies it on minibatches. Once again, jax.lax.scan is used to apply the update function on all minibatches iteratively.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/725498766e43cfe26f21b1961bb49d01/href">https://medium.com/media/725498766e43cfe26f21b1961bb49d01/href</a></iframe><h3>Conclusion</h3><p>From there, we can wrap all of the previous functions in an update_step function and use scan one last time for N steps to complete the training loop.</p><p>A global view of the training loop would look like this:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/8408eb84bc2b05ecd9c2ae8ebebc8c4e/href">https://medium.com/media/8408eb84bc2b05ecd9c2ae8ebebc8c4e/href</a></iframe><p>We can now run a fully compiled training loop using jax.jit(train(rng)) or even train multiple agents in parallel using jax.vmap(train(rng)).</p><p>There we have it! 
We covered the essential building blocks of the PPO training loop as well as common programming patterns in JAX.</p><p>To go further, I highly recommend reading the <a href="https://github.com/luchris429/purejaxrl/blob/main/purejaxrl/ppo.py">full training script</a> in detail and running example notebooks on the PureJaxRL repository.</p><p><a href="https://github.com/luchris429/purejaxrl">GitHub - luchris429/purejaxrl: Really Fast End-to-End Jax RL Implementations</a></p><p>Thank you very much for your support, until next time 👋</p><h4>References:</h4><p><a href="https://github.com/luchris429/purejaxrl/blob/main/purejaxrl/ppo.py">Full training script</a>, PureJaxRL, Chris Lu, 2023</p><p>[1] <a href="https://smerity.com/articles/2016/orthogonal_init.html"><strong><em>Explaining and illustrating orthogonal initialization for recurrent neural networks</em></strong></a>, Smerity, 2016</p><p>[2] <a href="https://www.deeplearning.ai/ai-notes/initialization/index.html"><strong><em>Initializing neural networks</em></strong></a>, DeepLearning.ai</p><p>[3] <a href="https://towardsdatascience.com/generalized-advantage-estimation-in-reinforcement-learning-bf4a957f7975"><strong><em>Generalized Advantage Estimation in Reinforcement Learning</em></strong></a>, Siwei Causevic, Towards Data Science, 2023</p><p>[4] <a href="https://arxiv.org/pdf/1707.06347"><strong><em>Proximal Policy Optimization Algorithms</em></strong></a>, Schulman et Al., OpenAI, 2017</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=6f102c06c149" width="1" height="1" alt=""><hr><p><a href="https://medium.com/data-science/breaking-down-state-of-the-art-ppo-implementations-in-jax-6f102c06c149">A Practical Guide to Proximal Policy Optimization in JAX</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[A Gentle Introduction to Deep Reinforcement Learning in JAX]]></title>
            <link>https://medium.com/data-science/a-gentle-introduction-to-deep-reinforcement-learning-in-jax-c1e45a179b92?source=rss-27fba63b402e------2</link>
            <guid isPermaLink="false">https://medium.com/p/c1e45a179b92</guid>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[jax]]></category>
            <category><![CDATA[reinforcement-learning]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[getting-started]]></category>
            <dc:creator><![CDATA[Ryan Pégoud]]></dc:creator>
            <pubDate>Tue, 21 Nov 2023 17:51:55 GMT</pubDate>
            <atom:updated>2023-11-21T17:51:55.079Z</atom:updated>
            <content:encoded><![CDATA[<h4>Solving the CartPole environment with DQN in under a second</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*vGRKNV5tJU45e09A" /><figcaption>Photo by <a href="https://unsplash.com/@thomasdes?utm_source=medium&amp;utm_medium=referral">Thomas Despeyroux</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure><p>Recent progress in Reinforcement Learning (RL), such as Waymo’s autonomous taxis or DeepMind’s superhuman chess-playing agents, complement <strong>classical RL</strong> with <strong>Deep Learning </strong>components such as <strong>Neural Networks</strong> and <strong>Gradient Optimization</strong> methods.</p><p>Building on the foundations and coding principles introduced in one of my previous stories, we’ll discover and learn to implement <strong>Deep Q-Networks</strong> (<strong>DQN</strong>) and <strong>replay buffers</strong> to solve OpenAI’s <strong>CartPole </strong>environment. 
All of that <strong>in under a second</strong> using JAX!</p><p>For an introduction to <strong>JAX</strong>, <strong>vectorized environments</strong>, and <strong>Q-learning</strong>, please refer to the content of this story:</p><p><a href="https://towardsdatascience.com/vectorize-and-parallelize-rl-environments-with-jax-q-learning-at-the-speed-of-light-49d07373adf5">Vectorize and Parallelize RL Environments with JAX: Q-learning at the Speed of Light⚡</a></p><p>Our framework of choice for deep learning will be DeepMind’s <strong>Haiku </strong>library, which I recently introduced in the context of Transformers:</p><p><a href="https://towardsdatascience.com/implementing-a-transformer-encoder-from-scratch-with-jax-and-haiku-791d31b4f0dd">Implementing a Transformer Encoder from Scratch with JAX and Haiku 🤖</a></p><p>This article will cover the following sections:</p><ul><li><strong>Why </strong>do we need Deep RL?</li><li><strong>Deep Q-Networks, </strong><em>theory and practice</em></li><li><strong>Replay Buffers</strong></li><li>Translating the <strong>CartPole </strong>environment to <strong>JAX</strong></li><li>The <strong>JAX </strong>way to write <strong>efficient training loops</strong></li></ul><p><em>As always, all the code presented in this article is available on GitHub:</em></p><p><a href="https://github.com/RPegoud/jym">GitHub - RPegoud/jym: JAX implementation of RL algorithms and vectorized environments</a></p><h3><strong>Why </strong>do we need Deep RL?</h3><p>In previous articles, we introduced <a href="https://medium.com/towards-data-science/temporal-difference-learning-and-the-importance-of-exploration-an-illustrated-guide-5f9c3371413a">Temporal Difference Learning</a> algorithms and in particular <a href="https://towardsdatascience.com/vectorize-and-parallelize-rl-environments-with-jax-q-learning-at-the-speed-of-light-49d07373adf5">Q-learning</a>.</p><p>Simply put, Q-learning is an <strong>off-policy</strong> algorithm <em>(the target policy is not 
the policy used for decision-making)</em> maintaining and updating a<strong> Q-table</strong>, an explicit <strong>mapping </strong>of <strong>states </strong>to corresponding <strong>action values</strong>.</p><p>While Q-learning is a practical solution for environments with discrete action spaces and restricted observation spaces, it struggles to scale well to more complex environments. Indeed, creating a Q-table requires <strong>defining </strong>the <strong>action</strong> and <strong>observation spaces</strong>.</p><p>Consider the example of <strong>autonomous driving</strong>: the <strong>observation space</strong> is composed of an <em>infinity of potential configurations</em> derived from camera feeds and other sensory inputs. On the other hand, the <strong>action space</strong> includes a <em>wide spectrum of steering wheel positions</em> and varying levels of force applied to the brake and accelerator.</p><p>Even though we could theoretically discretize the action space, the sheer volume of possible states and actions leads to an <strong>impractical Q-table</strong> in <strong>real-world applications</strong>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*JhqyHUB3t7cZCNdS" /><figcaption>Photo by <a href="https://unsplash.com/@photophotostock?utm_source=medium&amp;utm_medium=referral">Kirill Tonkikh</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure><p>Finding optimal actions in large and complex state-action spaces thus requires <strong>powerful function approximation algorithms</strong>, which is precisely what<strong> Neural Networks</strong> are. In the case of Deep Reinforcement Learning, neural nets are used as a <strong>replacement for the Q-table</strong> and provide an efficient solution to the <em>curse of dimensionality </em>introduced by large state spaces. 
Furthermore, we do not need to explicitly define the observation space.</p><h3>Deep Q-Networks &amp; Replay Buffers</h3><p>DQN uses two types of neural networks in parallel, starting with the “<strong><em>online</em></strong>” network, which is used for <strong>Q-value prediction</strong> and <strong>decision-making</strong>. On the other hand, the “<strong><em>target</em></strong>” network is used to <strong>create stable Q-targets</strong> to assess the performance of the online net via the loss function.</p><p>Similarly to Q-learning, DQN agents are defined by two functions: act and update.</p><h4>Act</h4><p>The act function implements an epsilon-greedy policy with respect to Q-values, which are estimated by the online neural network. In other words, the agent selects the action corresponding to the <strong>maximum predicted Q-value</strong> for a given state, with a set probability of acting randomly.</p><p>You might remember that Q-learning updates its Q-table <strong>after <em>every </em>step</strong>; in Deep Learning, however, it is common practice to compute updates using <strong>gradient descent</strong> on a <strong>batch of inputs</strong>.</p><p>For this reason, DQN stores experiences (tuples containing state, action, reward, next_state, done_flag) in a <strong>replay buffer</strong>. 
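Before moving on, the epsilon-greedy selection described above can be sketched as follows (a hedged, illustrative sketch: the stateless apply_fn stands in for the Haiku-transformed model, and all names are assumptions rather than the article's exact code):

```python
import jax
import jax.numpy as jnp

def act(key, params, apply_fn, state, epsilon, n_actions):
    # Epsilon-greedy selection over predicted Q-values (illustrative sketch)
    explore_key, action_key = jax.random.split(key)
    q_values = apply_fn(params, None, state)           # shape: (n_actions,)
    greedy_action = jnp.argmax(q_values)
    random_action = jax.random.randint(action_key, (), 0, n_actions)
    # With probability epsilon, explore; otherwise exploit
    explore = jax.random.uniform(explore_key) > 1.0 - epsilon
    return jax.lax.select(explore, random_action, greedy_action)

# Demo with a stand-in linear "network": params is just a weight matrix
apply_fn = lambda params, rng, state: state @ params
params = jnp.eye(4, 2)                  # 4 state features -> 2 actions
state = jnp.array([1.0, 0.0, 0.0, 0.0])
action = act(jax.random.PRNGKey(0), params, apply_fn, state, epsilon=0.1, n_actions=2)
```

With epsilon set to 0, the selection is purely greedy.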
To train the network, we’ll sample a batch of experiences from this buffer instead of using only the last experience <em>(more details in the Replay Buffer section)</em>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*J36aRJCL3ocmA0ioFjrnBA.jpeg" /><figcaption>Visual representation of<strong> DQN’s action selection</strong> process (Made by the author)</figcaption></figure><p>Here’s a JAX implementation of the action-selection part of DQN:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/ad1820a30301651c4e329820bce2e4cf/href">https://medium.com/media/ad1820a30301651c4e329820bce2e4cf/href</a></iframe><p>The only subtlety of this snippet is that the model attribute doesn’t contain any internal parameters, as would usually be the case in frameworks such as PyTorch or TensorFlow.</p><p>Here, the model is a <strong>function </strong>representing a <strong>forward pass</strong> through our architecture, but the <strong><em>mutable </em>weights are stored externally </strong>and passed as <strong>arguments</strong>. This explains why we can use jit while passing the self argument as <strong>static</strong> <em>(the model being stateless, like the other class attributes)</em>.</p><h4>Update</h4><p>The update function is responsible for training the network. It computes a <strong>mean squared error</strong> (MSE) loss based on the <strong>temporal-difference</strong> (TD) <strong>error</strong>:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/802/0*hFfMBLrnGMUsdSsP.png" /><figcaption>Mean Squared Error used in DQN</figcaption></figure><p>In this loss function, <strong><em>θ</em></strong> denotes the <strong>parameters of the online network</strong>, and <strong><em>θ</em>−</strong> represents the <strong>parameters of the target network</strong>. 
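For a single transition, this squared TD error can be sketched as follows (an illustrative sketch, not the article's exact code; gamma is the discount factor and all names are assumptions):

```python
import jax
import jax.numpy as jnp

def td_loss(q_online, q_target_next, action, reward, done, gamma=0.99):
    # Squared TD error for a single experience (illustrative sketch).
    # q_online:      Q-values of the online network for `state`
    # q_target_next: Q-values of the target network for `next_state`
    # The (1 - done) factor removes the bootstrap term at terminal states
    target = reward + gamma * (1.0 - done) * jnp.max(q_target_next)
    # The target is treated as a constant during backpropagation
    td_error = jax.lax.stop_gradient(target) - q_online[action]
    return td_error ** 2

loss = td_loss(
    q_online=jnp.array([0.0, 1.0]),
    q_target_next=jnp.array([2.0, 0.0]),
    action=1, reward=1.0, done=0.0, gamma=0.5,
)
```

Averaging this quantity over a batch of experiences (e.g. with vmap) yields the MSE loss above.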
The parameters of the target network are set to the online network’s parameters every <em>N</em> steps<em>, </em>similar to a <em>checkpoint </em>(<em>N is a hyperparameter).</em></p><p>This separation of parameters (with <em>θ</em> for the current Q-values and <em>θ</em>− for the target Q-values) is crucial to stabilize training.</p><p>Using the same parameters for both would be similar to aiming at a moving target, as <strong>updates to the network</strong> would <strong>immediately shift the target values</strong>. By <strong>periodically updating</strong> <strong><em>θ</em>−</strong> (i.e. freezing these parameters for a set number of steps), we ensure <strong>stable Q-targets</strong> while the online network continues to learn.</p><p>Finally, the <em>(1-done)</em> term <strong>adjusts the target</strong> for <strong>terminal states</strong>. Indeed, when an episode ends (i.e. ‘done’ is equal to 1), there is no next state. Therefore, the Q-value for the next state is set to 0.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*H7K68mSt_9EsvWX5ZOvY7Q.jpeg" /><figcaption>Visual representation of<strong> DQN’s parameter update </strong>process (Made by the author)</figcaption></figure><p>Implementing the update function for DQN is slightly more complex; let’s break it down:</p><ul><li>First, the _loss_fn function implements the squared error described previously for a <strong>single experience</strong>.</li><li>Then, _batch_loss_fn acts as a wrapper for _loss_fn and decorates it with vmap, applying the loss function to a <strong>batch of experiences</strong>. We then return the average error for this batch.</li><li>Finally, update acts as a final layer to our loss function, computing its <strong>gradient </strong>with respect to the online network parameters, the target network parameters, and a batch of experiences. 
We then use <strong>Optax </strong><em>(a JAX library commonly used for optimization)</em> to perform an optimizer step and update the online parameters.</li></ul><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/790a31d33db9a5b9411537a0f12ee2d2/href">https://medium.com/media/790a31d33db9a5b9411537a0f12ee2d2/href</a></iframe><p>Notice that, similarly to the replay buffer, the model and optimizer are <strong>pure functions</strong> modifying an <strong>external state</strong>. The following line serves as a good illustration of this principle:</p><pre>updates, optimizer_state = optimizer.update(grads, optimizer_state)</pre><p>This also explains why we can use a single model for both the online and target networks, as the parameters are stored and updated externally.</p><pre># target network predictions<br>self.model.apply(target_net_params, None, state)<br># online network predictions<br>self.model.apply(online_net_params, None, state)</pre><p>For context, the model we use in this article is a <em>multi-layer perceptron</em> defined as follows:</p><pre>N_ACTIONS = 2<br>NEURONS_PER_LAYER = [64, 64, 64, N_ACTIONS]<br>online_key, target_key = vmap(random.PRNGKey)(jnp.arange(2) + RANDOM_SEED)<br><br>@hk.transform<br>def model(x):<br>    # simple multi-layer perceptron<br>    mlp = hk.nets.MLP(output_sizes=NEURONS_PER_LAYER)<br>    return mlp(x)<br><br>online_net_params = model.init(online_key, jnp.zeros((STATE_SHAPE,)))<br>target_net_params = model.init(target_key, jnp.zeros((STATE_SHAPE,)))<br><br>prediction = model.apply(online_net_params, None, state)</pre><h4>Replay Buffer</h4><p>Now let us take a step back and look closer at replay buffers. They are widely used in reinforcement learning for a variety of reasons:</p><ul><li><strong>Generalization</strong>: By sampling from the replay buffer, we break the correlation between consecutive experiences by mixing up their order. 
This way, we avoid overfitting to specific sequences of experiences.</li><li><strong>Diversity</strong>: As the sampling is not limited to recent experiences, we generally observe a lower variance in updates and prevent overfitting to the latest experiences.</li><li><strong>Increased sample efficiency</strong>: Each experience can be sampled multiple times from the buffer, enabling the model to learn more from individual experiences.</li></ul><p>Finally, we can use several sampling schemes for our replay buffer:</p><ul><li><strong>Uniform sampling: </strong>Experiences are sampled uniformly at random. This type of sampling is straightforward to implement and allows the model to learn from experiences independently of the timestep at which they were collected.</li><li><strong>Prioritized sampling: </strong>This category includes different algorithms such as <strong>Prioritized Experience Replay </strong>(“PER”, <a href="https://arxiv.org/abs/1511.05952"><em>Schaul et al. 2015</em></a><em>) </em>or <strong>Gradient Experience Replay </strong>(“GER”, <a href="https://arxiv.org/abs/2110.01528"><em>Lahire et al., 2022</em></a><em>). </em>These methods attempt to prioritize the selection of experiences according to some metric related to their “<em>learning potential” </em>(the amplitude of the TD error for PER and the norm of the experience’s gradient for GER).</li></ul><p>For the sake of simplicity, we’ll implement a uniform replay buffer in this article. However, I plan to cover prioritized sampling extensively in the future.</p><p>As promised, the uniform replay buffer is quite easy to implement; however, there are a few complexities related to the use of JAX and functional programming. As always, we have to work with<strong> pure functions</strong> that are <strong>devoid of side effects</strong>. 
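For instance, adding an experience must return a new buffer state rather than mutate one in place. A minimal sketch of this functional style (field names and signatures are illustrative, not the article's exact code):

```python
import jax.numpy as jnp

def buffer_add(buffer_state, experience, idx, buffer_size):
    # Pure insertion: returns a NEW buffer state instead of mutating one.
    # `experience` is a tuple matching the order of the buffer's fields.
    idx = idx % buffer_size  # overwrite the oldest entries once full
    # `.at[...].set(...)` is JAX's out-of-place (functional) array update
    return {
        field: array.at[idx].set(value)
        for (field, array), value in zip(buffer_state.items(), experience)
    }

# Demo: a one-field buffer of capacity 4; index 5 wraps around to 1,
# and the original buffer_state is left untouched
buffer_state = {"rewards": jnp.zeros((4,))}
new_state = buffer_add(buffer_state, (1.0,), idx=5, buffer_size=4)
```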
In other words, we are not allowed to define the buffer as a class instance with a variable internal state.</p><p>Instead, we initialize a buffer_state dictionary that maps keys to empty arrays with predefined shapes, as JAX requires constant-sized arrays when jit-compiling code to XLA.</p><pre>buffer_state = {<br>    &quot;states&quot;: jnp.empty((BUFFER_SIZE, STATE_SHAPE), dtype=jnp.float32),<br>    &quot;actions&quot;: jnp.empty((BUFFER_SIZE,), dtype=jnp.int32),<br>    &quot;rewards&quot;: jnp.empty((BUFFER_SIZE,), dtype=jnp.int32),<br>    &quot;next_states&quot;: jnp.empty((BUFFER_SIZE, STATE_SHAPE), dtype=jnp.float32),<br>    &quot;dones&quot;: jnp.empty((BUFFER_SIZE,), dtype=jnp.bool_),<br>}</pre><p>We will use a UniformReplayBuffer class to interact with the buffer state. This class has two methods:</p><ul><li>add: Unwraps an experience tuple and maps its components to a specific index. idx = idx % self.buffer_size ensures that when the buffer is full, adding new experiences overwrites older ones.</li><li>sample: Samples a sequence of random indexes from the uniform random distribution. The sequence length is set by batch_size while the range of the indexes is [0, current_buffer_size-1]. This ensures that we do not sample empty arrays while the buffer is not yet full. 
Finally, we use JAX’s vmap in combination with tree_map to return a batch of experiences.</li></ul><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/023105f4de64298471d0d67c6cd74853/href">https://medium.com/media/023105f4de64298471d0d67c6cd74853/href</a></iframe><h3>Translating the <strong>CartPole </strong>environment to <strong>JAX</strong></h3><p>Now that our DQN agent is ready for training, we’ll quickly implement a vectorized CartPole environment using the same framework as introduced in an <a href="https://towardsdatascience.com/vectorize-and-parallelize-rl-environments-with-jax-q-learning-at-the-speed-of-light-49d07373adf5">earlier article</a>. CartPole is a control environment with a <strong>large continuous observation space, </strong>which makes it relevant to test our DQN.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/0*lpRee3NrehU6l3PX.gif" /><figcaption>Visual representation of the CartPole Environment (credits and documentation: <a href="https://gymnasium.farama.org/environments/classic_control/cart_pole/">OpenAI Gymnasium</a>, MIT license)</figcaption></figure><p>The process is quite straightforward: we reuse most of <a href="https://github.com/Farama-Foundation/Gymnasium/blob/main/gymnasium/envs/classic_control/cartpole.py">OpenAI’s Gymnasium implementation</a> while making sure we use JAX arrays and lax control flow instead of Python or Numpy alternatives, for instance:</p><pre># Python implementation<br>force = self.force_mag if action == 1 else -self.force_mag<br># Jax implementation<br>force = lax.select(jnp.all(action) == 1, self.force_mag, -self.force_mag)<br><br># Python<br>costheta, sintheta = math.cos(theta), math.sin(theta)<br># Jax<br>cos_theta, sin_theta = jnp.cos(theta), jnp.sin(theta)<br><br># Python<br>if not terminated:<br>  reward = 1.0<br>...<br>else: <br>  reward = 0.0<br># Jax<br>reward = jnp.float32(jnp.invert(done))</pre><p>For the sake of 
brevity, the full environment code is available here:</p><p><a href="https://github.com/RPegoud/jym/blob/main/src/envs/control/cartpole.py">jym/src/envs/control/cartpole.py at main · RPegoud/jym</a></p><h3>The <strong>JAX </strong>way to write <strong>efficient training loops</strong></h3><p>The last part of our implementation of DQN is the training loop <em>(also called rollout). </em>As mentioned in previous articles, we have to respect a specific format in order to take advantage of JAX’s speed.</p><p>The rollout function might appear daunting at first, but most of its complexity is purely syntactic as we’ve already covered most of the building blocks. Here’s a pseudo-code walkthrough:</p><pre>1. Initialization:<br>  * Create empty arrays that will store the states, actions, rewards <br>    and done flags for each timestep. Initialize the networks and optimizer<br>    with dummy arrays.<br>  * Wrap all the initialized objects in a val tuple<br><br>2. Training loop (repeat for i steps):<br>  * Unpack the val tuple<br>  * (Optional) Decay epsilon using a decay function<br>  * Take an action depending on the state and model parameters<br>  * Perform an environment step and observe the next state, reward <br>    and done flag<br>  * Create an experience tuple (state, action, reward, new_state, done)<br>    and add it to the replay buffer<br>  * Sample a batch of experiences depending on the current buffer size<br>    (i.e. 
sample only from experiences that have non-zero values)<br>  * Update the model parameters using the experience batch<br>  * Every N steps, update the target network&#39;s weights <br>    (set target_params = online_params)<br>  * Store the experience&#39;s values for the current episode and return <br>    the updated `val` tuple</pre><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/ea8779e9ebf69d62835d6e063791864f/href">https://medium.com/media/ea8779e9ebf69d62835d6e063791864f/href</a></iframe><p>We can now run DQN for <strong>20,000 steps</strong> and observe its performance. After around 45 episodes, the agent achieves decent performance, balancing the pole for more than 100 steps consistently.</p><p>The <strong>green bars</strong> indicate that the agent managed to balance the pole for <strong>more than 200 steps</strong>, <strong>solving the environment</strong>. Notably, the agent set its record on the <strong>51st episode</strong>, with <strong>393 steps</strong>.</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fplotly.com%2F%7ERyan_pgd%2F15.embed%3Fautosize%3Dtrue&amp;display_name=Plotly&amp;url=https%3A%2F%2Fchart-studio.plotly.com%2F%7ERyan_pgd%2F15%2F&amp;image=https%3A%2F%2Fchart-studio.plotly.com%2Fstatic%2Fwebapp%2Fimages%2Fplotly-logo.8d56a320dbb8.png&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=plotly" width="600" height="400" frameborder="0" scrolling="no"><a href="https://medium.com/media/3c385a3226b6585e8d462f5d95447bc4/href">https://medium.com/media/3c385a3226b6585e8d462f5d95447bc4/href</a></iframe><p>The <strong>20,000 training steps</strong> were executed in <strong>just over a second</strong>, at a rate of <strong>15,807 steps per second</strong> <em>(on a </em><strong><em>single CPU</em></strong><em>)</em>!</p><p>These performances hint at JAX’s impressive scaling capabilities, allowing practitioners to run large-scale 
parallelized experiments with minimal hardware requirements.</p><pre>Running for 20,000 iterations: 100%|██████████| 20000/20000 [00:01&lt;00:00, 15807.81it/s]</pre><p>We’ll take a closer look at <strong>parallelized rollout procedures</strong> to run <strong>statistically significant</strong> experiments and <strong>hyperparameter searches </strong>in a future article!</p><p>In the meantime, feel free to reproduce the experiment and dabble with hyperparameters using this notebook:</p><p><a href="https://github.com/RPegoud/jym/blob/main/notebooks/control/cartpole/dqn_cartpole.ipynb">jym/notebooks/control/cartpole/dqn_cartpole.ipynb at main · RPegoud/jym</a></p><h3>Conclusion</h3><p>As always, <strong>thanks for reading this far! </strong>I hope this article provided a decent introduction to Deep RL in JAX. Should you have any questions or feedback related to the content of this article, make sure to let me know, I’m always happy to have a little chat ;)</p><p>Until next time 👋</p><h3>Credits:</h3><ul><li><a href="https://gymnasium.farama.org/environments/classic_control/cart_pole/">Cartpole Gif</a>, OpenAI Gymnasium library, (MIT license)</li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=c1e45a179b92" width="1" height="1" alt=""><hr><p><a href="https://medium.com/data-science/a-gentle-introduction-to-deep-reinforcement-learning-in-jax-c1e45a179b92">A Gentle Introduction to Deep Reinforcement Learning in JAX</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Implementing a Transformer Encoder from Scratch with JAX and Haiku ]]></title>
            <link>https://medium.com/data-science/implementing-a-transformer-encoder-from-scratch-with-jax-and-haiku-791d31b4f0dd?source=rss-27fba63b402e------2</link>
            <guid isPermaLink="false">https://medium.com/p/791d31b4f0dd</guid>
            <category><![CDATA[editors-pick]]></category>
            <category><![CDATA[transformers]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[nlp]]></category>
            <dc:creator><![CDATA[Ryan Pégoud]]></dc:creator>
            <pubDate>Tue, 07 Nov 2023 14:54:41 GMT</pubDate>
            <atom:updated>2023-11-07T14:54:41.373Z</atom:updated>
            <content:encoded><![CDATA[<h4>Understanding the fundamental building blocks of Transformers.</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*UfCDk9d2SgydLmUO" /><figcaption>Transformers, in the style of Edward Hopper (generated by Dall.E 3)</figcaption></figure><p>Introduced in 2017 in the seminal paper “<a href="https://arxiv.org/pdf/1706.03762.pdf"><strong><em>Attention is all you need</em></strong></a><strong><em>”</em></strong>[0], the Transformer architecture is arguably one of the most impactful breakthroughs in recent Deep Learning history, enabling the rise of large language models and even finding use in fields such as computer vision.</p><p>Succeeding former state-of-the-art architectures relying on <strong>recurrence </strong>such as Long Short-Term Memory (<strong>LSTM</strong>) networks or Gated Recurrent Units (<strong>GRU</strong>), <strong>Transformers </strong>introduce the concept of <strong>self-attention</strong>, coupled with an <strong>encoder/decoder </strong>architecture.</p><p>In this article, we’ll implement the first half of a Transformer, the <strong>Encoder</strong>, from scratch and step by step. We’ll use <strong>JAX </strong>as our main framework along with <strong>Haiku, </strong>one of DeepMind’s deep learning libraries.</p><p>In case you are unfamiliar with JAX or need a fresh reminder about its amazing functionalities, I’ve already covered the topic in the context of Reinforcement Learning in my <strong>previous article</strong>:</p><p><a href="https://towardsdatascience.com/vectorize-and-parallelize-rl-environments-with-jax-q-learning-at-the-speed-of-light-49d07373adf5">Vectorize and Parallelize RL Environments with JAX: Q-learning at the Speed of Light⚡</a></p><p>We’ll go over each of the blocks that make up the encoder and learn to implement them efficiently. 
In particular, the outline of this article contains:</p><ul><li>The<strong> Embedding Layer</strong> and <strong>Positional Encodings</strong></li><li><strong>Multi-Head Attention</strong></li><li><strong>Residual Connections</strong> and <strong>Layer Normalization</strong></li><li><strong>Position-wise Feed-Forward Networks</strong></li></ul><p><em>Disclaimer: this article is not intended to be a complete introduction to these notions as we’ll focus on implementation first. If needed, please refer to the resources at the end of this post.</em></p><p><strong><em>As always, the fully commented code for this article as well as illustrated notebooks are available on </em></strong><a href="https://github.com/RPegoud/jab"><strong><em>GitHub</em></strong></a><strong><em>, feel free to star the repository if you enjoyed the article!</em></strong></p><p><a href="https://github.com/RPegoud/jab">GitHub - RPegoud/jab: A collection of foundational Deep Learning models implemented in JAX</a></p><h4>Main parameters</h4><p>Before we get started, we need to define a few parameters that will play a crucial role in the encoder block:</p><ul><li><strong>Sequence Length</strong> (seq_len): The number of tokens or words in a sequence.</li><li><strong>Embedding Dimension </strong>(embed_dim): The dimension of the embeddings, in other words, the number of numerical values used to describe a single token or word.</li><li><strong>Batch Size (</strong>batch_size<strong>): </strong>The size of a batch of inputs, i.e. the number of sequences processed at the same time.</li></ul><p>The input sequences to our encoder model will typically be of shape <strong>(</strong>batch_size<strong>, </strong>seq_len<strong>)</strong>. 
In this article, we’ll use batch_size=32 and seq_len=10, which means that our encoder will simultaneously process 32 sequences of 10 words.</p><p>Paying attention to the shape of our data at each step of the processing will enable us to better visualize and understand how the data flows in the encoder block. Here’s a high-level overview of our encoder, we’ll start from the bottom with the <strong>embedding layer</strong> and <strong>positional encodings</strong>:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/977/1*Mfz2VwpV4_pmBgXlRxVSLw.jpeg" /><figcaption>Representation of the <strong>Transformer Encoder block</strong> (made by the author)</figcaption></figure><h3>Embedding Layer and Positional Encodings</h3><p>As mentioned previously, our model takes batched sequences of tokens as inputs. Generating those tokens could be as simple as collecting a set of unique words in our dataset, and assigning an index to each of them. Then we would sample <strong>32</strong> <strong>sequences </strong>of <strong>10 words </strong>and replace each word with its index in the vocabulary. This procedure would provide us with an array of shape <strong>(</strong>batch_size<strong>, </strong>seq_len<strong>)</strong>, as expected.</p><p>We are now ready to get started with our Encoder. The first step is to create “<strong><em>positional embeddings</em></strong>” for our sequences. Positional embeddings are the <strong>sum </strong>of <strong>word embeddings</strong> and <strong>positional encodings</strong>.</p><h4>Word Embeddings</h4><p>Word embeddings allow us to encode the <strong>meaning </strong>and <strong>semantic relations</strong> <strong>between words</strong> in our vocabulary. In this article, the embedding dimension is fixed to <strong>64</strong>. This means that each word is represented by a<strong> 64-dimensional vector</strong> so that words with similar meanings have similar coordinates. 
Moreover, we can manipulate these vectors to <strong>extract relations between words</strong>, as depicted below.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*YG1Ram02GG4jewJi" /><figcaption>Example of analogies derived from word embeddings (image from developers.google.com)</figcaption></figure><p>Using Haiku, generating learnable embeddings is as simple as calling:</p><pre>hk.Embed(vocab_size, embed_dim)</pre><p>These embeddings will be updated along with other learnable parameters during model training <em>(more on that in a second)</em>.</p><h4>Positional Encodings</h4><p>As opposed to recurrent neural nets, Transformers can’t infer the position of a token given a shared hidden state as they <strong>lack recurrent</strong> or <strong>convolutional structures</strong>. Hence the introduction of <strong>positional encodings, vectors</strong> that convey a <strong>token’s position </strong>in the <strong>input sequence</strong>.</p><p>Essentially, each token is assigned a <strong>positional vector</strong> composed of <strong>alternating sine and cosine values</strong>. Those vectors match the dimensionality of word embeddings so that both can be summed.</p><p>In particular, the original Transformer paper uses the following functions:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/468/0*QMuqej-eyXCUlVKW.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/496/0*wUaDUdiDqfE48-EZ.png" /><figcaption>Positional Encoding functions (reproduced from “Attention is all you need”, Vaswani et al. 2017)</figcaption></figure><p>The below figures enable us to further understand the functioning of positional encodings. 
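For reference, the two formulas above can be transcribed directly (a NumPy sketch for clarity, assuming an even embed_dim as in the original formulation):

```python
import numpy as np

def positional_encoding(seq_len, embed_dim):
    # Sinusoidal positional encodings, transcribed from the formulas above
    # (assumes embed_dim is even, so sine and cosine columns interleave)
    pe = np.zeros((seq_len, embed_dim))
    position = np.arange(seq_len)[:, None]                    # (seq_len, 1)
    div_term = 10_000 ** (np.arange(0, embed_dim, 2) / embed_dim)
    pe[:, 0::2] = np.sin(position / div_term)   # even dimensions: sine
    pe[:, 1::2] = np.cos(position / div_term)   # odd dimensions: cosine
    return pe

pe = positional_encoding(seq_len=10, embed_dim=64)
```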
Let’s take a look at the first row of the uppermost plot: we can see <strong>alternating sequences of zeros and ones.</strong> Indeed, rows represent the position of a token in the sequence (the pos variable) while columns represent the embedding dimension (the i variable).</p><p>Therefore, when pos=0, the previous equations return sin(0)=0 for even embedding dimensions and cos(0)=1 for odd dimensions.</p><p>Moreover, we see that adjacent rows share similar values, whereas the first and last rows are wildly different. This property is helpful for the model to assess the <strong>distance between words</strong> in the sequence as well as their <strong>ordering</strong>.</p><p>Finally, the third plot represents the sum of positional encodings and embeddings, which is the output of the embedding block.</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fplotly.com%2F%7ERyan_pgd%2F11.embed%3Fautosize%3Dtrue&amp;display_name=Plotly&amp;url=https%3A%2F%2Fchart-studio.plotly.com%2F%7ERyan_pgd%2F11%2F&amp;image=https%3A%2F%2Fchart-studio.plotly.com%2Fstatic%2Fwebapp%2Fimages%2Fplotly-logo.8d56a320dbb8.png&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=plotly" width="600" height="400" frameborder="0" scrolling="no"><a href="https://medium.com/media/71e9f04e859d0202fffef34771d108fc/href">https://medium.com/media/71e9f04e859d0202fffef34771d108fc/href</a></iframe><p>Using Haiku, we define the embedding layer as follows. Similarly to other deep learning frameworks, Haiku allows us to define <strong>custom modules</strong> (here hk.Module) to <strong>store learnable parameters</strong> and <strong>define the behavior</strong> of our model’s components.</p><p>Each Haiku module needs to have an __init__ and a __call__ function. 
Here, the __call__ function simply computes the embeddings using the hk.Embed function and the positional encodings, before summing them.</p><p>The positional encoding function uses JAX functionalities such as vmap and lax.cond for performance. If you are unfamiliar with those functions, feel free to check out my <a href="https://medium.com/towards-data-science/vectorize-and-parallelize-rl-environments-with-jax-q-learning-at-the-speed-of-light-49d07373adf5">previous post</a> where they are presented in more depth.</p><p>Put simply, vmap allows us to define a function for a <strong>single sample</strong> and <strong>vectorize it</strong> so that it can be applied to <strong>batches</strong> of data. The in_axes parameter is used to specify that we want to iterate over the first axis of the dim input, which is the embedding dimension. On the other hand, lax.cond is an XLA-compatible version of a Python if/else statement.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/ebcac5ffcd7cc380167dcb3fdb61de75/href">https://medium.com/media/ebcac5ffcd7cc380167dcb3fdb61de75/href</a></iframe><h3>Self-attention and Multi-Head Attention</h3><p>Attention aims to compute the <strong>importance of each word in a sequence</strong>, <strong>relative to an input word</strong>. For example, in the sentence:</p><blockquote>“The black cat jumped on the sofa, lay down and fell asleep, as it was tired”.</blockquote><p>The word “<strong>it</strong>” could be quite ambiguous for the model, as <em>technically</em>, it could refer to both “<strong>cat</strong>” and “<strong>sofa</strong>”. 
A well-trained attention model would be able to understand that “<strong>it</strong>” refers to “<strong>cat</strong>” and therefore assign attention values to the rest of the sentence accordingly.</p><p>Essentially,<strong> attention values</strong> could be seen as <strong>weights</strong> that describe the <strong>importance </strong>of a certain word <strong>given the context of the input</strong> word. For instance, the attention vector for the word “<strong>jumped</strong>” would have high values for words like “<strong>cat</strong>” (<em>what </em>jumped?), “<strong>on</strong>”, and “<strong>sofa</strong>” (<em>where </em>did it jump?) as these words are <strong>relevant to its context</strong>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ySDEvRPl9WfONWRVOey44Q.png" /><figcaption>Visual representation of an<strong> attention vector</strong> (made by the author)</figcaption></figure><p>In the Transformer paper, attention is computed using <strong><em>Scaled Dot-Product Attention</em></strong>, which is summarized by the formula:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/535/0*QEogF8q4SRV4uvl7.png" /><figcaption>Scaled Dot-Product Attention (reproduced from “Attention is all you need”, <em>Vaswani et al. 2017</em>)</figcaption></figure><p>Here, Q, K, and V stand for <strong><em>Queries, Keys </em></strong>and <strong><em>Values</em></strong><em>. </em>These matrices are obtained by multiplying learned weight vectors WQ, WK, and WV with positional embeddings.</p><p>These names are mainly <strong>abstractions </strong>used to help understand how the information is processed and weighted in the attention block. They are an allusion to <strong>retrieval systems</strong> vocabulary [2] (e.g. searching for a video on YouTube).</p><p>Here’s an <strong>intuitive </strong>explanation:</p><ul><li><strong>Queries</strong>: They can be interpreted as a “<em>set of questions</em>” about all the positions in a sequence. 
For instance, interrogating the context of a word and trying to identify the most relevant parts of the sequence.</li><li><strong>Keys</strong>: They can be seen as holding information that the queries interact with; the compatibility between a query and a key determines how much attention the query should pay to the corresponding value.</li><li><strong>Values</strong>: Matching keys and queries allows us to decide which keys are relevant; values are the actual content paired with the keys.</li></ul><p>In the following figure, the query is a YouTube search, the keys are the video descriptions and metadata, while the values are the associated videos.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/872/1*ubT1YkrPprthMq9pKGbroA.jpeg" /><figcaption>Intuitive representation of the Queries, Keys, Values concept (made by the author)</figcaption></figure><p>In our case, queries, keys, and values come from the <strong>same source</strong> (as they’re derived from the input sequences), hence the name <strong>self-attention</strong>.</p><p>The computation of attention scores is usually executed <strong>multiple times in parallel</strong>, each time with a <strong>fraction of the embeddings</strong>. This mechanism is called “<strong>Multi-Head Attention</strong>” and enables each head to learn several different representations of the data in parallel, leading to a more <strong>robust </strong>model.</p><p>A single attention head would generally process arrays with shape (batch_size, seq_len, d_k) where d_k can be set as the ratio between the embedding dimension and the number of heads (d_k = embed_dim/n_heads). 
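<p>For a single head, the scaled dot-product formula can be sketched in a few lines of JAX (an unbatched, illustrative version, not the article’s Haiku module):</p>

```python
import jax
import jax.numpy as jnp

def scaled_dot_product_attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V for a single head.
    q, k, v: arrays of shape (seq_len, d_k)."""
    d_k = q.shape[-1]
    scores = q @ k.T / jnp.sqrt(d_k)           # (seq_len, seq_len)
    weights = jax.nn.softmax(scores, axis=-1)  # each row sums to one
    return weights @ v                         # (seq_len, d_k)

key = jax.random.PRNGKey(0)
q, k, v = jax.random.normal(key, (3, 5, 4))  # three (seq_len=5, d_k=4) matrices
out = scaled_dot_product_attention(q, k, v)  # shape (5, 4)
```

<p>Note that each row of the attention weights sums to one, and the per-head output keeps the shape (seq_len, d_k).</p>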
This way, concatenating the outputs of each head conveniently gives an array with shape <strong>(</strong>batch_size, seq_len, embed_dim<strong>)</strong>, the same shape as the input.</p><p>The computation of attention matrices can be broken down into several steps:</p><ul><li>First, we define <strong>learnable weight vectors</strong> WQ, WK, and WV. These vectors have shapes <strong>(</strong>n_heads, embed_dim, d_k<strong>)</strong>.</li><li>In parallel, we <strong>multiply </strong>the <strong>positional embeddings</strong> with the <strong>weight vectors</strong>. We obtain Q, K, and V matrices with shapes <strong>(</strong>batch_size, seq_len, d_k<strong>)</strong>.</li><li>We then <strong>scale </strong>the <strong>dot-product</strong> of Q and K (transposed). This scaling involves dividing the result of the dot-product by the square root of d_k and applying the softmax function to each row. Therefore, attention scores for an input token (i.e. a row) sum up to one; this helps prevent values from becoming too large and slowing down computation. The output has shape (batch_size, seq_len, seq_len).</li><li>Finally, we dot the result of the previous operation with V, making the shape of the output (batch_size, seq_len, d_k).</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*IrLV4_NhMZ8MtWeC9xt0fA.png" /><figcaption>Visual representation of matrix operations inside <strong>an attention block </strong>(made by the author)</figcaption></figure><ul><li>The outputs of each attention head can then be <strong>concatenated </strong>to form a matrix with shape (batch_size, seq_len, embed_dim). 
The Transformer paper also adds a <strong>linear layer</strong> at the end of the multi-head attention module, to <strong>aggregate </strong>and <strong>combine </strong>the learned representations from<strong> all the attention heads</strong>.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*FVx4Hffl1lfZo1Lm2Tbf0w.png" /><figcaption>Concatenation of multi-head attention matrices and linear layer (made by the author)</figcaption></figure><p>In Haiku, the Multi-Head Attention module can be implemented as follows. The __call__ function follows the same logic as the above graph while the class methods take advantage of JAX utilities such as vmap (to vectorize our operations over the different attention heads and matrices) and tree_map (to map matrix dot-products over weight vectors).</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/d3b7c8111f21d6f08c206b778f7be2f7/href">https://medium.com/media/d3b7c8111f21d6f08c206b778f7be2f7/href</a></iframe><h3>Residual Connections and Layer Normalization</h3><p>As you might have noticed on the Transformer graph, the multi-head attention block and the feed-forward net are followed by <strong>residual connections</strong> and <strong>layer normalization</strong>.</p><h4>Residual or skip connections</h4><p>Residual connections are a standard solution to the <strong>vanishing gradient problem</strong>, which occurs when gradients become too small to effectively update the model’s parameters.</p><p>As this issue naturally arises in particularly deep architectures, residual connections are used in a variety of complex models such as <strong>ResNet </strong><em>(</em><a href="https://arxiv.org/abs/1512.03385v1"><em>He et al.</em></a><em>, 2015)</em> in computer vision, <strong>AlphaZero </strong>(<a href="https://arxiv.org/abs/1712.01815v1"><em>Silver et al.</em></a><em>, 2017</em>) in reinforcement learning, and of course, 
<strong>Transformers</strong>.</p><p>In practice, residual connections simply forward the output of a specific layer to a following one, <strong>skipping one or more layers</strong> on the way. For instance, the residual connection around the multi-head attention is equivalent to summing the output of multi-head attention with the positional embeddings that were fed into it.</p><p>This enables gradients to flow more efficiently through the architecture during backpropagation and can usually lead to<strong> faster convergence </strong>and more <strong>stable training</strong>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/522/1*_BL1Q4sT5kzmBQ7wBl7DnQ.jpeg" /><figcaption>Representation of <strong>residual connections</strong> in Transformers (made by the author)</figcaption></figure><h4>Layer Normalization</h4><p>Layer normalization helps ensure that the values propagated through the model do not “<strong><em>explode</em></strong>” (tend toward infinity), which could easily happen in attention blocks, where several matrices are multiplied during each forward pass.</p><p>Unlike batch normalization, which normalizes across the batch dimension and assumes that samples follow the same distribution,<strong> layer normalization operates</strong> <strong>across the features</strong>. 
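<p>Concretely, for activations of shape (batch_size, seq_len, embed_dim), this means standardizing each token’s feature vector independently. A bare-bones sketch, without Haiku:</p>

```python
import jax
import jax.numpy as jnp

def layer_norm(x, alpha, beta, eps=1e-6):
    """Normalize over the feature (last) axis, then rescale with learnable
    alpha (scale) and beta (offset), each of shape (embed_dim,)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return alpha * (x - mean) / jnp.sqrt(var + eps) + beta

x = jax.random.normal(jax.random.PRNGKey(1), (2, 3, 8))  # (batch, seq_len, embed_dim)
normed = layer_norm(x, alpha=jnp.ones(8), beta=jnp.zeros(8))
```

<p>After this operation, every token’s feature vector has roughly zero mean and unit variance, regardless of the statistics of the other sentences in the batch.</p>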
This approach is suitable for sentence batches where each sentence may have a <strong>unique distribution</strong> due to <strong>varying meanings</strong> and <strong>vocabularies</strong>.</p><p>By normalizing across <strong>features</strong>, such as <strong>embeddings </strong>or <strong>attention values</strong>, layer normalization<strong> standardizes data</strong> to a consistent scale<strong> without conflating distinct sentence characteristics</strong>, maintaining the unique distribution of each.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*NlJu3E6z-fZLExXGvTzcnA.jpeg" /><figcaption>Representation of <strong>Layer Normalization</strong> in the context of Transformers (made by the author)</figcaption></figure><p>The implementation of layer normalization is pretty straightforward: we initialize the learnable parameters alpha and beta and normalize along the desired feature axis.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/4259fcbe42c4c4f7e646d6526e3b2ff0/href">https://medium.com/media/4259fcbe42c4c4f7e646d6526e3b2ff0/href</a></iframe><h3><strong>Position-wise Feed-Forward Network</strong></h3><p>The last component of the encoder that we need to cover is the<strong> position-wise feed-forward network</strong>. 
This fully connected network takes the normalized outputs of the attention block as inputs and is used to introduce <strong>non-linearity</strong> and increase the <strong>model’s capacity</strong> to learn complex functions.</p><p>It is composed of two dense layers separated by a <a href="https://paperswithcode.com/method/gelu">gelu activation</a>:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/664d0958c76af2c37963a3ca9be4e35f/href">https://medium.com/media/664d0958c76af2c37963a3ca9be4e35f/href</a></iframe><p>After this block, we have another residual connection and layer normalization to complete the encoder.</p><h3>Wrapping up</h3><p>There we have it! By now you should be familiar with the main concepts of the Transformer encoder. Here’s the full encoder class; notice that in Haiku, we assign a name to each layer, so that learnable parameters are separated and easy to access. The __call__ function provides a good summary of the different steps of our encoder:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/f79e91cc89202e6ffaf523620c130ed2/href">https://medium.com/media/f79e91cc89202e6ffaf523620c130ed2/href</a></iframe><p>To use this module on actual data, we have to apply hk.transform to a function encapsulating the encoder class. Indeed, you might remember that JAX embraces the <strong>functional programming</strong> paradigm; therefore, Haiku follows the same principles.</p><p>We define a function containing an instance of the encoder class and return the output of a forward pass. 
Applying hk.transform returns a transformed object exposing two functions: init and apply.</p><p>The former enables us to initialize the module with a random key as well as some dummy data (notice that here we pass an array of zeros with shape (batch_size, seq_len)) while the latter allows us to process real data.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/61a044ea8f07ec0257e911e68f17f2f0/href">https://medium.com/media/61a044ea8f07ec0257e911e68f17f2f0/href</a></iframe><pre># Note: the two following syntaxes are equivalent<br># 1: Using transform as a decorator<br>@hk.transform<br>def encoder(x):<br>  ...<br>  return model(x) <br> <br>encoder.init(...)<br>encoder.apply(...)<br><br># 2: Applying transform separately<br>def encoder(x):<br>  ...<br>  return model(x)<br><br>encoder_fn = hk.transform(encoder)<br>encoder_fn.init(...)<br>encoder_fn.apply(...)</pre><p>In the next article, we’ll <strong>complete the transformer</strong> architecture by adding a <strong>decoder</strong>, which reuses most of the blocks we introduced so far, and learn how to <strong>train a model</strong> on a specific task using Optax!</p><h3><strong>Conclusion</strong></h3><p><strong>Thank you for reading this far!</strong> If you are interested in dabbling with the code, you can find it fully commented on GitHub, along with additional details and a walkthrough using a toy dataset.</p><p><a href="https://github.com/RPegoud/jab">GitHub - RPegoud/jab: A collection of foundational Deep Learning models implemented in JAX</a></p><p>If you’d like to dig deeper into Transformers, the following section contains some articles that helped me write this article.</p><p>Until next time 👋</p><h3>References and Resources:</h3><p>[1] <a href="https://arxiv.org/pdf/1706.03762.pdf"><strong><em>Attention is all you need</em></strong></a> (2017), Vaswani et al, Google</p><p>[2] <a 
href="https://stats.stackexchange.com/questions/421935/what-exactly-are-keys-queries-and-values-in-attention-mechanisms"><strong><em>What exactly are keys, queries, and values in attention mechanisms?</em></strong></a><strong><em> </em></strong><em>(2019)</em><strong><em> </em></strong>Stack Exchange</p><p>[3] <a href="http://jalammar.github.io/illustrated-transformer/"><strong><em>The Illustrated Transformer</em></strong></a><strong><em> </em></strong><em>(2018), </em><a href="http://jalammar.github.io/">Jay Alammar</a></p><p>[4] <a href="https://machinelearningmastery.com/a-gentle-introduction-to-positional-encoding-in-transformer-models-part-1/"><strong><em>A Gentle Introduction to Positional Encoding in Transformer Models</em></strong></a><strong><em> </em></strong><em>(2023), </em><a href="https://machinelearningmastery.com/author/msaeed/"><strong>Mehreen Saeed</strong></a>, Machine Learning Mastery</p><ul><li><a href="https://jax.readthedocs.io/en/latest/index.html"><strong><em>JAX documentation</em></strong></a></li><li><a href="https://dm-haiku.readthedocs.io/en/latest/notebooks/basics.html"><strong><em>Haiku documentation</em></strong></a></li></ul><h3>Image Credits</h3><ul><li><a href="https://developers.google.com/machine-learning/crash-course/embeddings/translating-to-a-lower-dimensional-space?hl=fr">Word embeddings</a>, developers.google.com</li><li>Cat picture, <a href="https://unsplash.com/fr/photos/bulldog-francese-marrone-che-indossa-una-camicia-gialla-5PVXkqt2s9k">Karsten Winegeart</a>, Unsplash</li><li>Norway landscape, <a href="https://unsplash.com/fr/photos/corpo-de-agua-perto-da-montanha-LKOuYT5_dyw">Pascal Debrunner</a>, Unsplash</li><li>Dog picture, <a href="https://unsplash.com/fr/photos/chaton-tabby-argente-sur-marbre-7AIDE8PrvA0">Loan</a>, Unsplash</li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=791d31b4f0dd" width="1" height="1" alt=""><hr><p><a 
href="https://medium.com/data-science/implementing-a-transformer-encoder-from-scratch-with-jax-and-haiku-791d31b4f0dd">Implementing a Transformer Encoder from Scratch with JAX and Haiku 🤖</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Vectorize and Parallelize RL Environments with JAX: Q-learning at the Speed of Light⚡]]></title>
            <link>https://medium.com/data-science/vectorize-and-parallelize-rl-environments-with-jax-q-learning-at-the-speed-of-light-49d07373adf5?source=rss-27fba63b402e------2</link>
            <guid isPermaLink="false">https://medium.com/p/49d07373adf5</guid>
            <category><![CDATA[parallel-computing]]></category>
            <category><![CDATA[jax]]></category>
            <category><![CDATA[python]]></category>
            <category><![CDATA[reinforcement-learning]]></category>
            <category><![CDATA[machine-learning]]></category>
            <dc:creator><![CDATA[Ryan Pégoud]]></dc:creator>
            <pubDate>Sun, 15 Oct 2023 15:41:03 GMT</pubDate>
            <atom:updated>2023-10-30T16:06:23.608Z</atom:updated>
            <content:encoded><![CDATA[<h4>In this article, we learn to vectorize an RL environment and train 30 Q-learning agents in parallel on a CPU, at 1.8 million iterations per second.</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Sn0-yKB__HAuw7BssQhzkw.png" /><figcaption>Image by <a href="https://unsplash.com/fr/@googledeepmind">Google DeepMind</a> on <a href="https://unsplash.com/fr">Unsplash</a></figcaption></figure><p>In the previous story, we introduced <strong>Temporal-Difference Learning, </strong>particularly <strong>Q-learning</strong>, in the context of a GridWorld.</p><p><a href="https://towardsdatascience.com/temporal-difference-learning-and-the-importance-of-exploration-an-illustrated-guide-5f9c3371413a">Temporal-Difference Learning and the importance of exploration: An illustrated guide</a></p><p>While this implementation served the purpose of demonstrating the differences in performances and exploration mechanisms of these algorithms, <strong><em>it was painfully slow</em></strong>.</p><p>Indeed, the environment and agents were mainly coded in <strong>Numpy</strong>, which is by no means a standard in RL, even though it makes the code easy to understand and debug.</p><p>In this article, we’ll see how to scale up RL experiments by <strong>vectorizing environments</strong> and seamlessly <strong>parallelizing </strong>the training of dozens of agents using <strong>JAX</strong>. 
In particular, this article covers:</p><ul><li>JAX basics and useful features for RL</li><li>Vectorized environments and why they are so fast</li><li>Implementation of an environment, policy, and Q-learning agent in JAX</li><li>Single-agent training</li><li>How to parallelize agent training, and how easy it is!</li></ul><p><em>All the code featured in this article is available on </em><a href="https://github.com/RPegoud"><strong><em>GitHub</em></strong></a><em>:</em></p><p><a href="https://github.com/RPegoud/jym">GitHub - RPegoud/jym: JAX implementation of RL algorithms and vectorized environments</a></p><h3>JAX Basics</h3><p>JAX is <em>yet another</em> Python Deep Learning framework developed by Google and widely used by companies such as DeepMind.</p><blockquote>“JAX is <a href="https://github.com/hips/autograd">Autograd</a> (automatic differentiation) and <a href="https://www.tensorflow.org/xla">XLA</a> (Accelerated Linear Algebra, a TensorFlow compiler), brought together for high-performance numerical computing.” — <a href="https://jax.readthedocs.io/en/latest/index.html">Official Documentation</a></blockquote><p>As opposed to what most Python developers are used to, JAX doesn’t embrace the <strong>object-oriented programming</strong> (OOP) paradigm, but rather <strong>functional programming (FP) [1]</strong>.</p><p>Put simply, it relies on <strong><em>pure functions</em></strong> (<strong>deterministic </strong>and <strong>without side effects</strong>) and <strong><em>immutable data structures</em> (</strong>instead of changing the data in place, <strong>new data structures</strong> are <strong>created with the desired modifications) </strong>as primary building blocks. 
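<p>For instance, JAX arrays themselves are immutable: an update returns a brand-new array and leaves the original untouched.</p>

```python
import jax.numpy as jnp

q_values = jnp.zeros((4, 4))
updated = q_values.at[2, 3].set(1.0)  # returns a *new* array

assert float(q_values[2, 3]) == 0.0   # the original is unchanged
assert float(updated[2, 3]) == 1.0
```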
As a result, FP encourages a more functional and mathematical approach to programming, making it well-suited for tasks like numerical computing and machine learning.</p><p>Let’s illustrate the differences between those two paradigms by looking at pseudocode for a Q-update function:</p><ul><li>The <strong>object-oriented</strong> approach relies on a <strong><em>class instance</em></strong> containing various <strong><em>state variables</em></strong> (such as the Q-values). The update function is defined as a class method that <strong>updates the <em>internal state</em></strong> of the instance.</li><li>The <strong>functional programming</strong> approach relies on a <strong><em>pure function</em></strong>. Indeed, this Q-update is <strong>deterministic</strong> as the Q-values are passed as an argument. Therefore, any call to this function with the <strong>same inputs</strong> will result in the <strong>same outputs</strong> whereas a class method’s outputs may depend on the internal state of the instance. Also, <strong>data structures</strong> such as arrays are <strong>defined </strong>and <strong>modified</strong> in the <strong>global scope</strong>.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*OuKd01A2rPMKjfqVKOAydA.png" /><figcaption>Implementing a Q-update in <strong>Object-Oriented Programming </strong>and <strong>Functional Programming </strong>(made by the author)</figcaption></figure><p>As such, JAX offers a variety of <strong>function decorators</strong> that are particularly useful in the context of RL:</p><ul><li><a href="https://jax.readthedocs.io/en/latest/_autosummary/jax.vmap.html#jax.vmap"><strong>vmap</strong></a><strong> (vectorized map)</strong>: Allows a function acting on a single sample to be applied on a <strong>batch</strong>. 
For instance, if <em>env.step()</em> is a function performing a step in a single environment, <em>vmap(env.step)()</em> is a function performing a step in <strong>multiple environments</strong>. In other words, vmap adds a <strong>batch dimension </strong>to a function.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*cG6t5RzjEDhEHZaLGRngFg.png" /><figcaption>Illustration of a <strong>step </strong>function vectorized using <strong>vmap </strong>(made by the author)</figcaption></figure><ul><li><strong>j</strong><a href="https://jax.readthedocs.io/en/latest/_autosummary/jax.jit.html#jax.jit"><strong>it</strong></a><strong> (just-in-time compilation)</strong>: Allows JAX to perform a “<em>Just In Time compilation of a JAX Python function” </em>making it <strong>XLA-compatible</strong><em>. </em>Essentially, using jit allows us to <strong>compile functions</strong> and provides <strong>significant speed improvements</strong> (in exchange for some additional overhead when first compiling the function).</li><li><a href="https://jax.readthedocs.io/en/latest/_autosummary/jax.pmap.html#jax.pmap"><strong>pmap</strong></a><strong> (parallel map)</strong>: Similarly to vmap, pmap enables easy parallelization. However, instead of adding a batch dimension to a function, it replicates the function and executes it on <strong>several XLA devices</strong>. 
<em>Note: when applying pmap, jit is also applied </em><strong><em>automatically</em></strong><em>.</em></li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*hgFRWIDLA7V8vTV_7gAvZw.png" /><figcaption>Illustration of a <strong>step </strong>function parallelized using <strong>pmap </strong>(made by the author)</figcaption></figure><p>Now that we have laid down the basics of JAX, we’ll see how to obtain massive speed-ups by vectorizing environments.</p><h3>Vectorized Environments:</h3><p>First, what is a vectorized environment and what problems does vectorization solve?</p><p>In most cases, RL experiments are <strong>slowed down</strong> by <strong>CPU-GPU data transfers</strong>. Deep Learning RL algorithms such as <strong>Proximal Policy Optimization</strong> (PPO) use Neural Networks to approximate the policy.</p><p>As always in Deep Learning, Neural Networks use <strong>GPUs</strong> at <strong>training </strong>and <strong>inference </strong>time. However, in most cases, <strong>environments </strong>run on the <strong>CPU</strong> (even in the case of multiple environments being used in parallel).</p><p>This means that the usual RL loop of selecting actions via the policy (Neural Networks) and receiving observations and rewards from the environment requires <strong>constant back-and-forths</strong> between the GPU and the CPU, which <strong>hurts performance</strong>.</p><p>In addition, using frameworks such as PyTorch without <em>“jitting” </em>might cause some overhead, since the GPU might have to wait for Python to send back observations and rewards from the CPU.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Bje2CxmcwEd-iNViCXUagQ.png" /><figcaption>Usual RL batched training setup in <strong>PyTorch </strong>(made by the author)</figcaption></figure><p>On the other hand, JAX enables us to easily run batched environments on the GPU, removing the friction caused by GPU-CPU data transfer.</p><p>Moreover, as jit 
compiles our JAX code to XLA, the execution is no longer (or at least less) affected by the inefficiency of Python.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*6DFSLgGDU4M7kYSlu9WSyQ.png" /><figcaption>RL batched training setup in <strong>JAX </strong>(made by the author)</figcaption></figure><p>For more details and exciting applications to <strong>meta-learning RL research</strong>, I highly recommend this blog post by <a href="https://chrislu.page/blog/meta-disco/">Chris Lu</a>.</p><iframe src="https://cdn.embedly.com/widgets/media.html?type=text%2Fhtml&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;schema=twitter&amp;url=https%3A//twitter.com/_chris_lu_/status/1643992216413831171%3Fs%3D20&amp;image=https%3A//i.embed.ly/1/image%3Furl%3Dhttps%253A%252F%252Fabs.twimg.com%252Ferrors%252Flogo46x38.png%26key%3Da19fcc184b9711e1b4764040d3dc5c07" width="500" height="281" frameborder="0" scrolling="no"><a href="https://medium.com/media/51f6a09d44b486f82f1dac941ba1c012/href">https://medium.com/media/51f6a09d44b486f82f1dac941ba1c012/href</a></iframe><h3>Environment, Agent, and Policy implementations:</h3><p>Let’s take a look at the implementation of the different parts of our RL experiment. Here’s a high-level overview of the basic functions we’ll need:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*9BLvUdYdKGbr2II-6aT1Rw.png" /><figcaption>Class methods required for a simple RL setup (made by the author)</figcaption></figure><h4>The environment</h4><p>This implementation follows the scheme provided by <a href="https://medium.com/@ngoodger_7766?source=post_page-----9f74338898ba--------------------------------">Nikolaj Goodger</a> in his great article on writing environments in JAX.</p><p><a href="https://medium.com/@ngoodger_7766/writing-an-rl-environment-in-jax-9f74338898ba">Writing an RL Environment in JAX</a></p><p>Let’s start with a <strong>high-level view</strong> of the environment and its methods. 
This is a general plan for implementing an environment in JAX:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/adc19912c513d5d16903c068ee5fb428/href">https://medium.com/media/adc19912c513d5d16903c068ee5fb428/href</a></iframe><p>Let’s take a closer look at the class methods <em>(as a reminder, functions starting with “_” are </em><strong><em>private </em></strong><em>and shall not be called outside of the scope of the class)</em>:</p><ul><li><strong>_get_obs</strong>: This method converts the environment state to an observation for the agent. In a <strong>partially observable</strong> or <strong>stochastic </strong>environment, the processing functions applied to the state would go here.</li><li><strong>_reset</strong>: As we’ll be running multiple agents in parallel, we need a method for individual resets on the completion of an episode.</li><li><strong>_reset_if_done</strong>: This method will be called at each step and trigger _reset if the “done” flag is set to True.</li><li><strong>reset</strong>: This method is called at the beginning of the experiment to get the initial state of each agent, as well as the associated random keys</li><li><strong>step</strong>: Given a state and an action, the environment returns an observation (new state), a reward, and the updated “done” flag.</li></ul><p>In practice, a generic implementation of a GridWorld environment would look like this:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/61cae4744a9500cf0fa1c2b3c47b8905/href">https://medium.com/media/61cae4744a9500cf0fa1c2b3c47b8905/href</a></iframe><p>Notice that, as mentioned earlier, all class methods follow the <strong>functional programming</strong> paradigm. Indeed, we never update the internal state of the class instance. 
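<p>Stripped of the class machinery, the heart of such a pure step function can be sketched as follows (a simplified 5x5 GridWorld, not the article’s full implementation):</p>

```python
import jax
import jax.numpy as jnp

MOVEMENTS = jnp.array([[0, 1], [0, -1], [1, 0], [-1, 0]])  # right, left, down, up

@jax.jit
def step(state, action):
    """Pure step: returns a new state instead of mutating anything."""
    return jnp.clip(state + MOVEMENTS[action], 0, 4)  # stay inside the grid

# vmap adds a batch dimension: one call steps N agents at once
batched_step = jax.vmap(step)

states = jnp.zeros((3, 2), dtype=jnp.int32)  # three agents at (0, 0)
actions = jnp.array([0, 2, 1])               # right, down, left
new_states = batched_step(states, actions)
```

<p>Because the function is pure, jit can compile it and vmap can batch it without any hidden state getting in the way.</p>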
Furthermore, the <strong>class attributes</strong> are all <strong>constants</strong> that won’t be modified after instantiation.</p><p>Let’s take a closer look:</p><ul><li><strong>__init__: </strong>In the context of our GridWorld, the available actions are <strong>[</strong>0, 1, 2, 3<strong>]</strong>. These actions are translated into a 2-dimensional array using <em>self.movements </em>and added to the state in the step function.</li><li><strong>_get_obs: </strong>Our environment is <strong>deterministic </strong>and <strong>fully observable</strong>, therefore the agent receives the state directly instead of a processed observation.</li><li><strong>_reset_if_done: </strong>The argument <em>env_state</em> corresponds to the (state, key) tuple where key is a <a href="https://jax.readthedocs.io/en/latest/jax.random.html"><em>jax.random.PRNGKey</em></a><em>. </em>This function simply returns the initial state if the <em>done </em>flag is set to True, however, we cannot use conventional Python control flow within JAX jitted functions. Using <a href="https://jax.readthedocs.io/en/latest/notebooks/Common_Gotchas_in_JAX.html#cond"><em>jax.lax.cond</em></a><em> </em>we essentially get an expression equivalent to:</li></ul><pre>def cond(condition, true_fun, false_fun, operand):<br>  if condition: # if done flag == True<br>    return true_fun(operand)  # return self._reset(key)<br>  else:<br>    return false_fun(operand) # return env_state</pre><ul><li><strong>step: </strong>We convert the action to a movement and add it to the current state (<em>jax.numpy.clip </em>ensures that the agent stays within the grid). We then update the <em>env_state </em>tuple before checking if the environment needs to be reset. As the step function is used frequently throughout training, jitting it allows significant performance gains. 
The <em>@partial(jit, static_argnums=(0,)) </em>decorator signals that the “<em>self”</em> argument of the class method should be considered <strong>static</strong>. In other words, the <strong>class properties are constant</strong> and won’t change during successive calls to the step function.</li></ul><h4>Q-Learning Agent</h4><p>The Q-learning agent is defined by the <strong>update </strong>function, as well as a static <strong>learning rate</strong> and <strong>discount factor</strong>.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/e4fa1d223ac28115e273440c1c292382/href">https://medium.com/media/e4fa1d223ac28115e273440c1c292382/href</a></iframe><p>Once again, when jitting the update function, we pass the “self” argument as static. Also, notice that the <em>q_values </em>matrix is updated with <em>set()</em>, which returns a new array (JAX arrays are immutable), and its value is not stored as a class attribute.</p><h4>Epsilon-Greedy Policy</h4><p>Finally, the policy used in this experiment is the standard <strong>epsilon-greedy policy</strong>. One important detail is that it uses <strong>random tie-breaks</strong>, which means that if the maximal Q-value is not unique, the action will be <strong>sampled uniformly</strong> from the <strong>maximal Q-values</strong> <em>(using argmax would always return the first action with maximal Q-value).</em> This is especially important if Q-values are initialized as a matrix of zeros, as the action 0 (move right) would always be selected.</p><p>Otherwise, the policy can be summarized by this snippet:</p><pre>action = lax.cond(<br>            explore, # if p &lt; epsilon<br>            _random_action_fn, # select a random action given the key<br>            _greedy_action_fn, # select the greedy action w.r.t Q-values<br>            operand=subkey, # use subkey as an argument for the above funcs<br>        )<br>return action, subkey</pre><p>Note that when we use a <strong><em>key </em></strong>in JAX <em>(e.g. 
here we sampled a random float and used random.choice) </em>it is common practice to split the key afterward <em>(i.e. “move on to a new random state”, more details </em><a href="https://jax.readthedocs.io/en/latest/jax-101/05-random-numbers.html#random-numbers-in-jax"><em>here</em></a><em>).</em></p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/07f0a02c79b8c098bdd782d1156ca04d/href">https://medium.com/media/07f0a02c79b8c098bdd782d1156ca04d/href</a></iframe><h3>Single-agent training loop:</h3><p>Now that we have all the required components, let’s train a single agent.</p><p>Here’s a <strong><em>Pythonic</em> </strong>training loop: as you can see, we are essentially selecting an action using the policy, performing a step in the environment, and updating the Q-values, until the end of an episode. Then we repeat the process for <strong><em>N </em></strong>episodes. As we’ll see in a minute, this way of training an agent is quite <strong>inefficient</strong>; however, it summarizes the key steps of the algorithm in a readable way:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/ea2f949e4a8a39891b27ab768e94500f/href">https://medium.com/media/ea2f949e4a8a39891b27ab768e94500f/href</a></iframe><p>On a single CPU, we complete 10,000 episodes in 11 seconds, at a rate of 881 episodes and 21,680 steps per second.</p><pre>100%|██████████| 10000/10000 [00:11&lt;00:00, 881.86it/s]<br>Total Number of steps: 238 488<br>Number of steps per second: 21 680</pre><p>Now, let’s replicate the same training loop using JAX syntax. 
Here’s a high-level description of the <strong>rollout</strong> function:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*qNcfSp0T2OeW7aXvWPpXSA.png" /><figcaption>Training rollout function using <strong>JAX syntax </strong>(made by the author)</figcaption></figure><p>To summarize, the rollout function:</p><ol><li><strong>Initializes </strong>the <strong>observations</strong>, <strong>rewards</strong>, and <strong>done </strong>flags as empty arrays with a dimension equal to the number of time steps using <em>jax.numpy.zeros. </em>The <strong>Q-values</strong> are initialized as an empty matrix with shape <strong>[</strong>timesteps<strong>+1</strong>, grid_dimension_x, grid_dimension_y, n_actions<strong>]</strong>.</li><li>Calls the <strong><em>env.reset()</em> </strong>function to get the initial state.</li><li>Uses the<strong><em> jax.lax.fori_loop()</em></strong> function to call a <strong><em>fori_body() </em></strong>function <strong><em>N</em></strong> times, where <strong><em>N</em></strong> is the <strong><em>timestep </em></strong>parameter.</li><li>The <strong><em>fori_body() </em></strong>function behaves similarly to the previous Python loop. After selecting an action, performing a step, and computing the Q-update, we update the obs, rewards, done, and q_values arrays at the current time step <em>(the Q-update targets the time step </em><strong><em>t+1</em></strong><em>)</em>.</li></ol><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/5e0d8b4186660c3c723fff427e76d747/href">https://medium.com/media/5e0d8b4186660c3c723fff427e76d747/href</a></iframe><p>This additional complexity leads to an <strong>85x speed-up</strong>: we now train our agent at roughly <strong>1.83 million steps per second</strong>. 
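</p><p>To make the pattern concrete, here is a self-contained sketch of the fori_loop structure (the environment transition and policy are stubbed out with a clipped random walk, and the Q-update is omitted; all names are illustrative):</p>

```python
import jax
import jax.numpy as jnp
from jax import random


def rollout(key, timesteps: int, n_actions: int = 4):
    # 1. preallocate trajectory arrays, one slot per time step
    obs = jnp.zeros((timesteps, 2), dtype=jnp.int32)
    rewards = jnp.zeros(timesteps)
    done_flags = jnp.zeros(timesteps, dtype=bool)

    # 2. initial state (stands in for env.reset())
    state = jnp.array([4, 4], dtype=jnp.int32)
    movements = jnp.array([[0, 1], [0, -1], [1, 0], [-1, 0]])

    def fori_body(t, carry):
        obs, rewards, done_flags, state, key = carry
        # random policy stub: a real agent would act epsilon-greedily w.r.t. Q-values
        key, subkey = random.split(key)
        action = random.randint(subkey, (), 0, n_actions)
        # environment step stub: move and clip to the grid
        state = jnp.clip(state + movements[action], 0, 4)
        done = jnp.all(state == 0)
        # functional "in-place" updates at index t via .at[].set()
        obs = obs.at[t].set(state)
        rewards = rewards.at[t].set(jnp.where(done, 1.0, 0.0))
        done_flags = done_flags.at[t].set(done)
        return obs, rewards, done_flags, state, key

    # 3. run the body N times with jit-compatible control flow
    carry = (obs, rewards, done_flags, state, key)
    obs, rewards, done_flags, _, _ = jax.lax.fori_loop(0, timesteps, fori_body, carry)
    return obs, rewards, done_flags


obs, rewards, done_flags = rollout(random.PRNGKey(42), timesteps=100)
```

<p>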
Note that here, the training is done on a <strong><em>single CPU</em></strong> as the environment is simplistic.</p><p>However, <strong>end-to-end vectorization scales even better</strong> when applied to <strong>complex environments</strong> and <strong>algorithms benefitting from multiple GPUs</strong> (<a href="https://chrislu.page/blog/meta-disco/">Chris Lu’s article</a> reports a whopping <strong>4000x speed-up</strong> between a CleanRL PyTorch implementation of PPO and a JAX reproduction).</p><pre>100%|██████████| 1000000/1000000 [00:00&lt;00:00, 1837563.94it/s]<br>Total Number of steps: 1 000 000<br>Number of steps per second: 1 837 563</pre><p>After training our agent, we plot the maximal Q-value for each cell (i.e. <em>state</em>) of the GridWorld and we observe that it has effectively learned to go from the initial state (bottom right corner) to the objective (top left corner).</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fplotly.com%2F%7ERyan_pgd%2F1.embed%3Fautosize%3Dtrue&amp;display_name=Plotly&amp;url=https%3A%2F%2Fchart-studio.plotly.com%2F%7ERyan_pgd%2F1%2F&amp;image=https%3A%2F%2Fchart-studio.plotly.com%2Fstatic%2Fwebapp%2Fimages%2Fplotly-logo.8d56a320dbb8.png&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=plotly" width="600" height="400" frameborder="0" scrolling="no"><a href="https://medium.com/media/30e43fb24a5519050ecfc1e370f31219/href">https://medium.com/media/30e43fb24a5519050ecfc1e370f31219/href</a></iframe><h3><strong>Parallel agents training loop:</strong></h3><p>As promised, now that we’ve written the functions required to train a <strong>single agent</strong>, we have little to no work left to train <strong>multiple agents</strong> in <strong>parallel </strong>on batched environments!</p><p>Thanks to <strong>vmap</strong> we can quickly transform our previous functions to work on batches of data. 
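</p><p>As a sketch of what this transformation looks like (using a toy step function with the same (state, key) tuple convention; the shapes and names are illustrative):</p>

```python
import jax.numpy as jnp
from jax import jit, random, vmap


# toy step function operating on a single agent: env_state is a (state, key) tuple
def step(env_state, action):
    state, key = env_state
    movements = jnp.array([[0, 1], [0, -1], [1, 0], [-1, 0]])
    new_state = jnp.clip(state + movements[action], 0, 4)
    done = jnp.all(new_state == 0)
    reward = jnp.where(done, 1.0, 0.0)
    return (new_state, key), new_state, reward, done


# batch the function over the leading axis of every input and output
v_step = jit(vmap(step, in_axes=((0, 0), 0), out_axes=((0, 0), 0, 0, 0)))

n_agents = 30
states = jnp.tile(jnp.array([4, 4]), (n_agents, 1))   # (30, 2) batched states
keys = random.split(random.PRNGKey(0), n_agents)      # one PRNG key per agent
actions = jnp.zeros(n_agents, dtype=jnp.int32)        # one action per agent

(new_states, keys), observations, rewards, done_flags = v_step((states, keys), actions)
```

<p>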
We only have to specify the batched input and output axes, for instance for <strong><em>env.step:</em></strong></p><ul><li><strong>in_axes</strong> = ((0,0), 0) specifies the batched axis of each input, the inputs being the <em>env_state </em>tuple (axes (0, 0)) and an <em>action </em>(axis 0).</li><li><strong>out_axes </strong>= ((0, 0), 0, 0, 0) specifies the batched axis of each output, the output being ((env_state), obs, reward, done).</li><li>Now, we can call <strong><em>v_step </em></strong>on an <strong>array </strong>of <em>env_states </em>and <em>actions </em>and receive an <strong>array </strong>of processed <em>env_states</em>, <em>observations</em>, <em>rewards</em>, and <em>done flags.</em></li><li>Note that we also <strong>jit</strong> all batched functions for performance (arguably, jitting <em>env.reset() </em>is unnecessary given that it is only called once in our training function).</li></ul><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/ff345d5bfa6606bf53df5bbb041decbc/href">https://medium.com/media/ff345d5bfa6606bf53df5bbb041decbc/href</a></iframe><p>The last adjustment we have to make is to <strong>add a batch dimension</strong> to our arrays to account for each agent’s data.</p><p>By doing this, we obtain a function that allows us to train <strong>multiple agents in parallel</strong>, with minimal adjustments compared to the single-agent function:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/f3d04e39e05060ddf9b7e2cddfe44228/href">https://medium.com/media/f3d04e39e05060ddf9b7e2cddfe44228/href</a></iframe><p>We get similar performance with this version of our training function:</p><pre>100%|██████████| 100000/100000 [00:02&lt;00:00, 49036.11it/s]<br>Total Number of steps: 100 000 * 30 = 3 000 000<br>Number of steps per second: 49 036 * 30 = 1 471 080</pre><p>And that’s it! 
Thanks for reading this far! I hope this article provided a helpful introduction to implementing vectorized environments in <strong>JAX</strong>.</p><p>If you enjoyed the read, please consider <strong>sharing </strong>this article and <strong>starring </strong>my GitHub repository. Thanks for your support! 🙏</p><p><a href="https://github.com/RPegoud/jym">GitHub - RPegoud/jym: JAX implementation of RL algorithms and vectorized environments</a></p><p>Finally, for those interested in digging a little deeper, here’s a list of <strong>useful resources</strong> that helped me get started with JAX and write this article:</p><h3>A curated list of awesome JAX articles and resources:</h3><p>[1] Coderized, (functional programming) <a href="https://www.youtube.com/watch?v=HlgG395PQWw&amp;t=254s"><em>The purest coding style, where bugs are near impossible</em></a>, YouTube</p><p>[2] Aleksa Gordić, <a href="https://www.youtube.com/watch?v=SstuvS-tVc0&amp;list=PLBoQnSflObckOARbMK9Lt98Id0AKcZurq"><em>JAX From Zero to Hero YouTube Playlist</em></a><em> (2022), The AI Epiphany</em></p><p>[3] Nikolaj Goodger, <a href="https://medium.com/@ngoodger_7766/writing-an-rl-environment-in-jax-9f74338898ba"><em>Writing an RL Environment in JAX</em></a><em> (2021)</em></p><p>[4] Chris Lu, <a href="https://chrislu.page/blog/meta-disco/"><em>Achieving 4000x Speedups and Meta-Evolving Discoveries with PureJaxRL</em></a><em> (2023), </em><a href="https://www.ox.ac.uk/">University of Oxford</a>, <a href="https://www.foersterlab.com/">Foerster Lab for AI Research</a></p><p>[5] Nicholas Vadivelu, <a href="https://github.com/n2cholas/awesome-jax"><em>Awesome-JAX</em></a><em> (2020)</em>, a list of JAX libraries, projects, and resources</p><p>[6] JAX Official Documentation, <a href="https://jax.readthedocs.io/en/latest/notebooks/Neural_Network_and_Data_Loading.html"><em>Training a Simple Neural Network, with PyTorch Data Loading</em></a></p><hr><p><a href="https://medium.com/data-science/vectorize-and-parallelize-rl-environments-with-jax-q-learning-at-the-speed-of-light-49d07373adf5">Vectorize and Parallelize RL Environments with JAX: Q-learning at the Speed of Light⚡</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>