Before that, I was on the Llama research team. I spearheaded the prototype and algorithmic recipes for online RL and, as part of a small team, scaled the training to Llama 3.3 and Llama 4. I also worked on post-training for reasoning.
And before that, I spent a few years at DeepMind London, where I was a core contributor to Gemini v1-1.5 post-training with a focus on tool use and agents. I also researched various aspects of deep RL algorithms and systems.
Previously, I was a two-time intern at DeepMind Paris hosted by Rémi Munos. I obtained my PhD at Columbia University in New York City.
[2/2026] Opus 4.6 and its predecessor 4.5 achieve frontier performance across the board, especially in agentic coding. Kudos to the team.
[9/2025] Magistral 1.2 achieves frontier performance on reasoning and coding benchmarks.
[6/2025] Magistral is the first Mistral reasoning model. All credit to the amazing team! See here for details.
[6/2025] LlamaRL is the first large-scale RL stack internal to Llama research. Thank you to my close collaborators!
[5/2025] Our new work on scaling RL to unverifiable domains such as long-form data is out!
[4/2025] Llama 4 is out (see blog post here). It is the first major Llama release trained with a large-scale RL stack.
[12/2024] Llama 3.3 is out. Great to have spearheaded the RL stack that produced a highly performant yet cost-efficient open source model.
[7/2024] Great to be part of the project led by my excellent collaborators to benchmark scalable-oversight protocols.
[5/2024] We investigated the importance of on-policy sampling in language model alignment; check it out here!
[5/2024] Four papers accepted at ICML 2024. Thank you to my coauthors for the heavy lifting.
[3/2024] Gemini 1.5 is announced. Check out the tech report here.
[12/2023] Gemini is launched. Great to be a core contributor to Gemini, the most powerful multi-modal large language model developed by Google DeepMind. Check out the tech report here.
Research
Besides building frontier models, I also enjoy scientific research.
My past work focused on the understanding and development of deep reinforcement learning algorithms and systems, spanning the following non-exhaustive list of topics:
Magistral is the first reasoning model by Mistral. The technical report documents valuable practical insights for reasoning-focused post-training. All credit to the team.
We highlight two pitfalls in building KL gradient estimators for RL applications, as well as their practical impact. Such pitfalls are commonly observed in recent RL implementations and research papers on LLM fine-tuning.
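For context, here is a minimal sketch (not the paper's exact construction) of two Monte Carlo estimators of the KL between the policy and a reference model, as they often appear in LLM fine-tuning code; the k1/k3 naming and the per-token log-probability tensors are assumptions of this sketch. Both give unbiased estimates of the KL value, but backpropagating through them naively does not automatically yield an unbiased gradient of the KL, since the samples themselves depend on the policy parameters.

```python
import torch

def kl_estimates(logp_policy: torch.Tensor, logp_ref: torch.Tensor):
    """Monte Carlo estimates of KL(policy || ref) from tokens sampled from the policy.

    Args:
        logp_policy: log-probabilities of the sampled tokens under the policy, shape [batch, seq].
        logp_ref: log-probabilities of the same tokens under the reference model, shape [batch, seq].
    """
    log_ratio = logp_policy - logp_ref                # log(pi / ref) at the sampled tokens
    k1 = log_ratio                                    # unbiased for the KL value, higher variance
    k3 = torch.exp(-log_ratio) - 1.0 + log_ratio      # unbiased for the KL value, lower variance
    return k1.mean(), k3.mean()
```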
LlamaRL is the first distributed asynchronous off-policy RL stack that powers large-scale Llama training internal to Llama research. In this report, we detail the key infrastructure and algorithmic designs that help stabilize RL training of LLMs at scale. Example applications include Llama 3.3, Llama 4, and their future extended releases.
We extend RL training beyond verifiable rewards with Jensen's Evidence Lower Bound Policy Optimization (JEPO). JEPO can train on long-form data with no easy verifier and is compatible with large-scale training stacks. Our work unlocks a new paradigm for fine-tuning models on richer data sources.
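As background (a paraphrase of the name, not necessarily the paper's exact objective), the bound in question is the evidence lower bound obtained from Jensen's inequality, where $z$ is a latent variable (e.g. a sampled chain of thought) and $q$ is a sampling distribution over $z$:

$$
\log p_\theta(y \mid x) \;=\; \log \mathbb{E}_{z \sim q}\!\left[\frac{p_\theta(y, z \mid x)}{q(z)}\right] \;\ge\; \mathbb{E}_{z \sim q}\!\left[\log \frac{p_\theta(y, z \mid x)}{q(z)}\right].
$$

Optimizing such a lower bound provides a tractable training signal when the evidence itself cannot be checked by a verifier; how JEPO instantiates $z$, $q$, and the reward is detailed in the paper.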
We analyze an asymmetric version of the REINFORCE algorithm, which naturally appears in the off-policy case. The baseline function plays a key role here beyond variance reduction.
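For reference, the standard on-policy REINFORCE gradient with a baseline $b$ is

$$
\nabla_\theta J(\theta) \;=\; \mathbb{E}_{y \sim \pi_\theta}\big[(R(y) - b)\,\nabla_\theta \log \pi_\theta(y)\big],
$$

where any baseline independent of $y$ leaves the gradient unbiased and only reduces variance. When the samples come from a different behavior policy without correction, the baseline term no longer vanishes in expectation, which is why the baseline takes on a role beyond variance reduction in the off-policy case.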
We find that simple inference-aware fine-tuning algorithms can greatly improve test-time performance, as evaluated on a large suite of code generation and reasoning tasks.
We find that soft policy optimization, a novel policy optimization method inspired by regularized RL and specialized to sequence learning, outperforms PPO on code generation tasks.
Llama 3.3 is the first Llama model trained with a large-scale RL stack; it is moderate in model size while approaching the performance of Llama 3.1 405B in certain domains.
Is online RL really necessary for AI alignment, or do offline algorithms suffice? According to our careful ablations, online RL does indeed seem necessary.
On scalable oversight with weak LLMs judging strong LLMs
Zachary Kenton*, Noah Y. Siegel*, Janos Kramar, Jonah Brown-Cohen, Samuel Albanie, Jannis Bulian, Rishabh Agarwal, David Lindner, Yunhao Tang, Noah D. Goodman, Rohin Shah
Arxiv, NeurIPS 2024
We benchmarked important existing scalable-oversight protocols on a comprehensive suite of QA tasks, opening the path for future investigation.
We seek to close another important theory-practice gap in distributional RL: the KL-divergence-based implementation of the categorical projection.
Offline Regularised Reinforcement Learning for Large Language Models Alignment
Pierre Harvey Richemond, Yunhao Tang, Daniel Guo, Daniele Calandriello, Mohammad Gheshlaghi Azar, Rafael Rafailov, Bernardo Avila Pires, Eugene Tarassov, Lucas Spangher, Will Ellsworth, Aliaksei Severyn, Jonathan Mallinson, Lior Shani, Gil Shamir, Rishabh Joshi, Tianqi Liu, Rémi Munos, Bilal Piot
Arxiv
When human feedback is pointwise rather than pairwise, we propose direct reward optimization (DRO) as the alignment algorithm.
Online preference optimization as an alignment technique turns out to be intimately related to Nash equilibrium, besides being a competitive algorithm for RLHF.
GPO unifies alignment algorithms such as DPO, IPO, and SLiC as special cases. The insight, interestingly, is based on the classic literature on convex losses for binary classification. At the end of the day, all algorithmic variants exhibit a similar performance-regularization trade-off, though their natural regularization strengths differ.
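Schematically (a paraphrase of the unification, with notation of my own), the objective applies a convex loss $f$ to the reference-adjusted log-ratio margin,

$$
\mathcal{L}(\theta) \;=\; \mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[ f\!\left(\beta\left(\log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right)\right],
$$

where (up to constants) a logistic $f$ recovers DPO, a hinge $f$ recovers SLiC, and a squared $f$ recovers IPO.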
One of the most powerful multi-modal large language models in the world to date.
Nash Learning from Human Feedback
Rémi Munos*, Michal Valko*, Daniele Calandriello*, Mohammad Gheshlaghi Azar*, Mark Rowland*, Daniel Guo*, Yunhao Tang*, Matthieu Geist*, Thomas Mesnard, Andrea Michi, Marco Selvi, Sertan Girgin, Nikola Momchev, Olivier Bachem, Daniel J. Mankowitz, Doina Precup and Bilal Piot*
Arxiv, ICML 2024
In aligning large language models, we search for the Nash equilibrium naturally defined by pairwise human feedback. This approach is more general-purpose, imposes fewer assumptions on reward modeling, and performs better than canonical RLHF.
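Schematically (regularization terms omitted), the target policy is the Nash equilibrium of the symmetric two-player game induced by a pairwise preference model $\mathcal{P}$:

$$
\pi^\star \;=\; \arg\max_{\pi}\,\min_{\pi'}\;\mathbb{E}_{x,\;y \sim \pi(\cdot\mid x),\;y' \sim \pi'(\cdot\mid x)}\big[\mathcal{P}(y \succ y' \mid x)\big].
$$

Because the objective only queries pairwise preferences, it avoids committing to a scalar reward model.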
We propose VA-learning as a more sample-efficient alternative to Q-learning. The sample efficiency stems from value sharing between different actions. Intriguingly, VA-learning closely relates to the dueling architecture.
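Roughly, the decomposition at play is the familiar

$$
Q(s, a) \;\approx\; V(s) + A(s, a),
$$

so that every transition can update the shared $V(s)$ regardless of which action was taken; the dueling network architecture applies the same decomposition at the level of function approximation.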
We design an off-policy actor-critic algorithm based on multi-step policy improvement and policy evaluation. This algorithm improves over the state-of-the-art IMPALA baseline.
We provide a characterization of how TD-learning learns representations, relating TD-learning with random rewards to the spectral decomposition of the transition matrix.
Quantile Credit Assignment
Thomas Mesnard, Wenqi Chen, Alaa Saade, Yunhao Tang, Mark Rowland, Theophane Weber, Clare Lyle, Audrunas Gruslys, Michal Valko, Will Dabney, Georg Ostrovski, Eric Moulines, Rémi Munos
Arxiv, ICML 2023, Oral
Efficient credit assignment should account for external factors outside of the agent's control, or more informally, the level of luck. We formalize such intuitions into quantile credit assignment.
We show that in certain cases quantile TD outperforms TD in mean value prediction. This hints at a general potential of distributional RL to outperform mean-based RL at its own game.
Understanding Self-Predictive Learning for Reinforcement Learning
Yunhao Tang, Zhaohan Daniel Guo, Pierre Harvey Richemond, Bernardo Avila Pires, Yash Chandak, Rémi Munos, Mark Rowland, Mohammad Gheshlaghi Azar, Charline Le Lan, Clare Lyle, Andras Gyorgy, Shantanu Thakoor, Will Dabney, Bilal Piot, Daniele Calandriello, Michal Valko
Arxiv, ICML 2023
Self-predictive learning is a popular representation learning algorithm in RL, which learns a latent representation by predicting (bootstrapping) its own future latents. Intuitively, the algorithm should not work, as it can collapse to trivial solutions. We identify the algorithmic components that prevent the collapse and show that self-predictive learning is related to gradient-based spectral decomposition of the transition dynamics.
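A minimal sketch of the latent bootstrapping setup, with module names and dimensions of my own choosing rather than the paper's exact architecture or training scheme; the stop-gradient on the bootstrapped target is the kind of algorithmic component at stake.

```python
import torch
import torch.nn as nn

obs_dim, act_dim, latent_dim = 8, 2, 16                    # hypothetical dimensions

encoder = nn.Linear(obs_dim, latent_dim)                   # maps observations to latents
predictor = nn.Linear(latent_dim + act_dim, latent_dim)    # predicts the next latent

def self_prediction_loss(obs, action, next_obs):
    z = encoder(obs)
    z_next_target = encoder(next_obs).detach()             # bootstrapped target, no gradient (semi-gradient)
    z_next_pred = predictor(torch.cat([z, action], dim=-1))
    return ((z_next_pred - z_next_target) ** 2).mean()

loss = self_prediction_loss(torch.randn(32, obs_dim),
                            torch.randn(32, act_dim),
                            torch.randn(32, obs_dim))
```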
We provide a first proof of the convergence of quantile TD-learning, a distributional RL algorithm that drives multiple recent empirical breakthroughs.
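For reference, the tabular update being analyzed looks roughly as follows (notation mine): with quantile levels $\tau_i = \frac{2i - 1}{2m}$ and a sampled transition $(s, r, s')$,

$$
\theta_i(s) \;\leftarrow\; \theta_i(s) + \frac{\alpha}{m}\sum_{j=1}^{m}\Big(\tau_i - \mathbb{1}\big\{ r + \gamma\,\theta_j(s') < \theta_i(s) \big\}\Big),
$$

i.e. a stochastic (sub)gradient step on the quantile regression loss against the bootstrapped target.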
We identify a few intriguing and fundamental differences between value-based TD-learning and distributional TD-learning.
BYOL-Explore: Exploration by Bootstrapped Prediction
Zhaohan Daniel Guo*, Shantanu Thakoor*, Miruna Pislar*, Bernardo Avila Pires*, Florent Altche*, Corentin Tallec*, Alaa Saade, Daniele Calandriello, Jean-Bastien Grill, Yunhao Tang, Michal Valko, Rémi Munos, Mohammad Gheshlaghi Azar*, Bilal Piot*
Arxiv, NeurIPS 2022
We find that the self-prediction loss is a surprisingly useful signal for exploration in extremely challenging deep RL domains. Our method, BYOL-Explore, partially cracks a wide range of extremely hard exploration problems much more efficiently than prior methods.
How can one estimate high-order derivatives of value functions? We propose a unifying framework based on off-policy evaluation: direct differentiation of off-policy estimates produces estimates of high-order derivatives of value functions and instantiates many prior methods as special cases.
Uncorrected multi-step updates such as n-step Q-learning are ubiquitous in modern deep RL practice. We revisit Peng's Q($\lambda$), a classic uncorrected multi-step variant. Our analysis sheds light on why uncorrected updates should work in practice, and the empirical results also suggest significant gains on benchmark tasks.
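One common way to write the Peng's Q($\lambda$) return is the recursion

$$
G_t^{\lambda} \;=\; r_t + \gamma\Big[(1-\lambda)\max_{a} Q(s_{t+1}, a) + \lambda\, G_{t+1}^{\lambda}\Big],
$$

which interpolates between one-step Q-learning ($\lambda = 0$) and the uncorrected on-trajectory return ($\lambda = 1$) without any off-policy correction.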
We propose Hindsight Expectation Maximization (hEM), an EM algorithm for goal-conditioned RL that combines supervised learning in the M-step with hindsight goal sampling in the E-step. We also draw an intimate connection between hindsight replay and importance sampling for rare-event simulation.
Why is self-imitation learning efficient? We shed light on its connections to n-step Q-learning and show that part of its gains might be attributed to trade-offs in RL operators. We also propose an n-step extension of self-imitation learning that incorporates the strengths of both n-step updates and lower-bound learning.
We establish an interpretation of MCTS as policy optimization. This interpretation leads to algorithmic variants which naturally improve over MCTS-based baselines such as AlphaZero and MuZero.
We establish intimate connections between trust-region policy search and off-policy evaluation. The new algorithm, TayPO, generalizes policy optimization objectives to high-order extensions, which leads to gains for large-scale distributed agents.
We formulate cutting-plane selection as a sequential decision-making problem for generic integer programming. The cutting-plane agent learned via RL improves over human-designed heuristics and benefits downstream applications such as branch-and-cut.