Discrete Diffusion Reading Group

Twitter / X

Research highlights & updates

YouTube

Watch session recordings

Email List

Announcements & reminders

Discord

Chat with researchers & discuss papers

Microsoft Teams

Join live weekly sessions

Latest Sessions

1:12:27

June 15, 2026

S21 | Uniform Diffusion Models Revisited: Leave-One-Out Denoiser and Absorbing State Reformulation

Samson Gourevitch (École Polytechnique), Yazid Janati (MBZUAI), and Dario Shariatian (INRIA) revisit Uniform Diffusion Models (UDMs) and show that their standard parameterization is trained by a leave-one-out posterior, which predicts each clean token without seeing its own noisy observation. Correcting for this mismatch improves UDM generation, and an absorbing-state reformulation matches masked diffusion, suggesting the gap between the two comes from parameterization and sampling rather than the choice of marginals.

Samson Gourevitch (École Polytechnique), Yazid Janati (MBZUAI), and Dario Shariatian (INRIA) present their work revisiting Uniform Diffusion Models. Discrete diffusion models are usually trained by predicting clean tokens from a noisy sequence, but that prediction can be turned into reverse dynamics in several ways. For Masked Diffusion Models (MDMs) these choices largely coincide; for Uniform Diffusion Models (UDMs) they do not. This work shows that the standard plug-in parameterization for UDMs is not optimized by the denoising posterior, but by a leave-one-out posterior that predicts each clean token without using its own noisy observation. This exposes a mismatch between the plug-in ELBO and the usual cross-entropy denoising objective. The authors characterize the leave-one-out target and derive exact conversions between the denoiser, the leave-one-out posterior, and the score, which lets them pair either parameterization with either objective. The same conversions improve inference at no extra training cost, through an informed predictor-corrector sampler and temperature sampling applied to the leave-one-out predictor. The authors further introduce an absorbing-state reformulation of uniform diffusion that preserves the UDM joint law while decomposing it into masked-diffusion-like operations, with simpler denoising posteriors, carry-over unmasking, and a natural remasking step. On language modeling, leave-one-out parameterizations consistently improve UDM generation, and the absorbing construction matches or surpasses masked diffusion. These results suggest that the empirical gap between masked and uniform diffusion comes less from the choice of marginals than from parameterization and sampling design.
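The relation between a leave-one-out posterior and the denoising posterior can be illustrated with Bayes' rule. The sketch below is not taken from the authors' code. It assumes a common uniform forward kernel in which each position is corrupted independently: a token is kept with probability `alpha` and otherwise resampled uniformly over the vocabulary. Under that assumption, p(x_i | z) is proportional to p(x_i | z_{-i}) q(z_i | x_i), so the two predictors can be converted into each other without retraining. All function names and the `alpha` parameter are hypothetical.

```python
import numpy as np


def uniform_likelihood(z, vocab_size, alpha):
    """q(z_i | x_i = v) for every position i and candidate clean token v.

    Assumed kernel: keep the token with probability alpha, otherwise
    resample uniformly, so q(z | x) = alpha * 1[z == x] + (1 - alpha) / V.
    """
    lik = np.full((len(z), vocab_size), (1.0 - alpha) / vocab_size)
    lik[np.arange(len(z)), z] += alpha
    return lik


def loo_to_denoiser(loo_probs, z, alpha):
    """Turn rows p(x_i | z_{-i}) into rows p(x_i | z) via Bayes' rule."""
    lik = uniform_likelihood(z, loo_probs.shape[1], alpha)
    post = loo_probs * lik
    return post / post.sum(axis=1, keepdims=True)


def denoiser_to_loo(den_probs, z, alpha):
    """Inverse conversion: divide out the likelihood of z_i (needs alpha < 1)."""
    lik = uniform_likelihood(z, den_probs.shape[1], alpha)
    loo = den_probs / lik
    return loo / loo.sum(axis=1, keepdims=True)
```

In this toy setting, seeing the noisy token `z_i` can only raise the probability assigned to `z_i` itself, which is one way to read the difference between the two targets.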

40:40

June 8, 2026

S20 | Closing the Autoregressive Gap with Continuous Bitstream Diffusion

Georgios Batzolis (University of Cambridge) presents a continuous diffusion language model that represents text as fixed-width binary bitstreams instead of token embeddings. An entropy-gated stochastic sampler concentrates randomness where token uncertainty is highest, which narrows the quality gap to autoregressive models, while the model predicts only O(log V) bits per token rather than a full vocabulary distribution.

Georgios Batzolis (University of Cambridge) presents Entropy-Gated Continuous Bitstream Diffusion. Diffusion language models promise parallel, order-agnostic generation, but on standard benchmarks they have long lagged behind autoregressive models in sample quality and diversity. Recent continuous flow and diffusion models over token embeddings narrowed this gap, suggesting that continuity itself is not the bottleneck. This work pushes the idea further by diffusing over bitstreams: each semantic token is encoded as a fixed-width sequence of binary bits, embedded in continuous space, and recovered from Gaussian noise by an EDM-style denoiser. Because an isolated bit has a known closed-form posterior under Gaussian corruption, the authors introduce a matched-filter residual parameterization, in which the network computes the analytic independent-bit posterior and spends its capacity only on the contextual residual. The largest gains come from the sampler. The deterministic probability-flow sampler is already competitive but over-contractive, reaching good perplexity by undershooting real-data token entropy. An entropy-gated stochastic sampler corrects this by applying Langevin-type churn on an entropy-rate grid, concentrating randomness in high-information regions and staying nearly deterministic elsewhere. On LM1B, a 130M-parameter model reaches a generative perplexity of 59.76 at matched data entropy using 256 function evaluations, surpassing prior diffusion baselines and the autoregressive reference. On OpenWebText it reaches 27.06 using four times fewer steps than prior 1024-step baselines. Because the model predicts O(log V) bitwise logits rather than a vocabulary-sized distribution, it also uses less memory and runs faster, an advantage that grows as vocabularies do.
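The bitstream encoding and the closed-form posterior of an isolated bit can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: bits are mapped to -1/+1, stored least-significant bit first, given equal prior probability, and corrupted as `x = b + sigma * noise` with standard Gaussian noise. Under those assumptions, P(b = +1 | x) = sigmoid(2x / sigma^2). All function names are hypothetical.

```python
import numpy as np


def bit_width(vocab_size):
    """Smallest number of bits that can index vocab_size tokens."""
    return max(1, (vocab_size - 1).bit_length())


def tokens_to_bits(token_ids, width):
    """Encode integer token ids as signed bits in {-1, +1}, LSB first."""
    shifts = np.arange(width)
    bits = (np.asarray(token_ids)[..., None] >> shifts) & 1
    return 2.0 * bits - 1.0


def bits_to_tokens(signed_bits):
    """Decode by thresholding each (possibly noisy) bit at zero."""
    bits = (np.asarray(signed_bits) > 0).astype(np.int64)
    weights = 1 << np.arange(bits.shape[-1])
    return (bits * weights).sum(axis=-1)


def independent_bit_posterior(x, sigma):
    """P(b = +1 | x) for an isolated bit with equal priors on -1 and +1."""
    return 1.0 / (1.0 + np.exp(-2.0 * np.asarray(x) / sigma**2))
```

With these definitions, a vocabulary of 50,257 tokens needs 16 bits per token, so the output layer would produce 16 logits per position instead of 50,257.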

1:18:39

June 1, 2026

S19 | ELF: Embedded Language Flows

Keya Hu and Linlu Qiu (MIT) present ELF (Embedded Language Flows), a continuous diffusion language model that runs Flow Matching in continuous embedding space and discretizes to tokens only at the final step. This design makes it easy to adapt image-domain techniques such as classifier-free guidance.

Keya Hu and Linlu Qiu (MIT) present ELF (Embedded Language Flows). Diffusion and flow-based models are the default choice for generating continuous data such as images and video, yet today's leading diffusion language models still work over discrete tokens. ELF asks whether continuous models can compete with only minimal special treatment of discreteness. ELF maps tokens into a continuous embedding space using an encoder needed only at training time, then learns a continuous-time Flow Matching process that denoises embeddings from Gaussian noise toward clean ones. It parameterizes the network to predict the clean embedding rather than the velocity, which lets a single shared-weight network handle both denoising and the final decoding step. Discretization happens only at the last step, where that same network projects the embedding through a learned unembedding matrix to token logits, so ELF needs no separate decoder at inference. Because the trajectory stays continuous almost everywhere, techniques from image diffusion such as classifier-free guidance carry over with little change. Across open-web text, machine translation, and summarization, ELF reaches lower generative perplexity than leading discrete models (MDLM, Duo) and concurrent continuous ones (FLM, LangFlow), using fewer sampling steps, roughly ten times fewer training tokens, and no distillation.
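The idea of predicting the clean embedding and discretizing only at the end can be sketched as below. This is a minimal illustration, not ELF's actual code. It assumes a linear interpolation path `x_t = (1 - t) * noise + t * clean`, a plain Euler sampler, and a stand-in `predict_clean` callable in place of the trained network. The names `velocity_from_clean`, `sample_embeddings`, and `decode_tokens` are hypothetical.

```python
import numpy as np


def velocity_from_clean(x_t, clean_hat, t, eps=1e-4):
    """On the assumed linear path, velocity = (clean - x_t) / (1 - t)."""
    return (clean_hat - x_t) / max(1.0 - t, eps)


def sample_embeddings(predict_clean, noise, steps):
    """Euler integration from t = 0 (pure noise) to t = 1 (clean embedding)."""
    x = noise
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        clean_hat = predict_clean(x, t)  # network output: predicted clean embedding
        x = x + dt * velocity_from_clean(x, clean_hat, t)
    return x


def decode_tokens(embeddings, unembed):
    """Discretize once at the end: project to logits and take the argmax."""
    logits = embeddings @ unembed  # shape (length, vocab_size)
    return logits.argmax(axis=-1)
```

Because the sampler only ever needs a predicted clean embedding, the same prediction can feed both the velocity update and the final projection to logits, which is consistent with the single-network design described above.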

Featured Videos

22:14

February 9, 2026

How did diffusion LLMs get so fast?

Techniques for accelerating diffusion LLMs, from self-distillation and curriculum learning to KV caching and block diffusion

This video discusses techniques for making diffusion LLMs faster, including self-distillation through time, curriculum learning, confidence scores for unmasking, guided diffusion (FlashDLM), approximate KV caching (dLLM-Cache, dKV-Cache), and block diffusion.
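Of the ideas listed, confidence-based unmasking is easy to sketch. The example below is a generic illustration and does not follow any specific paper or the video's exact procedure. It assumes a sentinel `MASK_ID`, a matrix `probs` of per-position token probabilities from some model, and a fixed budget `k` of positions to reveal per step. All of these names are hypothetical.

```python
import numpy as np

MASK_ID = -1  # hypothetical sentinel marking positions that are still masked


def unmask_most_confident(tokens, probs, k):
    """Reveal the k masked positions whose top prediction is most confident.

    tokens: (length,) integer array, MASK_ID where undecided.
    probs:  (length, vocab_size) predicted token probabilities.
    """
    tokens = tokens.copy()
    masked = np.flatnonzero(tokens == MASK_ID)
    if masked.size == 0:
        return tokens
    confidence = probs[masked].max(axis=1)    # confidence = top probability
    prediction = probs[masked].argmax(axis=1)
    chosen = np.argsort(-confidence)[:k]      # most confident first
    tokens[masked[chosen]] = prediction[chosen]
    return tokens
```

Calling this repeatedly with fresh model predictions fills in a sequence over several steps, committing easy positions first and leaving uncertain ones for later passes.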

12:27

August 3, 2025

But How Do Diffusion Language Models Actually Work?

Jia-Bin Huang explores several ideas for applying diffusion models to language modeling

Most Large Language Models (LLMs) today are autoregressive (i.e., they predict text in left-to-right order). But diffusion models offer iterative refinement, flexible control, and faster sampling. In this video, we explore several ideas for applying diffusion models to language modeling.

15:07

July 3, 2024

Simple Diffusion Language Models

Quick introduction to Masked Diffusion Language Models (MDLM) by Alexander Rush

About the Reading Group

Diffusion LLMs are faster, more controllable successors to traditional LLMs and are rapidly gaining adoption. This reading group builds a community for exchanging and debating emerging ideas in this space. While our primary focus is discrete diffusion models for language, we also welcome work on other modalities and applications, such as molecular design, drug discovery, and beyond.

Meet the Organizers

Subham Sahoo

Holds a Ph.D. from Cornell Tech, where he specialized in Diffusion Language Models. He has made foundational contributions to the field, with his work deployed at scale by Google, NVIDIA, and ByteDance across language generation and drug discovery.

Justin Deschenaux

PhD student in Machine Learning at EPFL, advised by Prof. Caglar Gulcehre. Previously interned at Apple MLR. His research interests include diffusion language models, fast generative models, and generalization.

Zhihan Yang

PhD student at Cornell CS. Previously completed his Bachelor's degrees in Mathematics and Statistics at Carleton College. He is a winner of the CRA Outstanding Undergraduate Researcher Award and his research focuses on principled, controllable, and efficient generative models.