Releases: huggingface/trl

v1.2.0

17 Apr 01:13
aca4515

Features

New SSDTrainer — Simple Self-Distillation

A new experimental SSDTrainer implements the method described in Embarrassingly Simple Self-Distillation Improves Code Generation. SSD samples completions from the model itself at a training-time temperature/truncation setting, then fine-tunes on those raw, unverified samples with standard cross-entropy loss. No reward model, verifier, teacher model, or RL: just prompts and the model.

from datasets import Dataset
from trl.experimental.ssd import SSDConfig, SSDTrainer

dataset = Dataset.from_dict({
    "prompt": [
        [{"role": "user", "content": "Write a function to add two numbers."}],
        [{"role": "user", "content": "Write a function to check if a number is prime."}],
    ],
})

trainer = SSDTrainer(
    model="Qwen/Qwen3-4B-Instruct",
    args=SSDConfig(
        output_dir="ssd-model",
        temperature=0.6,      # T_train from the paper
        top_k=20,
        top_p=0.95,
        learning_rate=5e-6,
    ),
    train_dataset=dataset,
)
trainer.train()

by @kashif in #5505

Drop, don't truncate, overlong tool results in GRPOTrainer

When tool calls produce more tokens than max_completion_length allows, GRPOTrainer now rolls back the tool messages/images added in the current iteration instead of trying to truncate them. This removes ~80 lines of fragile, image-boundary-aware bookkeeping in favor of a ~15-line snapshot-and-rollback. Since overlong samples almost always get rewarded as failures anyway, the learning signal is effectively unchanged — but the code is dramatically simpler and no longer needs per-VLM-family vision-token lookup tables.

by @qgallouedec in #5521
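The snapshot-and-rollback pattern is simple enough to sketch in a few lines. The following is an illustrative Python sketch with hypothetical names, not the actual GRPOTrainer internals:

```python
# Illustrative sketch of snapshot-and-rollback for overlong tool results
# (hypothetical names; not the actual GRPOTrainer code). Before appending a
# tool turn, record the list lengths; if the completion budget is exceeded,
# restore the lists to the snapshot instead of truncating their contents.

def run_tool_turn(messages, images, tool_outputs, max_completion_length, num_tokens):
    """Append tool results, rolling everything back if the budget is exceeded."""
    snapshot = (len(messages), len(images))  # cheap snapshot: just list lengths
    added_tokens = 0
    for out in tool_outputs:
        messages.append({"role": "tool", "content": out["text"]})
        images.extend(out.get("images", []))
        added_tokens += out["num_tokens"]
    if num_tokens + added_tokens > max_completion_length:
        # Roll back: drop everything added in this iteration.
        del messages[snapshot[0]:]
        del images[snapshot[1]:]
        return num_tokens, False  # sample ends here, typically rewarded as a failure
    return num_tokens + added_tokens, True
```

Because the rollback discards whole messages rather than cutting them mid-token, no image-boundary bookkeeping is needed.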

Expanded tool-calling model support: LLaMA 3.1 / 3.2 & DeepSeek-V3

Continuing the effort from v1.1:

  • LLaMA 3.1 and 3.2 tool-calling response schemas, with dedicated templates for identity matching. Note that these templates only support a single tool call and no content alongside the tool call — limitations inherited from the models' native templates. By @qgallouedec in #5518
  • DeepSeek-V3 training chat template with {% generation %} markers, enabling assistant-only loss masking for DeepSeek-V3 models. By @RudrenduPaul in #5527

As a result of tightened detection (see Fixes below), the list of templates reported as tool-calling capable is now accurate; notably, the basic Llama 3 template is no longer falsely classified as such.

KTO/DPO alignment push

A major cleanup sweep keeps KTOTrainer and DPOTrainer in lockstep: same initialization patterns, same config surface, same precompute behavior:

  • Add precompute_ref_batch_size to KTO (#5530)
  • Align ref_model initialization (#5534)
  • Align model initialization (#5533)
  • Support None args (#5531)
  • Remove generate_during_eval (#5551)
  • Remove model and ref adapter names (#5552)
  • Don't load ref_model when precompute_ref_log_probs is set in DPO/KTO (#5542)

All by @albertvillanova.
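As an illustration of the now-shared surface (a sketch using argument names from the PRs above; consult the current docs for the exact API), precomputing reference log-probs in KTO now mirrors DPO, including the new batch-size knob and the skipped reference-model load:

```python
from trl import KTOConfig  # DPOConfig exposes the same precompute options

config = KTOConfig(
    output_dir="kto-model",
    precompute_ref_log_probs=True,   # ref log-probs are computed once, up front;
                                     # the ref_model itself is no longer loaded
    precompute_ref_batch_size=32,    # new in KTO this release, mirroring DPO
)
```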

Other

Fixes

  • Fix supports_tool_calling falsely accepting templates that drop assistant tool_calls by @qgallouedec in #5517
  • Fix add_response_schema for VLM processors — the schema was being set on the outer processor instead of the inner tokenizer, so it had no effect. This also collapses a handful of __init__/decode-gate workarounds. By @qgallouedec in #5520
  • Remove xfail condition for Gemma 4 response_schema regex bug by @qgallouedec in #5510
  • Remove unused dependencies for judges from dev requirements by @qgallouedec in #5515

Deprecations

  • Deprecate use_transformers_paged in GRPOConfig and RLOOConfig (and remove entirely from experimental OnlineDPOConfig, GOLDConfig, SelfDistillationConfig). Will be removed from the remaining configs in v2.0.0. In a small A/B benchmark (Qwen3-0.6B GRPO), the paged path is ~20% slower and uses ~6x more peak VRAM than the default; it's also superseded by transformers continuous batching. By @qgallouedec in #5544

Documentation and Examples

CI

What's Changed

Read more

v1.1.0

12 Apr 02:15
3179965

Features

DistillationTrainer for efficient on-policy distillation

Read the blog post: https://huggingface.co/spaces/HuggingFaceTB/trl-distillation-trainer

The new DistillationTrainer implements on-policy knowledge distillation as described in On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes. It extends the ideas from the GKDTrainer with three key optimizations: a generation buffer that decouples the training microbatch size from the generation batch size (up to 40x speedup), external teacher server support so the teacher doesn't need to fit on the training GPUs, and binary-encoded logprob payloads that shrink teacher-to-student transfers by ~5x.

from datasets import load_dataset
from trl.experimental.distillation import DistillationConfig, DistillationTrainer

dataset = load_dataset("openai/gsm8k", "main", split="train")
dataset = dataset.map(
    lambda x: {"messages": [{"role": "user", "content": x["question"]}]},
    remove_columns=dataset.column_names,
)

trainer = DistillationTrainer(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    teacher_model="Qwen/Qwen2.5-7B-Instruct",
    args=DistillationConfig(
        output_dir="results/distill-qwen-gsm8k",
        lmbda=1.0,                   # fully on-policy (student generates)
        beta=1.0,                    # reverse KL
        teacher_model_init_kwargs={"torch_dtype": "bfloat16"},
    ),
    train_dataset=dataset,
)
trainer.train()

by @cmpatino in #5407, #5500 and #5501
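The payload-size win from binary encoding is easy to reproduce in isolation. This is a generic sketch of the idea, not the trainer's actual wire format: packing float32 logprobs with struct beats serializing them as a JSON list of Python floats by a wide margin:

```python
import json
import struct

# Generic sketch of binary-encoding a logprob payload (not the trainer's
# actual wire format): pack float32 values instead of serializing JSON text.
logprobs = [-0.000123456789 * i for i in range(1, 1001)]

json_payload = json.dumps(logprobs).encode("utf-8")
binary_payload = struct.pack(f"{len(logprobs)}f", *logprobs)

# Round-trip the binary payload to confirm it is lossless at float32 precision.
decoded = struct.unpack(f"{len(logprobs)}f", binary_payload)
assert len(binary_payload) == 4 * len(logprobs)
assert len(binary_payload) < len(json_payload)
```

Each float32 costs a fixed 4 bytes, while its JSON text form typically runs to 15+ characters, which is where the multiple-x transfer savings come from.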

Chunked LM head for memory-efficient log-prob computation in AsyncGRPOTrainer

AsyncGRPOTrainer now supports a chunked LM-head path that computes per-token log-probs and entropy via online logsumexp without materializing the full [N, V] logits tensor. Combined with completion_mask filtering to skip prompt tokens, this brings massive memory savings on long sequences — up to 44x lower peak-allocated memory on an 8192-token sequence:

| chunk_lm_head_size | Peak Alloc (GB) | Reduction | Wall Time (ms) |
|--------------------|-----------------|-----------|----------------|
| None (baseline)    | 18.55           | 1.00x     | 808.7          |
| 4096               | 0.42            | 44.32x    | 459.0          |
| 8192               | 0.76            | 24.34x    | 393.0          |

Enable it via the new chunk_lm_head_size option in AsyncGRPOConfig:

from trl.experimental.async_grpo import AsyncGRPOConfig, AsyncGRPOTrainer

trainer = AsyncGRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    args=AsyncGRPOConfig(chunk_lm_head_size=4096),
    ...
)

Note: mutually exclusive with use_liger_kernel (both replace the LM head forward pass).

by @AmineDiro in #5349
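The memory trick behind this path is online logsumexp: a running (max, sum) pair is updated chunk by chunk, so the full-vocabulary logits row never has to exist at once. A minimal pure-Python sketch of the idea (the real implementation operates on GPU logits tensors, not Python lists):

```python
import math

def online_logsumexp(logits, chunk_size):
    """Compute log(sum(exp(logits))) over chunks, tracking a running (max, sum)
    pair so no full-vocabulary buffer is ever needed. Pure-Python sketch of the
    idea behind chunk_lm_head_size; the real code works on GPU tensors."""
    running_max = -math.inf
    running_sum = 0.0
    for start in range(0, len(logits), chunk_size):
        chunk = logits[start:start + chunk_size]
        new_max = max(running_max, max(chunk))
        # Rescale the running sum to the new max before adding this chunk.
        running_sum = running_sum * math.exp(running_max - new_max)
        running_sum += sum(math.exp(x - new_max) for x in chunk)
        running_max = new_max
    return running_max + math.log(running_sum)

logits = [0.1 * ((i * 37) % 100) for i in range(1000)]  # stand-in for one token's logits
direct = math.log(sum(math.exp(x) for x in logits))
chunked = online_logsumexp(logits, chunk_size=128)
assert abs(direct - chunked) < 1e-9
```

A token's log-prob is then simply `logits[token_id] - logsumexp`, so the per-token quantities fall out of the chunked pass with no [N, V] tensor materialized.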

{% generation %} support in training chat templates

SFT with assistant_only_loss=True requires chat templates to include {% generation %} / {% endgeneration %} markers so that return_assistant_tokens_mask=True produces correct masks. Very few models ship these markers natively, so users hit a cryptic error when enabling assistant-only loss with models like Qwen3, Llama 3 or GPT-OSS.

SFTTrainer now automatically swaps in a patched training chat template when the original template lacks generation markers — no manual template surgery required. Training templates are shipped for Qwen2.5, Qwen3, Llama 3 and GPT-OSS, stored as standalone .jinja files under trl/chat_templates/ for readability, diffability, and editor syntax highlighting.

from trl import SFTConfig, SFTTrainer

trainer = SFTTrainer(
    model="Qwen/Qwen3-4B",
    args=SFTConfig(assistant_only_loss=True),  # now just works
    train_dataset=dataset,
)
trainer.train()

by @qgallouedec in #5459, #5470, by @RudrenduPaul in #5493 and #5522, and by @casinca in #5484
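Conceptually, the generation markers delimit the assistant-produced spans of the rendered template so a loss mask can be derived from them. A character-level toy sketch of that mechanism (illustrative only; the real templates use {% generation %} Jinja markers and the mask is over token indices, via transformers' return_assistant_tokens_mask):

```python
import re

# Illustrative sketch (not the actual transformers implementation) of what
# generation markers enable: everything between a marker pair is
# assistant-generated and receives loss; everything else is masked out.
def generation_mask(template_output):
    """Return (text, mask) where mask[i] == 1 iff character i lies inside a
    <gen>...</gen> span. Toy stand-in for {% generation %} markers."""
    text, mask, inside = [], [], False
    for piece in re.split(r"(<gen>|</gen>)", template_output):
        if piece == "<gen>":
            inside = True
        elif piece == "</gen>":
            inside = False
        else:
            text.append(piece)
            mask.extend([1 if inside else 0] * len(piece))
    return "".join(text), mask

text, mask = generation_mask("user: hi\nassistant: <gen>hello!</gen>")
assert text == "user: hi\nassistant: hello!"
assert sum(mask) == len("hello!")
```

With assistant_only_loss=True, only positions where this mask is 1 contribute to the cross-entropy loss; prompt and tool tokens are ignored.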

Expanded tool-calling model support

Agent training now supports a broader family of models via native tool-call response schemas:

  • GPT-OSS (#5464)
  • GLM-4-MoE (#5463)
  • Qwen3-VL (#5469)
  • Gemma 4 — the first model to natively ship a response schema (#5454)

A new supports_tool_calling() utility detects whether a tokenizer/processor can render a full tool-calling turn, and GRPOTrainer now validates tool support at initialization — raising a clear error upfront instead of failing cryptically mid-training.

by @qgallouedec in #5462, #5464, #5463, #5469 and #5454

Multimodal tool responses for VLM training

environment_factory tool methods can now return multimodal content blocks (images + text) for VLM training. Previously, tool responses were always converted to str(result), discarding any visual information. Now tools can return content block lists with images, and the trainer handles them end-to-end through tokenization, generation, and the forward pass — including correct pixel_values plumbing.

class ScreenshotEnv:
    def take_screenshot(self) -> list[dict]:
        return [
            {"type": "image", "image": self.browser.screenshot()},
            {"type": "text", "text": "Current page state"},
        ]

The OpenEnv browsergym.py example has been migrated to this pattern, and a new carla_vlm.py example demonstrates VLM training against CARLA with camera-image tool responses.

by @sergiopaniego in #5323 and #5437, and by @qgallouedec in #5448

Built-in reward functions now log extra columns

accuracy_reward and reasoning_accuracy_reward now emit extra diagnostic columns (solution, gold_parsed, answer_parsed) via the log_extra callback introduced in v1.0.0. These show up in the rich completions table, making it much easier to debug why a reward was (or wasn't) assigned.

from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer
from trl.rewards import accuracy_reward

dataset = load_dataset("trl-lib/DeepMath-103K", split="train")

trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=accuracy_reward,
    args=GRPOConfig(log_completions=True),
    train_dataset=dataset,
)
trainer.train()

by @qgallouedec in #5308

Other

Fixes

Read more

v1.0.0

31 Mar 14:15
f3e9ac1

Read our blog post for an overview of TRL v1.

Features

Asynchronous GRPO

Asynchronous GRPO decouples generation from the gradient update loop by offloading rollouts to an external vLLM server. Generation runs in parallel while training continues, eliminating idle GPU time and improving hardware utilization.

from trl.experimental.async_grpo import AsyncGRPOTrainer
from trl.rewards import accuracy_reward
from datasets import load_dataset

dataset = load_dataset("trl-lib/DeepMath-103K", split="train")

trainer = AsyncGRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=accuracy_reward,
    train_dataset=dataset,
)
trainer.train()

by @qgallouedec in #5293

Variational Sequence-Level Soft Policy Optimization (VESPO)

VESPO addresses training instability in off-policy RL caused by policy staleness, asynchronous updates, and train-inference mismatches. Rather than relying on heuristic token-level clipping (GRPO) or sequence-length normalization (GSPO), VESPO derives a principled reshaping kernel from a variational framework. In practice, this yields a smooth, asymmetric Gamma weighting function that gracefully suppresses extreme sequence-level importance weights without introducing length bias. It can be enabled via the loss_type parameter of GRPOConfig:

from trl import GRPOConfig, GRPOTrainer

trainer = GRPOTrainer(
    model="Qwen/Qwen3-0.6B",
    args=GRPOConfig(loss_type="vespo"),
    ...
)

by @casinca in #5199

Divergence Proximal Policy Optimization (DPPO)

DPPO is a new experimental trainer that replaces the standard PPO clipping mechanism with divergence constraints, providing more principled trust-region updates.

by @LeonEricsson in #5117

Self-Distillation Policy Optimization (SDPO)

SDPO is a new experimental trainer that augments on-policy RL with self-distillation from the model's own high-reward trajectories. Instead of using an external teacher, SDPO treats the current model conditioned on feedback as a self-teacher, distilling its feedback-informed predictions back into the policy.

from trl.experimental import SDPOTrainer, SDPOConfig

config = SDPOConfig(
    output_dir="./results",
    num_generations=8,
    success_reward_threshold=1.0,
    use_successful_as_teacher=True,
)

trainer = SDPOTrainer(
    model="Qwen/Qwen2.5-Math-1.5B-Instruct",
    reward_funcs=[accuracy_reward],
    args=config,
    train_dataset=dataset,
)
trainer.train()

by @MengAiDev in #4935

Reward functions can now log extra columns and scalar metrics

Reward functions can return a dictionary of extra values (scalars or per-sample columns) that will be logged alongside the reward. This makes it easier to track intermediate signals without writing custom callbacks.

def my_reward_fn(completions, answer, log_extra=None, log_metric=None, **kwargs):
    extracted = [extract_answer(c) for c in completions]
    rewards = [1.0 if e == a else 0.0 for e, a in zip(extracted, answer)]

    if log_extra:
        log_extra("golden_answer", list(answer))
        log_extra("extracted_answer", extracted)

    if log_metric:
        log_metric("accuracy", sum(rewards) / len(rewards))

    return rewards

by @manueldeprada in #5233

Tool calling support in VLLMClient.chat()

VLLMClient.chat() now supports tool calling, enabling agentic workflows directly through the vLLM client interface.

by @kansalaman in #4889

35% faster packing

Best-fit-decreasing (BFD) packing is now 35% faster. The "bfd-requeue" packing strategy has also been renamed to "bfd_split"; see MIGRATION.md for details.

by @mariosasko in #5189
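For reference, best-fit decreasing is the textbook bin-packing heuristic: sort sequences longest-first and place each into the fullest bin that still fits. A minimal sketch of the algorithm itself (not TRL's optimized implementation):

```python
# Minimal sketch of best-fit-decreasing (BFD) packing, the strategy used to
# pack variable-length sequences into fixed-size training bins. Textbook form,
# not TRL's optimized implementation.

def bfd_pack(lengths, bin_size):
    """Pack sequence lengths into bins of `bin_size`: visit longest sequences
    first, placing each in the bin with the smallest remaining capacity that
    still fits (best fit)."""
    bins = []  # each bin: [remaining_capacity, [lengths...]]
    for length in sorted(lengths, reverse=True):
        best = min(
            (b for b in bins if b[0] >= length),  # bins that can still fit it
            key=lambda b: b[0],                   # tightest fit wins
            default=None,
        )
        if best is None:
            best = [bin_size, []]
            bins.append(best)
        best[0] -= length
        best[1].append(length)
    return [b[1] for b in bins]

packed = bfd_pack([7, 5, 4, 3, 1], bin_size=8)
assert all(sum(b) <= 8 for b in packed)
```

In TRL, a "bin" is one packed training sequence of max_length tokens, so tighter packing directly means fewer padding tokens per batch.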

[GKD] Buffer implementation and vLLM inference for distillation trainer

The GKD/GOLD trainer now supports buffered rollout generation, decoupling generation from gradient updates for more efficient distillation. vLLM inference support has also been added to the base self-distillation trainer.

by @cmpatino in #5137 and #5388

v0 → v1 migration guide

A MIGRATION.md guide has been added covering all breaking changes when upgrading from TRL v0 to v1. If you're already on v0.29, the changes are minimal.

by @qgallouedec in #5255

Other

Fixes

  • Fix DPOTrainer collators to truncate sequences before padding by @albertvillanova in #5305
  • Prevent corruption of DPO VLM training if "keep_end" truncation_mode by @albertvillanova in #5286
  • Fix mm_token_type_ids silently dropped in DPO VLM training by @albertvillanova in #5279
  • Fix UNEXPECTED lm_head.weight warning when loading a CausalLM as a reward model by @albertvillanova in #5295
  • Fix accuracy_reward crash when called from non-main thread by @qgallouedec in #5281
  • Fix GRPOTrainer attribute access for vLLM model config by @falcondai in #5302
  • [GRPO] Fix re-tokenization bug in tool-calling loop by @qgallouedec in #5242
  • [CPO/ORPO] Fix handling of different length chosen/rejected prompts by @davmels in #4639
  • Fix RewardFunc type alias to reflect actual calling convention by @s-zx in #5246
  • fix(ppo): add gradient_checkpointing_enable/disable to PolicyAndValueWrapper by @s-zx in #5245
  • Fix prepare_multimodal_messages to support tool_calls and tool role by @alvarobartt in #5212
  • Fix support for model_init_kwargs when passed as CLI JSON string by @albertvillanova in #5230
  • Fix support for model_init_kwargs in MiniLLM when passed as CLI JSON string by @albertvillanova in #5274
  • Fix support for model_init_kwargs in GKD/GOLD when passed as CLI JSON string by @albertvillanova in #5266
  • Sync entire prompt/completion token tensors before indexing by @shawnghu in #5218
  • Clean up model update group on worker exit by @AmineDiro in #5325
  • Fix prefix EOS slicing for tool suffix (with Qwen3/3.5 chat templates) by @casinca in #5330
  • Fix: apply reward_weights to logged reward/reward_std in GRPOTrainer by @lailanelkoussy in #5353
  • Fix IDs shape mismatch in SFT for VLMs with text-only by @albertvillanova in #5354

Documentation and Examples

Read more

v1.0.0rc1

20 Mar 23:55

Pre-release

Features

Variational Sequence-Level Soft Policy Optimization (VESPO)

VESPO addresses training instability in off-policy RL caused by policy staleness, asynchronous updates, and train-inference mismatches. Rather than relying on heuristic token-level clipping (GRPO) or sequence-length normalization (GSPO), VESPO derives a principled reshaping kernel from a variational framework. In practice, this yields a smooth, asymmetric Gamma weighting function that gracefully suppresses extreme sequence-level importance weights without introducing length bias. It can be enabled via the loss_type parameter of GRPOConfig:

from trl import GRPOConfig, GRPOTrainer

trainer = GRPOTrainer(
    model="Qwen/Qwen3-0.6B",
    args=GRPOConfig(loss_type="vespo"),
    ...
)

by @casinca in #5199

Divergence Proximal Policy Optimization (DPPO)

DPPO is a new experimental trainer that replaces the standard PPO clipping mechanism with divergence constraints, providing more principled trust-region updates.

by @LeonEricsson in #5117

Reward functions can now log extra columns and scalar metrics

Reward functions can return a dictionary of extra values (scalars or per-sample columns) that will be logged alongside the reward. This makes it easier to track intermediate signals without writing custom callbacks.

def my_reward_fn(completions, answer, log_extra=None, log_metric=None, **kwargs):
    extracted = [extract_answer(c) for c in completions]
    rewards = [1.0 if e == a else 0.0 for e, a in zip(extracted, answer)]

    if log_extra:
        log_extra("golden_answer", list(answer))
        log_extra("extracted_answer", extracted)

    if log_metric:
        log_metric("accuracy", sum(rewards) / len(rewards))

    return rewards

by @manueldeprada in #5233

Tool calling support in VLLMClient.chat()

VLLMClient.chat() now supports tool calling, enabling agentic workflows directly through the vLLM client interface.

by @kansalaman in #4889

35% faster packing

Best-fit-decreasing (BFD) packing is now 35% faster. The "bfd-requeue" packing strategy has also been renamed to "bfd_split"; see MIGRATION.md for details.

by @mariosasko in #5189

[GKD] Buffer implementation for distillation trainer

The GKD/GOLD trainer now supports buffered rollout generation, decoupling generation from gradient updates for more efficient distillation.

by @cmpatino in #5137

v0 → v1 migration guide

A MIGRATION.md guide has been added covering all breaking changes when upgrading from TRL v0 to v1. If you're already on v0.29, the changes are minimal.

by @qgallouedec in #5255

Other

Fixes

  • Fix DPOTrainer collators to truncate sequences before padding by @albertvillanova in #5305
  • Prevent corruption of DPO VLM training if "keep_end" truncation_mode by @albertvillanova in #5286
  • Fix mm_token_type_ids silently dropped in DPO VLM training by @albertvillanova in #5279
  • Fix UNEXPECTED lm_head.weight warning when loading a CausalLM as a reward model by @albertvillanova in #5295
  • Fix accuracy_reward crash when called from non-main thread by @qgallouedec in #5281
  • Fix GRPOTrainer attribute access for vLLM model config by @falcondai in #5302
  • [GRPO] Fix re-tokenization bug in tool-calling loop by @qgallouedec in #5242
  • [CPO/ORPO] Fix handling of different length chosen/rejected prompts by @davmels in #4639
  • Fix RewardFunc type alias to reflect actual calling convention by @s-zx in #5246
  • fix(ppo): add gradient_checkpointing_enable/disable to PolicyAndValueWrapper by @s-zx in #5245
  • Fix prepare_multimodal_messages to support tool_calls and tool role by @alvarobartt in #5212
  • Fix support for model_init_kwargs when passed as CLI JSON string by @albertvillanova in #5230
  • Fix support for model_init_kwargs in MiniLLM when passed as CLI JSON string by @albertvillanova in #5274
  • Fix support for model_init_kwargs in GKD/GOLD when passed as CLI JSON string by @albertvillanova in #5266
  • Sync entire prompt/completion token tensors before indexing by @shawnghu in #5218
  • Clean up model update group on worker exit by @AmineDiro in #5325

Documentation and Examples

What's Changed

Read more

v0.29.1

20 Mar 03:57

What's Changed

  • Handle mm_token_type_ids in SFT/GRPO/RLOO to fix IndexError by @albertvillanova in #5178
  • Fix prepare_multimodal_messages to support tool_calls and tool role by @alvarobartt in #5212
  • Fix type for model_init_kwargs when passed as CLI JSON string by @albertvillanova in #5230
  • Decouple rollout dispatch from vLLM backend in GRPO _generate_single_turn by @albertvillanova in #5122
  • Simplify logic for structured outputs across vLLM versions by @albertvillanova in #5215
  • Add support for raw ids in prompts in vLLM client and server by @qgallouedec in #5225
  • Add VLM support when passing raw token IDs to vLLM client by @qgallouedec in #5227
  • Move rollout_func from _generate_single_turn to _generate by @qgallouedec in #5232
  • [GRPO/RLOO] Tokenize before vLLM generation call by @qgallouedec in #5238
  • Support JSON string parsing of teacher_model_init_kwargs in MiniLLMConfig by @albertvillanova in #5259
  • [GRPO/RLOO] Unify tokenization across all generation backends in _generate_single_turn by @qgallouedec in #5239
  • [GRPO/RLOO] Extract tokenize prompts from _generate_single_turn by @qgallouedec in #5240
  • [CPO/ORPO] Fix handling of different length chosen/rejected prompts. by @davmels in #4639
  • Fix type for teacher_model_init_kwargs when passed as CLI JSON string by @albertvillanova in #5258
  • Fix support for model_init_kwargs in GKD/GOLD when passed as CLI JSON string by @albertvillanova in #5266
  • Fix mm_token_type_ids silently dropped in DPO VLM training by @albertvillanova in #5279
  • Fix support for model_init_kwargs in MiniLLM when passed as CLI JSON string by @albertvillanova in #5274
  • Fix GRPOTrainer attribute access for vLLM model config by @falcondai in #5302
  • [GRPO] Fix re-tokenization bug in tool-calling loop by concatenating token IDs by @qgallouedec in #5242

New Contributors

Full Changelog: v0.29.0...v0.29.1

v0.29.0

25 Feb 22:38
d24e194

Features

Add environment_factory to GRPOTrainer

GRPOTrainer now accepts an environment_factory argument, allowing users to specify a custom environment class for training. This enables more flexible and diverse training scenarios by letting users define their own environments with specific dynamics and reward structures.

from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

dataset = Dataset.from_dict({
    "prompt": [[{"role": "user", "content": f"Increment the counter by {i}."}] for i in range(1, 7)]
})

def reward_func(environments, **kwargs):
    return [env.counter for env in environments]

class IncrementEnv:
    def reset(self):
        self.counter = 0

    def increment(self, step: int) -> int:
        """
        Increment the internal counter.

        Args:
            step: Value to add to the counter.

        Returns:
            The updated counter value.
        """
        self.counter += step
        return self.counter

trainer = GRPOTrainer(
    model="Qwen/Qwen3-0.6B",
    args=GRPOConfig(chat_template_kwargs={"enable_thinking": False}),
    train_dataset=dataset,
    reward_funcs=reward_func,
    environment_factory=IncrementEnv,
)
trainer.train()

by @qgallouedec in #5093

Skills

TRL introduces agent-native CLI integration: trl-training, a first-class Agent Skill that exposes TRL's training workflows (SFT, DPO, GRPO, etc.) in a structured, agent-readable format. The skill is packaged directly with the trl library and can be installed via the CLI:

# Install into the project's agent directory (default scope=project), by agent name: claude, codex, opencode
trl skills install trl-training --target <agent>

This enables AI agents to safely and reproducibly execute TRL training workflows using a well-defined interface.

Skills can be installed at the project or global scope, and support explicit targets and overwrite controls.

Other

Fixes

Documentation and Examples

Deprecations

CI Improvements

  • Upgrade GitHub Actions to latest versions by @salmanmkc in #4893
  • Remove duplicated tests for SFT and add gradient checkpointing tests by @qgallouedec in #5054
  • Up...
Read more

v0.28.0

10 Feb 13:28
49ef334

Features

Experimental

Fixes

Documentation and Examples

Deprecations

CI Improvements

Miscellaneous

What's Changed

Read more

v0.27.2

03 Feb 18:10

What's Changed

  • Remove access to warnings_issued by @qgallouedec in #4960
  • Fix SFTTrainer init logic: remove TrainingArguments.push_to_hub_token only for transformers < v5 by @albertvillanova in #4942
  • Fix extra EOS appended in DPO preprocessing for conversational data by @qgallouedec in #4908

Full Changelog: v0.27.1...v0.27.2

v0.27.1

24 Jan 03:42

What's Changed

  • Fix: undefined current_gradient_accumulation_steps by @qgallouedec in #4852
  • fix(DeepSeek OPSM): passing correct (vLLM) logprobs by @casinca in #4857
  • Fix SFT training for prompt-completion type and transformers v5 by @qgallouedec in #4880
  • Bugfix: Logprob drift in vLLM serving mode (compared to colocate mode) by @kdubovikov in #4873
  • Fix RewardTrainer's results not reproducible by @liyc-ai in #4887

New Contributors

Full Changelog: v0.27.0...v0.27.1

v0.27.0

16 Jan 02:34
17acd61

Features

  • Add vllm_group_port argument to GRPO, RLOO and OnlineDPO configuration by @pointerhacker in #4545
  • Preserve truncated tokens in BFD packing by @qgallouedec in #4632
  • Support async reward functions and parallelize call to reward functions. by @pramodith in #4567
  • RLOO supports async rewards. by @pramodith in #4718
  • Support vLLM 0.12.0 by @jiqing-feng in #4117
  • feat: DeepSeek V3.2 Off-policy sequence masking by @casinca in #4689
  • 🎭 Up to 50% less VRAM during forward with forward_masked_logits function by @qgallouedec in #4729
  • [GRPO] Add a config to limit the number of tool calling iterations by @pramodith in #4761
  • Switch gradient checkpointing default to use_reentrant=False (PyTorch recommended) by @qgallouedec in #4811
  • Add support for GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization by @nbasyl in #4785

Experimental

  • Move AutoModelForCausalLMWithValueHead and AutoModelForSeq2SeqLMWithValueHead to experimental by @qgallouedec in #4654
  • Move DPODataCollatorWithPadding to experimental.utils by @qgallouedec in #4667
  • Move DataCollatorForChatML to experimental.utils by @qgallouedec in #4668
  • Move add_bos_token_if_needed and add_eos_token_if_needed to experimental.utils by @qgallouedec in #4674
  • Move truncate_right and SIMPLE_CHAT_TEMPLATE to experimental.utils by @qgallouedec in #4677
  • Move prepare_model_for_kbit_training, enable_gradient_checkpointing, prepare_peft_model to experimental.utils by @qgallouedec in #4704
  • Move get_reward function to experimental.utils by @qgallouedec in #4683
  • Remove experimental imports from testing_utils by @albertvillanova in #4727
  • ORPO: Avoid catastrophic cancellation in loss function by @hartmans in #4763
  • Refactor KTO [1/N]: Modernize model initialization by @albertvillanova in #4783
  • [GOLD] add probability merging fix to implement chain rule by @kashif in #4765
  • Refactor KTO coordinated with DPO [a/N]: Remove encoder-decoder support by @albertvillanova in #4792
  • Refactor KTO coordinated with DPO [b/N]: Simplify truncation logic by @albertvillanova in #4808

Fixes

  • Accounting for case num_generations_eval=1 in the calculation of the advantage by @qgallouedec in #4662
  • Fix vLLM error for tools usage not supported when running GRPO training by @apalmas-saifh in #4663
  • Fix GRPO config validation in case num_generations_eval is specified and different than num_generations by @apalmas-saifh in #4682
  • Fix top_k default value to 0 for disabling top-k filtering by @albertvillanova in #4695
  • Include generation_config for tiny model uploads by @qgallouedec in #4643
  • Fix KeyError with transformers 5.0.0+ where push_to_hub_token is removed by @Manodeepray in #4691
  • Overwrite model default generation config used by model.generate by @albertvillanova in #4647
  • Fix: handle multiple tool calls in qwen3_schema by @mattbui in #4709
  • Fix bugs when using multi-gpu: dataset streaming for offline trainers + dtype initialization by @kaixuanliu in #3950
  • Ensure llm-blender is importable with transformers >= v5 by @albertvillanova in #4781
  • Monkey patch for HybridCache in Liger-Kernel with transformers v5 by @qgallouedec in #4798
  • [fix] GRPOTrainer: proper access args by @carlyou in #4801
  • Fix vllm compat patches to be applied only to affected versions by @albertvillanova in #4815
  • fix bug when sft calc outputs.token_accuracy by @kaixuanliu in #4814
  • fix xpu vllm client server by @jiqing-feng in #4780

Documentation and Examples

Deprecations

CI Improvements

Read more