Releases: huggingface/trl

v1.2.0

17 Apr 01:13
aca4515

Features

New SSDTrainer — Simple Self-Distillation

A new experimental SSDTrainer implements the method described in Embarrassingly Simple Self-Distillation Improves Code Generation. SSD samples completions from the model itself at a training-time temperature/truncation setting, then fine-tunes on those raw, unverified samples with standard cross-entropy loss. No reward model, verifier, teacher model, or RL: just prompts and the model.

from datasets import Dataset
from trl.experimental.ssd import SSDConfig, SSDTrainer

dataset = Dataset.from_dict({
    "prompt": [
        [{"role": "user", "content": "Write a function to add two numbers."}],
        [{"role": "user", "content": "Write a function to check if a number is prime."}],
    ],
})

trainer = SSDTrainer(
    model="Qwen/Qwen3-4B-Instruct",
    args=SSDConfig(
        output_dir="ssd-model",
        temperature=0.6,      # T_train from the paper
        top_k=20,
        top_p=0.95,
        learning_rate=5e-6,
    ),
    train_dataset=dataset,
)
trainer.train()

by @kashif in #5505

Drop, don't truncate, overlong tool results in GRPOTrainer

When tool calls produce more tokens than max_completion_length allows, GRPOTrainer now rolls back the tool messages/images added in the current iteration instead of trying to truncate them. This removes ~80 lines of fragile, image-boundary-aware bookkeeping in favor of a ~15-line snapshot-and-rollback. Since overlong samples almost always get rewarded as failures anyway, the learning signal is effectively unchanged — but the code is dramatically simpler and no longer needs per-VLM-family vision-token lookup tables.

by @qgallouedec in #5521
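The snapshot-and-rollback pattern is simple enough to sketch in a few lines. The following is an illustrative Python sketch with hypothetical names, not the actual GRPOTrainer internals:

```python
# Illustrative sketch of snapshot-and-rollback for overlong tool results
# (hypothetical names; not the actual GRPOTrainer code). Before appending a
# tool turn, record the list lengths; if the completion budget is exceeded,
# restore the lists to the snapshot instead of truncating their contents.

def run_tool_turn(messages, images, tool_outputs, max_completion_length, num_tokens):
    """Append tool results, rolling everything back if the budget is exceeded."""
    snapshot = (len(messages), len(images))  # cheap snapshot: just list lengths
    added_tokens = 0
    for out in tool_outputs:
        messages.append({"role": "tool", "content": out["text"]})
        images.extend(out.get("images", []))
        added_tokens += out["num_tokens"]
    if num_tokens + added_tokens > max_completion_length:
        # Roll back: drop everything added in this iteration.
        del messages[snapshot[0]:]
        del images[snapshot[1]:]
        return num_tokens, False  # sample ends here, typically rewarded as a failure
    return num_tokens + added_tokens, True
```

Because the rollback discards whole messages rather than cutting them mid-token, no image-boundary bookkeeping is needed.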

Expanded tool-calling model support: LLaMA 3.1 / 3.2 & DeepSeek-V3

Continuing the effort from v1.1:

  • LLaMA 3.1 and 3.2 tool-calling response schemas, with dedicated templates for identity matching. Note that these templates only support a single tool call and no content alongside the tool call — limitations inherited from the models' native templates. By @qgallouedec in #5518
  • DeepSeek-V3 training chat template with {% generation %} markers, enabling assistant-only loss masking for DeepSeek-V3 models. By @RudrenduPaul in #5527

As a result of tightened detection (see Fixes below), the list of templates reported as tool-calling capable is now accurate; notably, the basic Llama 3 template is no longer falsely classified as such.

KTO/DPO alignment push

A major cleanup sweep keeps KTOTrainer and DPOTrainer in lockstep: same initialization patterns, same config surface, same precompute behavior:

  • Add precompute_ref_batch_size to KTO (#5530)
  • Align ref_model initialization (#5534)
  • Align model initialization (#5533)
  • Support None args (#5531)
  • Remove generate_during_eval (#5551)
  • Remove model and ref adapter names (#5552)
  • Don't load ref_model when precompute_ref_log_probs is set in DPO/KTO (#5542)

All by @albertvillanova.
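As an illustration of the now-shared surface (a sketch using argument names from the PRs above; consult the current docs for the exact API), precomputing reference log-probs in KTO now mirrors DPO, including the new batch-size knob and the skipped reference-model load:

```python
from trl import KTOConfig  # DPOConfig exposes the same precompute options

config = KTOConfig(
    output_dir="kto-model",
    precompute_ref_log_probs=True,   # ref log-probs are computed once, up front;
                                     # the ref_model itself is no longer loaded
    precompute_ref_batch_size=32,    # new in KTO this release, mirroring DPO
)
```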

Other

Fixes

  • Fix supports_tool_calling falsely accepting templates that drop assistant tool_calls by @qgallouedec in #5517
  • Fix add_response_schema for VLM processors — the schema was being set on the outer processor instead of the inner tokenizer, so it had no effect. This also collapses a handful of __init__/decode-gate workarounds. By @qgallouedec in #5520
  • Remove xfail condition for Gemma 4 response_schema regex bug by @qgallouedec in #5510
  • Remove unused dependencies for judges from dev requirements by @qgallouedec in #5515

Deprecations

  • Deprecate use_transformers_paged in GRPOConfig and RLOOConfig (and remove entirely from experimental OnlineDPOConfig, GOLDConfig, SelfDistillationConfig). Will be removed from the remaining configs in v2.0.0. In a small A/B benchmark (Qwen3-0.6B GRPO), the paged path is ~20% slower and uses ~6x more peak VRAM than the default; it's also superseded by transformers continuous batching. By @qgallouedec in #5544

Documentation and Examples

CI

What's Changed

Read more

v1.1.0

12 Apr 02:15
3179965

Features

DistillationTrainer for efficient on-policy distillation

Read the blog post: https://huggingface.co/spaces/HuggingFaceTB/trl-distillation-trainer

The new DistillationTrainer implements on-policy knowledge distillation as described in On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes. It extends the ideas from the GKDTrainer with three key optimizations: a generation buffer that decouples the training microbatch size from the generation batch size (up to 40x speedup), external teacher server support so the teacher doesn't need to fit on the training GPUs, and binary-encoded logprob payloads that shrink teacher-to-student transfers by ~5x.

from datasets import load_dataset
from trl.experimental.distillation import DistillationConfig, DistillationTrainer

dataset = load_dataset("openai/gsm8k", "main", split="train")
dataset = dataset.map(
    lambda x: {"messages": [{"role": "user", "content": x["question"]}]},
    remove_columns=dataset.column_names,
)

trainer = DistillationTrainer(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    teacher_model="Qwen/Qwen2.5-7B-Instruct",
    args=DistillationConfig(
        output_dir="results/distill-qwen-gsm8k",
        lmbda=1.0,                   # fully on-policy (student generates)
        beta=1.0,                    # reverse KL
        teacher_model_init_kwargs={"torch_dtype": "bfloat16"},
    ),
    train_dataset=dataset,
)
trainer.train()

by @cmpatino in #5407, #5500 and #5501
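The payload-size win from binary encoding is easy to reproduce in isolation. This is a generic sketch of the idea, not the trainer's actual wire format: packing float32 logprobs with struct beats serializing them as a JSON list of Python floats by a wide margin:

```python
import json
import struct

# Generic sketch of binary-encoding a logprob payload (not the trainer's
# actual wire format): pack float32 values instead of serializing JSON text.
logprobs = [-0.000123456789 * i for i in range(1, 1001)]

json_payload = json.dumps(logprobs).encode("utf-8")
binary_payload = struct.pack(f"{len(logprobs)}f", *logprobs)

# Round-trip the binary payload to confirm it is lossless at float32 precision.
decoded = struct.unpack(f"{len(logprobs)}f", binary_payload)
assert len(binary_payload) == 4 * len(logprobs)
assert len(binary_payload) < len(json_payload)
```

Each float32 costs a fixed 4 bytes, while its JSON text form typically runs to 15+ characters, which is where the multiple-x transfer savings come from.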

Chunked LM head for memory-efficient log-prob computation in AsyncGRPOTrainer

AsyncGRPOTrainer now supports a chunked LM-head path that computes per-token log-probs and entropy via online logsumexp without materializing the full [N, V] logits tensor. Combined with completion_mask filtering to skip prompt tokens, this brings massive memory savings on long sequences — up to 44x lower peak-allocated memory on an 8192-token sequence:

| chunk_lm_head_size | Peak Alloc (GB) | Reduction | Wall Time (ms) |
|--------------------|-----------------|-----------|----------------|
| None (baseline)    | 18.55           | 1.00x     | 808.7          |
| 4096               | 0.42            | 44.32x    | 459.0          |
| 8192               | 0.76            | 24.34x    | 393.0          |

Enable it via the new chunk_lm_head_size option in AsyncGRPOConfig:

from trl.experimental.async_grpo import AsyncGRPOConfig, AsyncGRPOTrainer

trainer = AsyncGRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    args=AsyncGRPOConfig(chunk_lm_head_size=4096),
    ...
)

Note: mutually exclusive with use_liger_kernel (both replace the LM head forward pass).

by @AmineDiro in #5349
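The memory trick behind this path is online logsumexp: a running (max, sum) pair is updated chunk by chunk, so the full-vocabulary logits row never has to exist at once. A minimal pure-Python sketch of the idea (the real implementation operates on GPU logits tensors, not Python lists):

```python
import math

def online_logsumexp(logits, chunk_size):
    """Compute log(sum(exp(logits))) over chunks, tracking a running (max, sum)
    pair so no full-vocabulary buffer is ever needed. Pure-Python sketch of the
    idea behind chunk_lm_head_size; the real code works on GPU tensors."""
    running_max = -math.inf
    running_sum = 0.0
    for start in range(0, len(logits), chunk_size):
        chunk = logits[start:start + chunk_size]
        new_max = max(running_max, max(chunk))
        # Rescale the running sum to the new max before adding this chunk.
        running_sum = running_sum * math.exp(running_max - new_max)
        running_sum += sum(math.exp(x - new_max) for x in chunk)
        running_max = new_max
    return running_max + math.log(running_sum)

logits = [0.1 * ((i * 37) % 100) for i in range(1000)]  # stand-in for one token's logits
direct = math.log(sum(math.exp(x) for x in logits))
chunked = online_logsumexp(logits, chunk_size=128)
assert abs(direct - chunked) < 1e-9
```

A token's log-prob is then simply `logits[token_id] - logsumexp`, so the per-token quantities fall out of the chunked pass with no [N, V] tensor materialized.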

{% generation %} support in training chat templates

SFT with assistant_only_loss=True requires chat templates to include {% generation %} / {% endgeneration %} markers so that return_assistant_tokens_mask=True produces correct masks. Very few models ship these markers natively, so users hit a cryptic error when enabling assistant-only loss with models like Qwen3, Llama 3 or GPT-OSS.

SFTTrainer now automatically swaps in a patched training chat template when the original template lacks generation markers — no manual template surgery required. Training templates are shipped for Qwen2.5, Qwen3, Llama 3 and GPT-OSS, stored as standalone .jinja files under trl/chat_templates/ for readability, diffability, and editor syntax highlighting.

from trl import SFTConfig, SFTTrainer

trainer = SFTTrainer(
    model="Qwen/Qwen3-4B",
    args=SFTConfig(assistant_only_loss=True),  # now just works
    train_dataset=dataset,
)
trainer.train()

by @qgallouedec in #5459, #5470, by @RudrenduPaul in #5493 and #5522, and by @casinca in #5484
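Conceptually, the generation markers delimit the assistant-produced spans of the rendered template so a loss mask can be derived from them. A character-level toy sketch of that mechanism (illustrative only; the real templates use {% generation %} Jinja markers and the mask is over token indices, via transformers' return_assistant_tokens_mask):

```python
import re

# Illustrative sketch (not the actual transformers implementation) of what
# generation markers enable: everything between a marker pair is
# assistant-generated and receives loss; everything else is masked out.
def generation_mask(template_output):
    """Return (text, mask) where mask[i] == 1 iff character i lies inside a
    <gen>...</gen> span. Toy stand-in for {% generation %} markers."""
    text, mask, inside = [], [], False
    for piece in re.split(r"(<gen>|</gen>)", template_output):
        if piece == "<gen>":
            inside = True
        elif piece == "</gen>":
            inside = False
        else:
            text.append(piece)
            mask.extend([1 if inside else 0] * len(piece))
    return "".join(text), mask

text, mask = generation_mask("user: hi\nassistant: <gen>hello!</gen>")
assert text == "user: hi\nassistant: hello!"
assert sum(mask) == len("hello!")
```

With assistant_only_loss=True, only positions where this mask is 1 contribute to the cross-entropy loss; prompt and tool tokens are ignored.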

Expanded tool-calling model support

Agent training now supports a broader family of models via native tool-call response schemas:

  • GPT-OSS (#5464)
  • GLM-4-MoE (#5463)
  • Qwen3-VL (#5469)
  • Gemma 4 — the first model to natively ship a response schema (#5454)

A new supports_tool_calling() utility detects whether a tokenizer/processor can render a full tool-calling turn, and GRPOTrainer now validates tool support at initialization — raising a clear error upfront instead of failing cryptically mid-training.

by @qgallouedec in #5462, #5464, #5463, #5469 and #5454

Multimodal tool responses for VLM training

environment_factory tool methods can now return multimodal content blocks (images + text) for VLM training. Previously, tool responses were always converted to str(result), discarding any visual information. Now tools can return content block lists with images, and the trainer handles them end-to-end through tokenization, generation, and the forward pass — including correct pixel_values plumbing.

class ScreenshotEnv:
    def take_screenshot(self) -> list[dict]:
        return [
            {"type": "image", "image": self.browser.screenshot()},
            {"type": "text", "text": "Current page state"},
        ]

The OpenEnv browsergym.py example has been migrated to this pattern, and a new carla_vlm.py example demonstrates VLM training against CARLA with camera-image tool responses.

by @sergiopaniego in #5323 and #5437, and by @qgallouedec in #5448

Built-in reward functions now log extra columns

accuracy_reward and reasoning_accuracy_reward now emit extra diagnostic columns (solution, gold_parsed, answer_parsed) via the log_extra callback introduced in v1.0.0. These show up in the rich completions table, making it much easier to debug why a reward was (or wasn't) assigned.

from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer
from trl.rewards import accuracy_reward

dataset = load_dataset("trl-lib/DeepMath-103K", split="train")

trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=accuracy_reward,
    args=GRPOConfig(log_completions=True),
    train_dataset=dataset,
)
trainer.train()

by @qgallouedec in #5308

Other

Fixes

Read more

v1.0.0

31 Mar 14:15
f3e9ac1

Read our blog post for an overview of TRL v1.

Features

Asynchronous GRPO

Asynchronous GRPO decouples generation from the gradient update loop by offloading rollouts to an external vLLM server. Generation runs in parallel while training continues, eliminating idle GPU time and improving hardware utilization.

from trl.experimental.async_grpo import AsyncGRPOTrainer
from trl.rewards import accuracy_reward
from datasets import load_dataset

dataset = load_dataset("trl-lib/DeepMath-103K", split="train")

trainer = AsyncGRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=accuracy_reward,
    train_dataset=dataset,
)
trainer.train()

by @qgallouedec in #5293

Variational Sequence-Level Soft Policy Optimization (VESPO)

VESPO addresses training instability in off-policy RL caused by policy staleness, asynchronous updates, and train-inference mismatches. Rather than relying on heuristic token-level clipping (GRPO) or sequence-length normalization (GSPO), VESPO derives a principled reshaping kernel from a variational framework. In practice, this yields a smooth, asymmetric Gamma weighting function that gracefully suppresses extreme sequence-level importance weights without introducing length bias. It can be enabled via the loss_type parameter of GRPOConfig:

from trl import GRPOConfig, GRPOTrainer

trainer = GRPOTrainer(
    model="Qwen/Qwen3-0.6B",
    args=GRPOConfig(loss_type="vespo"),
    ...
)

by @casinca in #5199

Divergence Proximal Policy Optimization (DPPO)

DPPO is a new experimental trainer that replaces the standard PPO clipping mechanism with divergence constraints, providing more principled trust-region updates.

by @LeonEricsson in #5117

Self-Distillation Policy Optimization (SDPO)

SDPO is a new experimental trainer that augments on-policy RL with self-distillation from the model's own high-reward trajectories. Instead of using an external teacher, SDPO treats the current model conditioned on feedback as a self-teacher, distilling its feedback-informed predictions back into the policy.

from trl.experimental import SDPOTrainer, SDPOConfig

config = SDPOConfig(
    output_dir="./results",
    num_generations=8,
    success_reward_threshold=1.0,
    use_successful_as_teacher=True,
)

trainer = SDPOTrainer(
    model="Qwen/Qwen2.5-Math-1.5B-Instruct",
    reward_funcs=[accuracy_reward],
    args=config,
    train_dataset=dataset,
)
trainer.train()

by @MengAiDev in #4935

Reward functions can now log extra columns and scalar metrics

Reward functions can return a dictionary of extra values (scalars or per-sample columns) that will be logged alongside the reward. This makes it easier to track intermediate signals without writing custom callbacks.

def my_reward_fn(completions, answer, log_extra=None, log_metric=None, **kwargs):
    extracted = [extract_answer(c) for c in completions]
    rewards = [1.0 if e == a else 0.0 for e, a in zip(extracted, answer)]

    if log_extra:
        log_extra("golden_answer", list(answer))
        log_extra("extracted_answer", extracted)

    if log_metric:
        log_metric("accuracy", sum(rewards) / len(rewards))

    return rewards

by @manueldeprada in #5233

Tool calling support in VLLMClient.chat()

VLLMClient.chat() now supports tool calling, enabling agentic workflows directly through the vLLM client interface.

by @kansalaman in #4889

35% faster packing

Best-fit-decreasing (BFD) packing is now 35% faster. The "bfd-requeue" packing strategy has also been renamed to "bfd_split"; see MIGRATION.md for details.

by @mariosasko in #5189
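For reference, best-fit decreasing is the textbook bin-packing heuristic: sort sequences longest-first and place each into the fullest bin that still fits. A minimal sketch of the algorithm itself (not TRL's optimized implementation):

```python
# Minimal sketch of best-fit-decreasing (BFD) packing, the strategy used to
# pack variable-length sequences into fixed-size training bins. Textbook form,
# not TRL's optimized implementation.

def bfd_pack(lengths, bin_size):
    """Pack sequence lengths into bins of `bin_size`: visit longest sequences
    first, placing each in the bin with the smallest remaining capacity that
    still fits (best fit)."""
    bins = []  # each bin: [remaining_capacity, [lengths...]]
    for length in sorted(lengths, reverse=True):
        best = min(
            (b for b in bins if b[0] >= length),  # bins that can still fit it
            key=lambda b: b[0],                   # tightest fit wins
            default=None,
        )
        if best is None:
            best = [bin_size, []]
            bins.append(best)
        best[0] -= length
        best[1].append(length)
    return [b[1] for b in bins]

packed = bfd_pack([7, 5, 4, 3, 1], bin_size=8)
assert all(sum(b) <= 8 for b in packed)
```

In TRL, a "bin" is one packed training sequence of max_length tokens, so tighter packing directly means fewer padding tokens per batch.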

[GKD] Buffer implementation and vLLM inference for distillation trainer

The GKD/GOLD trainer now supports buffered rollout generation, decoupling generation from gradient updates for more efficient distillation. vLLM inference support has also been added to the base self-distillation trainer.

by @cmpatino in #5137 and #5388

v0 → v1 migration guide

A MIGRATION.md guide has been added covering all breaking changes when upgrading from TRL v0 to v1. If you're already on v0.29, the changes are minimal.

by @qgallouedec in #5255

Other

Fixes

  • Fix DPOTrainer collators to truncate sequences before padding by @albertvillanova in #5305
  • Prevent corruption of DPO VLM training if "keep_end" truncation_mode by @albertvillanova in #5286
  • Fix mm_token_type_ids silently dropped in DPO VLM training by @albertvillanova in #5279
  • Fix UNEXPECTED lm_head.weight warning when loading a CausalLM as a reward model by @albertvillanova in #5295
  • Fix accuracy_reward crash when called from non-main thread by @qgallouedec in #5281
  • Fix GRPOTrainer attribute access for vLLM model config by @falcondai in #5302
  • [GRPO] Fix re-tokenization bug in tool-calling loop by @qgallouedec in #5242
  • [CPO/ORPO] Fix handling of different length chosen/rejected prompts by @davmels in #4639
  • Fix RewardFunc type alias to reflect actual calling convention by @s-zx in #5246
  • fix(ppo): add gradient_checkpointing_enable/disable to PolicyAndValueWrapper by @s-zx in #5245
  • Fix prepare_multimodal_messages to support tool_calls and tool role by @alvarobartt in #5212
  • Fix support for model_init_kwargs when passed as CLI JSON string by @albertvillanova in #5230
  • Fix support for model_init_kwargs in MiniLLM when passed as CLI JSON string by @albertvillanova in #5274
  • Fix support for model_init_kwargs in GKD/GOLD when passed as CLI JSON string by @albertvillanova in #5266
  • Sync entire prompt/completion token tensors before indexing by @shawnghu in #5218
  • Clean up model update group on worker exit by @AmineDiro in #5325
  • Fix prefix EOS slicing for tool suffix (with Qwen3/3.5 chat templates) by @casinca in #5330
  • Fix: apply reward_weights to logged reward/reward_std in GRPOTrainer by @lailanelkoussy in #5353
  • Fix IDs shape mismatch in SFT for VLMs with text-only by @albertvillanova in #5354

Documentation and Examples

Read more

v1.0.0rc1

20 Mar 23:55

Pre-release

Features

Variational Sequence-Level Soft Policy Optimization (VESPO)

VESPO addresses training instability in off-policy RL caused by policy staleness, asynchronous updates, and train-inference mismatches. Rather than relying on heuristic token-level clipping (GRPO) or sequence-length normalization (GSPO), VESPO derives a principled reshaping kernel from a variational framework. In practice, this yields a smooth, asymmetric Gamma weighting function that gracefully suppresses extreme sequence-level importance weights without introducing length bias. It can be enabled via the loss_type parameter of GRPOConfig:

from trl import GRPOConfig, GRPOTrainer

trainer = GRPOTrainer(
    model="Qwen/Qwen3-0.6B",
    args=GRPOConfig(loss_type="vespo"),
    ...
)

by @casinca in #5199

Divergence Proximal Policy Optimization (DPPO)

DPPO is a new experimental trainer that replaces the standard PPO clipping mechanism with divergence constraints, providing more principled trust-region updates.

by @LeonEricsson in #5117

Reward functions can now log extra columns and scalar metrics

Reward functions can return a dictionary of extra values (scalars or per-sample columns) that will be logged alongside the reward. This makes it easier to track intermediate signals without writing custom callbacks.

def my_reward_fn(completions, answer, log_extra=None, log_metric=None, **kwargs):
    extracted = [extract_answer(c) for c in completions]
    rewards = [1.0 if e == a else 0.0 for e, a in zip(extracted, answer)]

    if log_extra:
        log_extra("golden_answer", list(answer))
        log_extra("extracted_answer", extracted)

    if log_metric:
        log_metric("accuracy", sum(rewards) / len(rewards))

    return rewards

by @manueldeprada in #5233

Tool calling support in VLLMClient.chat()

VLLMClient.chat() now supports tool calling, enabling agentic workflows directly through the vLLM client interface.

by @kansalaman in #4889

35% faster packing

Best-fit-decreasing (BFD) packing is now 35% faster. The "bfd-requeue" packing strategy has also been renamed to "bfd_split"; see MIGRATION.md for details.

by @mariosasko in #5189

[GKD] Buffer implementation for distillation trainer

The GKD/GOLD trainer now supports buffered rollout generation, decoupling generation from gradient updates for more efficient distillation.

by @cmpatino in #5137

v0 → v1 migration guide

A MIGRATION.md guide has been added covering all breaking changes when upgrading from TRL v0 to v1. If you're already on v0.29, the changes are minimal.

by @qgallouedec in #5255

Other

Fixes

  • Fix DPOTrainer collators to truncate sequences before padding by @albertvillanova in #5305
  • Prevent corruption of DPO VLM training if "keep_end" truncation_mode by @albertvillanova in #5286
  • Fix mm_token_type_ids silently dropped in DPO VLM training by @albertvillanova in #5279
  • Fix UNEXPECTED lm_head.weight warning when loading a CausalLM as a reward model by @albertvillanova in #5295
  • Fix accuracy_reward crash when called from non-main thread by @qgallouedec in #5281
  • Fix GRPOTrainer attribute access for vLLM model config by @falcondai in #5302
  • [GRPO] Fix re-tokenization bug in tool-calling loop by @qgallouedec in #5242
  • [CPO/ORPO] Fix handling of different length chosen/rejected prompts by @davmels in #4639
  • Fix RewardFunc type alias to reflect actual calling convention by @s-zx in #5246
  • fix(ppo): add gradient_checkpointing_enable/disable to PolicyAndValueWrapper by @s-zx in #5245
  • Fix prepare_multimodal_messages to support tool_calls and tool role by @alvarobartt in #5212
  • Fix support for model_init_kwargs when passed as CLI JSON string by @albertvillanova in #5230
  • Fix support for model_init_kwargs in MiniLLM when passed as CLI JSON string by @albertvillanova in #5274
  • Fix support for model_init_kwargs in GKD/GOLD when passed as CLI JSON string by @albertvillanova in #5266
  • Sync entire prompt/completion token tensors before indexing by @shawnghu in #5218
  • Clean up model update group on worker exit by @AmineDiro in #5325

Documentation and Examples

What's Changed

Read more

v0.29.1

20 Mar 03:57

What's Changed

  • Handle mm_token_type_ids in SFT/GRPO/RLOO to fix IndexError by @albertvillanova in #5178
  • Fix prepare_multimodal_messages to support tool_calls and tool role by @alvarobartt in #5212
  • Fix type for model_init_kwargs when passed as CLI JSON string by @albertvillanova in #5230
  • Decouple rollout dispatch from vLLM backend in GRPO _generate_single_turn by @albertvillanova in #5122
  • Simplify logic for structured outputs across vLLM versions by @albertvillanova in #5215
  • Add support for raw ids in prompts in vLLM client and server by @qgallouedec in #5225
  • Add VLM support when passing raw token IDs to vLLM client by @qgallouedec in #5227
  • Move rollout_func from _generate_single_turn to _generate by @qgallouedec in #5232
  • [GRPO/RLOO] Tokenize before vLLM generation call by @qgallouedec in #5238
  • Support JSON string parsing of teacher_model_init_kwargs in MiniLLMConfig by @albertvillanova in #5259
  • [GRPO/RLOO] Unify tokenization across all generation backends in _generate_single_turn by @qgallouedec in #5239
  • [GRPO/RLOO] Extract tokenize prompts from _generate_single_turn by @qgallouedec in #5240
  • [CPO/ORPO] Fix handling of different length chosen/rejected prompts. by @davmels in #4639
  • Fix type for teacher_model_init_kwargs when passed as CLI JSON string by @albertvillanova in #5258
  • Fix support for model_init_kwargs in GKD/GOLD when passed as CLI JSON string by @albertvillanova in #5266
  • Fix mm_token_type_ids silently dropped in DPO VLM training by @albertvillanova in #5279
  • Fix support for model_init_kwargs in MiniLLM when passed as CLI JSON string by @albertvillanova in #5274
  • Fix GRPOTrainer attribute access for vLLM model config by @falcondai in #5302
  • [GRPO] Fix re-tokenization bug in tool-calling loop by concatenating token IDs by @qgallouedec in #5242

New Contributors

Full Changelog: v0.29.0...v0.29.1

v0.29.0

25 Feb 22:38
d24e194

Features

Add environment_factory to GRPOTrainer

GRPOTrainer now accepts an environment_factory argument, allowing users to specify a custom environment class for training. This enables more flexible and diverse training scenarios by letting users define their own environments with specific dynamics and reward structures.

from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

dataset = Dataset.from_dict({
    "prompt": [[{"role": "user", "content": f"Increment the counter by {i}."}] for i in range(1, 7)]
})

def reward_func(environments, **kwargs):
    return [env.counter for env in environments]

class IncrementEnv:
    def reset(self):
        self.counter = 0

    def increment(self, step: int) -> int:
        """
        Increment the internal counter.

        Args:
            step: Value to add to the counter.

        Returns:
            The updated counter value.
        """
        self.counter += step
        return self.counter

trainer = GRPOTrainer(
    model="Qwen/Qwen3-0.6B",
    args=GRPOConfig(chat_template_kwargs={"enable_thinking": False}),
    train_dataset=dataset,
    reward_funcs=reward_func,
    environment_factory=IncrementEnv,
)
trainer.train()

by @qgallouedec in #5093

Skills

TRL introduces agent-native CLI integration: trl-training, a first-class Agent Skill that exposes TRL's training workflows (SFT, DPO, GRPO, etc.) in a structured, agent-readable format. The skill is packaged directly with the trl library and can be installed via the CLI:

# Install into the project's agent directory (default scope=project), by agent name: claude, codex, opencode
trl skills install trl-training --target <agent>

This enables AI agents to safely and reproducibly execute TRL training workflows using a well-defined interface.

Skills can be installed at the project or global scope, and support explicit targets and overwrite controls.

Other

Fixes

Documentation and Examples

Deprecations

CI Improvements

  • Upgrade GitHub Actions to latest versions by @salmanmkc in #4893
  • Remove duplicated tests for SFT and add gradient checkpointing tests by @qgallouedec in #5054
  • Up...
Read more

v0.28.0

10 Feb 13:28
49ef334

Features

Experimental

Fixes

Documentation and Examples

Deprecations

CI Improvements

Miscellaneous

What's Changed

Read more

v0.27.2

03 Feb 18:10

What's Changed

  • Remove access to warnings_issued by @qgallouedec in #4960
  • Fix SFTTrainer init logic: remove TrainingArguments.push_to_hub_token only for transformers < v5 by @albertvillanova in #4942
  • Fix extra EOS appended in DPO preprocessing for conversational data by @qgallouedec in #4908

Full Changelog: v0.27.1...v0.27.2

v0.27.1

24 Jan 03:42

What's Changed

  • Fix: undefined current_gradient_accumulation_steps by @qgallouedec in #4852
  • fix(DeepSeek OPSM): passing correct (vLLM) logprobs by @casinca in #4857
  • Fix SFT training for prompt-completion type and transformers v5 by @qgallouedec in #4880
  • Bugfix: Logprob drift in vLLM serving mode (compared to colocate mode) by @kdubovikov in #4873
  • Fix RewardTrainer's results not reproducible by @liyc-ai in #4887

New Contributors

Full Changelog: v0.27.0...v0.27.1

v0.27.0

16 Jan 02:34
17acd61

Features

  • Add vllm_group_port argument to GRPO, RLOO and OnlineDPO configuration by @pointerhacker in #4545
  • Preserve truncated tokens in BFD packing by @qgallouedec in #4632
  • Support async reward functions and parallelize call to reward functions. by @pramodith in #4567
  • RLOO supports async rewards. by @pramodith in #4718
  • Support vLLM 0.12.0 by @jiqing-feng in #4117
  • feat: DeepSeek V3.2 Off-policy sequence masking by @casinca in #4689
  • 🎭 Up to 50% less VRAM during forward with forward_masked_logits function by @qgallouedec in #4729
  • [GRPO] Add a config to limit the number of tool calling iterations by @pramodith in #4761
  • Switch gradient checkpointing default to use_reentrant=False (PyTorch recommended) by @qgallouedec in #4811
  • Add support for GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization by @nbasyl in #4785

Experimental

  • Move AutoModelForCausalLMWithValueHead and AutoModelForSeq2SeqLMWithValueHead to experimental by @qgallouedec in #4654
  • Move DPODataCollatorWithPadding to experimental.utils by @qgallouedec in #4667
  • Move DataCollatorForChatML to experimental.utils by @qgallouedec in #4668
  • Move add_bos_token_if_needed and add_eos_token_if_needed to experimental.utils by @qgallouedec in #4674
  • Move truncate_right and SIMPLE_CHAT_TEMPLATE to experimental.utils by @qgallouedec in #4677
  • Move prepare_model_for_kbit_training, enable_gradient_checkpointing, prepare_peft_model to experimental.utils by @qgallouedec in #4704
  • Move get_reward function to experimental.utils by @qgallouedec in #4683
  • Remove experimental imports from testing_utils by @albertvillanova in #4727
  • ORPO: Avoid catastrophic cancellation in loss function by @hartmans in #4763
  • Refactor KTO [1/N]: Modernize model initialization by @albertvillanova in #4783
  • [GOLD] add probability merging fix to implement chain rule by @kashif in #4765
  • Refactor KTO coordinated with DPO [a/N]: Remove encoder-decoder support by @albertvillanova in #4792
  • Refactor KTO coordinated with DPO [b/N]: Simplify truncation logic by @albertvillanova in #4808

Fixes

  • Accounting for case num_generations_eval=1 in the calculation of the advantage by @qgallouedec in #4662
  • Fix vLLM error for tools usage not supported when running GRPO training by @apalmas-saifh in #4663
  • Fix GRPO config validation in case num_generations_eval is specified and different than num_generations by @apalmas-saifh in #4682
  • Fix top_k default value to 0 for disabling top-k filtering by @albertvillanova in #4695
  • Include generation_config for tiny model uploads by @qgallouedec in #4643
  • Fix KeyError with transformers 5.0.0+ where push_to_hub_token is removed by @Manodeepray in #4691
  • Overwrite model default generation config used by model.generate by @albertvillanova in #4647
  • Fix: handle multiple tool calls in qwen3_schema by @mattbui in #4709
  • Fix bugs when using multi-gpu: dataset streaming for offline trainers + dtype initialization by @kaixuanliu in #3950
  • Ensure llm-blender is importable with transformers >= v5 by @albertvillanova in #4781
  • Monkey patch for HybridCache in Liger-Kernel with transformers v5 by @qgallouedec in #4798
  • [fix] GRPOTrainer: proper access args by @carlyou in #4801
  • Fix vllm compat patches to be applied only to affected versions by @albertvillanova in #4815
  • fix bug when sft calc outputs.token_accuracy by @kaixuanliu in #4814
  • fix xpu vllm client server by @jiqing-feng in #4780

Documentation and Examples

Deprecations

CI Improvements

Read more