REINFORCE Leave-One-Out (RLOO)
REINFORCE Leave-One-Out (RLOO) is a reinforcement learning algorithm based on the classic REINFORCE policy-gradient method. It constructs an advantage baseline via the Leave-One-Out (LOO) technique, which yields an unbiased advantage estimate.
Algorithm Overview
For clarity, we explain RLOO by contrasting it with GRPO (Group Relative Policy Optimization).
Both GRPO and RLOO estimate advantages via intra-group comparisons to avoid the high variance of a global baseline. They differ mainly in the following two aspects:
Difference 1: How the Advantage Baseline Is Constructed
1. GRPO (Group Relative Policy Optimization)
For each prompt, GRPO generates \(G\) response samples and normalizes rewards using the group mean and standard deviation:

\[
A_i = \frac{R_i - \text{mean}(\{R_j\}_{j=1}^G)}{\text{std}(\{R_j\}_{j=1}^G)}
\]

where:

- \(R_i\) is the reward of the \(i\)-th sample
- \(\text{mean}(\{R_j\}_{j=1}^G) = \frac{1}{G}\sum_{j=1}^G R_j\) is the group mean
- \(\text{std}(\{R_j\}_{j=1}^G)\) is the group standard deviation
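As an illustration, here is a minimal NumPy sketch of this normalization (not the trainer's actual implementation; the small `eps` for numerical stability is our assumption):

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Group-normalized advantages for one prompt's G sampled responses."""
    # eps (our addition) guards against a zero standard deviation
    return (rewards - rewards.mean()) / (rewards.std() + eps)

rewards = np.array([1.0, 0.0, 0.5, 1.0])  # G = 4 rewards for one prompt
print(grpo_advantages(rewards))           # approx. [ 0.904 -1.508 -0.302  0.904]
```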
2. RLOO (REINFORCE Leave-One-Out)
For each prompt, RLOO generates \(K\) response samples and constructs the baseline via Leave-One-Out, i.e., for the \(i\)-th sample, the baseline is the mean of the other \(K-1\) samples:

\[
A_i = R_i - \frac{1}{K-1}\sum_{j \neq i} R_j
\]

This can be equivalently rewritten as:

\[
A_i = \frac{K}{K-1}\left(R_i - \bar{R}\right)
\]

where \(\bar{R} = \frac{1}{K}\sum_{j=1}^K R_j\) is the group mean reward.
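A minimal NumPy sketch of both forms (illustrative only); the assertion checks that the leave-one-out baseline and the closed form are identical:

```python
import numpy as np

def rloo_advantages(rewards: np.ndarray) -> np.ndarray:
    """Leave-one-out advantages: each sample's baseline is the mean of the others."""
    K = len(rewards)
    loo_baseline = (rewards.sum() - rewards) / (K - 1)  # mean of the other K-1 rewards
    return rewards - loo_baseline

rewards = np.array([1.0, 0.0, 0.5, 1.0])  # K = 4 rewards for one prompt
adv = rloo_advantages(rewards)
# Closed form: A_i = K/(K-1) * (R_i - mean(R))
closed = len(rewards) / (len(rewards) - 1) * (rewards - rewards.mean())
assert np.allclose(adv, closed)
print(adv)  # [ 0.5 -0.8333 -0.1667  0.5]
```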
Note: We use \(K\) here to match the notation in the paper; it has the same meaning as \(G\) in GRPO and corresponds to the configuration parameter `num_generations`.
Why Leave-One-Out?
The key advantage is unbiasedness. For the \(i\)-th sample, its reward \(R_i\) is independent of the baseline \(\frac{1}{K-1}\sum_{j \neq i} R_j\), hence the advantage estimate is unbiased. In contrast, using the mean including itself as the baseline introduces bias.
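To make the argument explicit (our addition, using the standard score-function identity \(\mathbb{E}_{y_i \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(y_i \mid x)\right] = 0\)):

\[
\mathbb{E}\big[b_i \, \nabla_\theta \log \pi_\theta(y_i \mid x)\big]
= \mathbb{E}[b_i] \cdot \mathbb{E}\big[\nabla_\theta \log \pi_\theta(y_i \mid x)\big]
= \mathbb{E}[b_i] \cdot 0 = 0
\]

The factorization holds because \(b_i = \frac{1}{K-1}\sum_{j \neq i} R_j\) is computed only from the other samples and is therefore independent of \(y_i\). If the baseline instead included \(R_i\), it would be correlated with \(y_i\), the factorization would fail, and the baseline term would contribute a nonzero (biased) component to the gradient estimate.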
Difference 2: How KL Regularization Is Applied
To prevent the policy from drifting too far from the reference policy, both algorithms introduce KL divergence regularization, but in different ways:
**GRPO**: Adds KL divergence as an independent regularization term to the loss:

\[
\mathcal{L}_{\text{GRPO}} = \mathcal{L}_{\text{policy}} + \beta\, \mathbb{D}_{\text{KL}}\left[\pi_\theta \,\|\, \pi_{\text{ref}}\right]
\]

**RLOO**: Integrates KL divergence directly into the reward, constructing a modified reward:

\[
\tilde{R}_i = R_i - \beta\, \mathbb{D}_{\text{KL}}\left[\pi_\theta \,\|\, \pi_{\text{ref}}\right]
\]

where \(\beta\) is the KL coefficient (parameter `beta`), and \(\pi_{\text{ref}}\) is the reference policy (typically an SFT model or the initial policy).
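Below is a PyTorch sketch of the KL-in-reward construction, using the simple per-token estimator \(\log \pi_\theta - \log \pi_{\text{ref}}\) summed over the response tokens (the implementation's actual KL estimator may differ; all names here are illustrative):

```python
import torch

def kl_modified_rewards(rewards, logps_policy, logps_ref, response_mask, beta=0.05):
    """Fold a sequence-level KL estimate into each sample's reward.

    logps_policy / logps_ref: (batch, seq_len) per-token log-probs of the sampled
    tokens under the current policy and the reference policy.
    response_mask: (batch, seq_len), 1 for response tokens, 0 for prompt/padding.
    """
    # Simple per-token log-prob difference, summed over the response tokens
    per_token_kl = (logps_policy - logps_ref) * response_mask
    seq_kl = per_token_kl.sum(dim=-1)   # (batch,)
    return rewards - beta * seq_kl      # modified reward: R_i - beta * KL_i

# Toy usage with random log-probs (batch=2, seq_len=5)
rewards = torch.tensor([1.0, 0.0])
lp, lr = torch.randn(2, 5), torch.randn(2, 5)
mask = torch.ones(2, 5)
print(kl_modified_rewards(rewards, lp, lr, mask))
```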
Parameter Configuration
RLOO training can be enabled on top of `GRPOTrainer` by setting the following parameters:
```bash
# Basic RLOO configuration
--advantage_estimator rloo  # Use RLOO's leave-one-out advantage estimator
--kl_in_reward true         # Integrate KL divergence into the reward (default for RLOO)
```
You can refer to this script for training.
Important Parameters
- `--advantage_estimator`: Choose the advantage estimator
  - `grpo` (default): standardize using the group mean and standard deviation
  - `rloo`: construct the baseline via Leave-One-Out
- `--kl_in_reward`: Controls where the KL term is applied
  - `false`: KL as a separate regularization term in the loss (GRPO style)
  - `true`: subtract KL directly from the reward to form a modified reward (RLOO style)
- `--num_generations`: Number of samples per prompt, i.e., \(K\)
- `--beta`: KL regularization coefficient \(\beta\); controls how conservatively the policy updates
Other parameters are consistent with the GRPO arguments.
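Putting the pieces together, here is a trainer-agnostic PyTorch sketch of one RLOO policy-gradient step for a single prompt's group of \(K\) = `num_generations` responses (a simplified illustration under the same k1-KL assumption as above, not the `GRPOTrainer` code path):

```python
import torch

def rloo_step_loss(rewards, logps, logps_ref, response_mask, beta=0.05):
    """One-prompt RLOO loss: KL-shaped rewards, LOO baseline, REINFORCE.

    rewards: (K,) scalar rewards for the K responses to one prompt.
    logps / logps_ref: (K, seq_len) per-token log-probs under policy / reference.
    response_mask: (K, seq_len) marks response tokens.
    """
    K = rewards.shape[0]
    # 1. Fold the KL estimate into the reward (kl_in_reward=true).
    seq_kl = ((logps - logps_ref) * response_mask).sum(dim=-1)
    modified = rewards - beta * seq_kl
    # 2. Leave-one-out baseline and advantage from the modified rewards.
    baseline = (modified.sum() - modified) / (K - 1)
    adv = (modified - baseline).detach()  # reward signal is a constant w.r.t. theta
    # 3. REINFORCE: weight each sequence's log-prob by its advantage.
    seq_logp = (logps * response_mask).sum(dim=-1)
    return -(adv * seq_logp).mean()
```

The advantages are detached so that the KL term influences the update only through the shaped reward, mirroring the reward-shaping view described in Difference 2.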