context-bootstrapped-rl/cbrl

Context Bootstrapped Reinforcement Learning (CBRL)

Reinforcement Learning from Verifiable Rewards (RLVR) has driven remarkable advances in LLM reasoning, but suffers from a fundamental problem: exploration inefficiency. When a model cannot reliably produce correct rollouts — whether due to limited pretraining coverage, task complexity, or unfamiliar problem structure — it receives minimal learning signal, leading to slow or failed convergence.

CBRL addresses this by leveraging in-context learning to bootstrap exploration. A small bank of few-shot examples is dynamically injected into training prompts with decreasing probability. Early in training, these examples guide the model toward successful rollouts, providing the learning signal needed to acquire new capabilities. As training advances, the injection probability anneals to zero, and the model learns to perform independently — the bootstrapped behaviors persist without long-term dependence on in-context examples.

CBRL is algorithm-agnostic, yielding consistent gains with both GRPO and RLOO across all tested settings.
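The exploration problem described above can be made concrete with a small sketch of the two advantage estimators. This is an illustrative simplification, not the trainer code in this repository: the function names are hypothetical, rewards are assumed to be binary, and the GRPO variant here normalizes by the population standard deviation. When every rollout in a group fails, both estimators return all-zero advantages, so the policy gradient carries no signal.

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std, chosen here for simplicity
    return [(r - mu) / (sigma + eps) for r in rewards]

def rloo_advantages(rewards):
    """Leave-one-out baseline: each reward minus the mean of the other rewards."""
    n = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]

all_fail = [0.0, 0.0, 0.0, 0.0]      # no rollout solved the task
one_success = [0.0, 0.0, 1.0, 0.0]   # a single rollout succeeded

# Every advantage is 0.0 when the whole group fails: nothing to learn from.
print(grpo_advantages(all_fail), rloo_advantages(all_fail))

# One success is enough to produce non-zero advantages for the whole group.
print(grpo_advantages(one_success), rloo_advantages(one_success))
```

Under these assumptions, anything that raises the chance of at least one correct rollout per group (such as in-context examples) turns a zero-gradient group into a useful one.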

Key Results

We validate CBRL on two families of tasks:

  • Reasoning Gym — 5 procedural reasoning tasks (ARC-1D, Word Sorting, Spell Backward, Matrix Manipulation, Puzzle-24) with Qwen and Llama models. CBRL improves accuracy on all 10 model-environment pairs, with gains from +1.3% to +22.3%.
  • Q Programming — A domain-specific language for time-series databases, featuring right-to-left evaluation, implicit typing, and terse array-oriented syntax that diverges from conventions in pretraining corpora. Using qqWen-7B (Morgan Stanley), CBRL improves test-pass rate (27.3% → 43.0%) and Pass@1 (5.0% → 26.3%) over standard GRPO.
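To illustrate why Q's evaluation order is unfamiliar to models trained mostly on mainstream languages: q has no operator precedence and evaluates expressions right to left, so `2*3+4` yields 14 rather than 10. The Python sketch below mimics that rule for flat integer expressions. The helper name and token format are hypothetical and exist only for this illustration.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def eval_right_to_left(tokens):
    """Evaluate a flat list such as [2, "*", 3, "+", 4] right to left, with no precedence."""
    result = tokens[-1]
    # Walk leftwards over (operand, operator) pairs: each step applies
    # operand <op> (value of everything already evaluated to its right).
    for i in range(len(tokens) - 2, 0, -2):
        op, left = tokens[i], tokens[i - 1]
        result = OPS[op](left, result)
    return result

q_style = eval_right_to_left([2, "*", 3, "+", 4])  # grouped as 2 * (3 + 4)
python_style = 2 * 3 + 4                            # grouped as (2 * 3) + 4
print(q_style, python_style)
```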

Repository Structure

├── cbrl/                   # CBRL training library (built on verl)
│   ├── trainers/           # GRPO trainer with few-shot injection
│   └── utils/              # RL dataset with example injection
├── evaluations/            # Evaluation scripts (pass@k, few-shot eval)
├── data_preprocess/        # Dataset preparation scripts
├── config/                 # Hydra training configs
├── scripts/                # Training and evaluation scripts (SLURM + bash)
├── q_reward.py             # Q program execution reward function
├── reasoning_gym/          # Reasoning Gym experiments (separate README)
└── legacy/                 # Archived code

Setup

Prerequisites

  • Python 3.10+
  • CUDA-compatible GPUs (tested on NVIDIA GH200)
  • verl (RL training framework)
  • KDB/Q interpreter (for Q programming experiments)

Installation

git clone <repository-url>
cd cbrl

# Install verl (we used v0.3.0.post2 with vLLM 0.8.5)
pip install verl

# Install dependencies
pip install torch==2.6.0 vllm==0.8.5 transformers pyarrow tqdm

# Install flash attention
pip install --no-build-isolation flash-attn==2.7.4.post1

Environment Variables

Set these once before running any scripts. The data preparation scripts will write outputs into $DATA_DIR, and training/eval scripts read from it.

# Required
export DATA_DIR=/path/to/data              # All datasets and few-shot files go here
export CHECKPOINT_DIR=/path/to/checkpoints # Training checkpoints saved here
export HF_HOME=/path/to/huggingface_cache  # HuggingFace model cache

# Q programming experiments
export QHOME=~/q                           # KDB/Q installation
export Q_INTERPRETER_PATH=$QHOME/l64arm/q  # Path to Q binary

# SLURM users only
export SLURM_ACCOUNT=<your-account>        # Your cluster allocation
export SLURM_PARTITION=<your-partition>     # Your cluster partition
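Before launching a job, it can help to confirm that the variables above are actually set. The following is a minimal sketch of such a check; it is not part of the repository, and the helper name `missing_vars` is hypothetical. It only reports which variables are unset or empty and does not validate the paths themselves.

```python
import os

# Variable names taken from the list above.
REQUIRED_VARS = ("DATA_DIR", "CHECKPOINT_DIR", "HF_HOME")
Q_VARS = ("QHOME", "Q_INTERPRETER_PATH")

def missing_vars(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

for label, names in (("required", REQUIRED_VARS), ("Q experiments", Q_VARS)):
    missing = missing_vars(names)
    status = "ok" if not missing else "missing: " + ", ".join(missing)
    print(f"{label}: {status}")
```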

How CBRL Works

During RL training, CBRL prepends curated solved examples to the prompt before the model generates a response:

  1. Example selection: A configurable number of few-shot examples (default: 2) are sampled from a bank and prepended to the training prompt.
  2. Probability annealing: Injection probability starts high (e.g., 0.5) and decays linearly to 0.0 over training. Early on, a large share of prompts (about half at 0.5) include demonstrations; by the end, no prompt does, and the model solves problems entirely on its own.

This schedule provides strong learning signal when the model needs it most, then removes the scaffolding so the model internalizes the reasoning patterns rather than relying on in-context examples.
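The two steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation in `cbrl/`: the function names, the plain-string example bank, and the blank-line separator are all hypothetical, and the defaults (2 examples, start probability 0.5) simply mirror the values quoted above.

```python
import random

def injection_probability(step, total_steps, p_start=0.5, p_end=0.0):
    """Linearly anneal from p_start at step 0 to p_end at total_steps."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return p_start + (p_end - p_start) * frac

def maybe_inject(prompt, bank, step, total_steps, k=2, rng=random):
    """With the annealed probability, prepend k examples sampled from the bank."""
    if rng.random() >= injection_probability(step, total_steps):
        return prompt  # no injection: the model sees the bare task
    shots = rng.sample(bank, k=min(k, len(bank)))
    return "\n\n".join(shots + [prompt])

# Hypothetical bank of already-formatted solved examples.
bank = ["Q: 1+1\nA: 2", "Q: 2+2\nA: 4", "Q: 3+3\nA: 6"]
for step in (0, 50, 100):
    print(step, injection_probability(step, total_steps=100))
```

At the final step the probability is exactly 0.0, so `maybe_inject` always returns the prompt unchanged, which matches the requirement that the trained model no longer depends on in-context examples.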

Q Programming Experiments

Data Preparation

Prepare the Q programming dataset (writes train/test parquets into $DATA_DIR/sft_python_q_problems/):

python3 -m data_preprocess.sft_python_q_problems \
    --output_dir $DATA_DIR/sft_python_q_problems

Generate few-shot examples for CBRL (writes $DATA_DIR/qprog_fewshots.json):

python3 -m data_preprocess.make_qprog_fewshots
python3 -m data_preprocess.annotate_qprog_fewshots_tags \
    --parquet_dir $DATA_DIR/sft_python_q_problems \
    --fewshots_path $DATA_DIR/qprog_fewshots.json

Training

Training uses GRPO via the verl framework. All scripts should be run from the repository root.

With SLURM:

Note: Update <your-account> and <your-partition> in scripts/submit_*.sbatch to match your cluster, or override via sbatch --account=... --partition=....

# Baseline GRPO
sbatch --export=ALL,MODEL=morganstanley/qqWen-7B-pretrain,EXPERIMENT_NAME=baseline,FORMAT=raw \
    scripts/submit_grpo.sbatch

# CBRL GRPO (with few-shot injection)
sbatch --export=ALL,MODEL=morganstanley/qqWen-7B-pretrain,EXPERIMENT_NAME=cbrl,FORMAT=raw,CBRL=true,FEWSHOT_JSON=$DATA_DIR/qprog_fewshots.json \
    scripts/submit_grpo.sbatch

Without SLURM:

# Baseline GRPO
MODEL=morganstanley/qqWen-7B-pretrain EXPERIMENT_NAME=baseline FORMAT=raw \
    bash scripts/run_grpo.sh

# CBRL GRPO (with few-shot injection)
MODEL=morganstanley/qqWen-7B-pretrain EXPERIMENT_NAME=cbrl FORMAT=raw CBRL=true \
    FEWSHOT_JSON=$DATA_DIR/qprog_fewshots.json \
    bash scripts/run_grpo.sh

| Parameter         | Description                           | Default                    |
|-------------------|---------------------------------------|----------------------------|
| MODEL             | HuggingFace model path                | (required)                 |
| EXPERIMENT_NAME   | W&B experiment name                   | (required)                 |
| FORMAT            | raw (pretrain) or chat (instruct)     | chat                       |
| CBRL              | Enable few-shot injection             | false                      |
| CHECKPOINT_PREFIX | Checkpoint directory prefix           | grpo                       |
| FEWSHOT_JSON      | Path to few-shot JSON                 | (required if CBRL=true)    |
| NUM_GPUS          | Number of GPUs (bash only)            | 4                          |

Evaluation

Evaluate a trained checkpoint (converts FSDP to HF format, then runs pass@k):

# SLURM
sbatch --export=ALL,CHECKPOINT_NAME=<name>,STEP=<step> scripts/submit_eval.sbatch

# Bash
CHECKPOINT_DIR=$CHECKPOINT_DIR/<name> STEP=512 bash scripts/run_eval.sh

Evaluate a HuggingFace model directly:

# SLURM
sbatch --export=ALL,MODEL=morganstanley/qqWen-7B-pretrain,EVAL_NAME=pretrain scripts/submit_eval_baseline.sbatch

# Bash
MODEL=morganstanley/qqWen-7B-pretrain EVAL_NAME=pretrain \
    PARQUET_PATH=$DATA_DIR/sft_python_q_problems/test.parquet \
    bash scripts/run_eval_baseline.sh

Add FEWSHOT=true and FEWSHOT_JSON=$DATA_DIR/qprog_fewshots.json to evaluate with few-shot prompting.
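For reference, pass@k is commonly computed with the unbiased estimator 1 - C(n-c, k) / C(n, k), where n is the number of samples generated per problem and c is the number that pass. The sketch below shows that estimator. Whether the scripts in `evaluations/` use exactly this formula is an assumption, and the function name is hypothetical.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if n - c < k:
        # Fewer than k incorrect samples: any k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Per-problem estimates are averaged over the test set (hypothetical counts).
results = [(10, 3), (10, 0), (10, 10)]  # (samples, correct) per problem
score = sum(pass_at_k(n, c, k=1) for n, c in results) / len(results)
print(score)
```

With k=1 the estimator reduces to c / n, the fraction of samples that pass.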

Reasoning Gym Experiments

See reasoning_gym/README.md for setup, training, and evaluation of the Reasoning Gym experiments across ARC-1D, Word Sorting, Spell Backward, Matrix Manipulation, and Puzzle-24.

Acknowledgments

The Q programming experiments use the qqWen family of models released by Morgan Stanley. We thank them for making these models publicly available.
