Context Bootstrapped Reinforcement Learning (CBRL)

Reinforcement Learning from Verifiable Rewards (RLVR) has driven substantial advances in LLM reasoning, but it suffers from a fundamental problem: exploration inefficiency. When a model cannot reliably produce correct rollouts (because of limited pretraining coverage, task complexity, or unfamiliar problem structure), it receives little learning signal, and training converges slowly or fails.

CBRL addresses this by leveraging in-context learning to bootstrap exploration. A small bank of few-shot examples is dynamically injected into training prompts with decreasing probability. Early in training, these examples guide the model toward successful rollouts, providing the learning signal needed to acquire new capabilities. As training advances, the injection probability anneals to zero, and the model learns to perform independently — the bootstrapped behaviors persist without long-term dependence on in-context examples.

CBRL is algorithm-agnostic, yielding consistent gains with both GRPO and RLOO across all tested settings.
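
To illustrate why failed exploration starves training of signal, the sketch below computes group-relative advantages in the style of GRPO, where each rollout's reward is normalized against the other rollouts for the same prompt. It is a simplified illustration, not the code in cbrl/trainers/; the function name `group_advantages`, the `eps` value, and the use of the population standard deviation are assumptions.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Normalize each reward by the mean and std of its rollout group.

    Assumption: population std and eps=1e-6; real trainers may differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Every rollout fails: all advantages are zero, so no gradient signal.
    print(group_advantages([0.0, 0.0, 0.0, 0.0]))
    # One rollout succeeds: it gets a positive advantage, the rest negative.
    print(group_advantages([0.0, 0.0, 0.0, 1.0]))
```

When every rollout in a group fails, the prompt contributes nothing to the update. A single success in the group restores a nonzero signal, which is what few-shot injection aims to make more likely early in training.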

Key Results

We validate CBRL on two families of tasks:

Reasoning Gym — 5 procedural reasoning tasks (ARC-1D, Word Sorting, Spell Backward, Matrix Manipulation, Puzzle-24) with Qwen and Llama models. CBRL improves accuracy on all 10 model-environment pairs, with gains from +1.3% to +22.3%.
Q Programming — A domain-specific language for time-series databases, featuring right-to-left evaluation, implicit typing, and terse array-oriented syntax that diverges from conventions in pretraining corpora. Using qqWen-7B (Morgan Stanley), CBRL improves test-pass rate (27.3% → 43.0%) and Pass@1 (5.0% → 26.3%) over standard GRPO.

Repository Structure

├── cbrl/                   # CBRL training library (built on verl)
│   ├── trainers/           # GRPO trainer with few-shot injection
│   └── utils/              # RL dataset with example injection
├── evaluations/            # Evaluation scripts (pass@k, few-shot eval)
├── data_preprocess/        # Dataset preparation scripts
├── config/                 # Hydra training configs
├── scripts/                # Training and evaluation scripts (SLURM + bash)
├── q_reward.py             # Q program execution reward function
├── reasoning_gym/          # Reasoning Gym experiments (separate README)
└── legacy/                 # Archived code

Setup

Prerequisites

Python 3.10+
CUDA-compatible GPUs (tested on NVIDIA GH200)
verl (RL training framework)
KDB/Q interpreter (for Q programming experiments)

Installation

git clone <repository-url>
cd CBRL

# Install verl (we used v0.3.0.post2 with vLLM 0.8.5)
pip install verl

# Install dependencies
pip install torch==2.6.0 vllm==0.8.5 transformers pyarrow tqdm

# Install flash attention
pip install --no-build-isolation flash-attn==2.7.4.post1

Environment Variables

Set these once before running any scripts. The data preparation scripts will write outputs into $DATA_DIR, and training/eval scripts read from it.

# Required
export DATA_DIR=/path/to/data              # All datasets and few-shot files go here
export CHECKPOINT_DIR=/path/to/checkpoints # Training checkpoints saved here
export HF_HOME=/path/to/huggingface_cache  # HuggingFace model cache

# Q programming experiments
export QHOME=~/q                           # KDB/Q installation
export Q_INTERPRETER_PATH=$QHOME/l64arm/q  # Path to Q binary (l64arm = Linux ARM64; use l64 on x86-64)

# SLURM users only
export SLURM_ACCOUNT=<your-account>        # Your cluster allocation
export SLURM_PARTITION=<your-partition>    # Your cluster partition
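
A quick way to confirm the required variables are set before launching a job is a small check like the one below. This is a convenience sketch and not part of the repository; the helper name `missing_env_vars` is hypothetical, and the list of names is taken from the "Required" block above.

```python
import os

# Names taken from the "Required" block; extend for Q or SLURM runs.
REQUIRED = ("DATA_DIR", "CHECKPOINT_DIR", "HF_HOME")


def missing_env_vars(names=REQUIRED, env=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All required environment variables are set.")
```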

How CBRL Works

During RL training, CBRL prepends curated solved examples to the prompt before the model generates a response:

Example selection: A configurable number of few-shot examples (default: 2) is sampled from a bank and prepended to the training prompt.
Probability annealing: Injection probability starts high (e.g., 0.5) and decays linearly to 0.0 over training. Early on, a large share of prompts (about half at 0.5) include demonstrations; by the end, the model solves problems entirely on its own.

This schedule provides strong learning signal when the model needs it most, then removes the scaffolding so the model internalizes the reasoning patterns rather than relying on in-context examples.
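
The two steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation in cbrl/utils/: the function names `injection_probability` and `maybe_inject`, the blank-line separator between examples and prompt, and the representation of the bank as a list of strings are all hypothetical.

```python
import random


def injection_probability(step, total_steps, p_start=0.5):
    """Linearly anneal from p_start at step 0 to 0.0 at total_steps."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    frac = min(max(step / total_steps, 0.0), 1.0)
    return p_start * (1.0 - frac)


def maybe_inject(prompt, bank, step, total_steps, k=2, p_start=0.5, rng=None):
    """With the annealed probability, prepend k examples sampled from bank."""
    rng = rng or random.Random()
    p = injection_probability(step, total_steps, p_start)
    # rng.random() is in [0, 1), so p == 0.0 never injects.
    if not bank or rng.random() >= p:
        return prompt
    shots = rng.sample(bank, min(k, len(bank)))
    # Assumed layout: examples first, then the task prompt, blank-line separated.
    return "\n\n".join(shots + [prompt])


if __name__ == "__main__":
    bank = ["Example A ...", "Example B ...", "Example C ..."]
    for step in (0, 250, 500, 1000):
        print(step, injection_probability(step, total_steps=1000))
    print(maybe_inject("Solve: ...", bank, step=0, total_steps=1000,
                       rng=random.Random(0)))
```

Sampling the injection decision per prompt, rather than per batch, means that at any point in training the model sees a mix of scaffolded and unscaffolded prompts, with the mix shifting toward unscaffolded as the probability decays.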

Q Programming Experiments

Data Preparation

Prepare the Q programming dataset (writes train and test Parquet files into $DATA_DIR/sft_python_q_problems/):

python3 -m data_preprocess.sft_python_q_problems \
    --output_dir $DATA_DIR/sft_python_q_problems

Generate few-shot examples for CBRL (writes $DATA_DIR/qprog_fewshots.json):

python3 -m data_preprocess.make_qprog_fewshots
python3 -m data_preprocess.annotate_qprog_fewshots_tags \
    --parquet_dir $DATA_DIR/sft_python_q_problems \
    --fewshots_path $DATA_DIR/qprog_fewshots.json

Training

Training uses GRPO via the verl framework. All scripts should be run from the repository root.

With SLURM:

Note: Update <your-account> and <your-partition> in scripts/submit_*.sbatch to match your cluster, or override via sbatch --account=... --partition=....

# Baseline GRPO
sbatch --export=ALL,MODEL=morganstanley/qqWen-7B-pretrain,EXPERIMENT_NAME=baseline,FORMAT=raw \
    scripts/submit_grpo.sbatch

# CBRL GRPO (with few-shot injection)
sbatch --export=ALL,MODEL=morganstanley/qqWen-7B-pretrain,EXPERIMENT_NAME=cbrl,FORMAT=raw,CBRL=true,FEWSHOT_JSON=$DATA_DIR/qprog_fewshots.json \
    scripts/submit_grpo.sbatch

Without SLURM:

# Baseline GRPO
MODEL=morganstanley/qqWen-7B-pretrain EXPERIMENT_NAME=baseline FORMAT=raw \
    bash scripts/run_grpo.sh

# CBRL GRPO (with few-shot injection)
MODEL=morganstanley/qqWen-7B-pretrain EXPERIMENT_NAME=cbrl FORMAT=raw CBRL=true \
    FEWSHOT_JSON=$DATA_DIR/qprog_fewshots.json \
    bash scripts/run_grpo.sh

| Parameter | Description | Default |
|---|---|---|
| `MODEL` | HuggingFace model path (required) | (none) |
| `EXPERIMENT_NAME` | W&B experiment name (required) | (none) |
| `FORMAT` | `raw` (pretrain) or `chat` (instruct) | `chat` |
| `CBRL` | Enable few-shot injection | `false` |
| `CHECKPOINT_PREFIX` | Checkpoint directory prefix | `grpo` |
| `FEWSHOT_JSON` | Path to few-shot JSON (required if `CBRL=true`) | (none) |
| `NUM_GPUS` | Number of GPUs (bash only) | `4` |

Evaluation

Evaluate a trained checkpoint (converts FSDP to HF format, then runs pass@k):

# SLURM
sbatch --export=ALL,CHECKPOINT_NAME=<name>,STEP=<step> scripts/submit_eval.sbatch

# Bash
CHECKPOINT_DIR=$CHECKPOINT_DIR/<name> STEP=512 bash scripts/run_eval.sh

Evaluate a HuggingFace model directly:

# SLURM
sbatch --export=ALL,MODEL=morganstanley/qqWen-7B-pretrain,EVAL_NAME=pretrain scripts/submit_eval_baseline.sbatch

# Bash
MODEL=morganstanley/qqWen-7B-pretrain EVAL_NAME=pretrain \
    PARQUET_PATH=$DATA_DIR/sft_python_q_problems/test.parquet \
    bash scripts/run_eval_baseline.sh

Add FEWSHOT=true and FEWSHOT_JSON=$DATA_DIR/qprog_fewshots.json to evaluate with few-shot prompting.
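
For readers unfamiliar with the metric, the sketch below shows the commonly used unbiased pass@k estimator: given n sampled completions per problem of which c pass, it estimates the probability that at least one of k samples passes. Whether the scripts in evaluations/ compute pass@k exactly this way is an assumption; the function name `pass_at_k` is hypothetical.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: completions sampled per problem, c: completions that pass, k: budget.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # With k = 1 the estimate reduces to the pass fraction c / n.
    print(pass_at_k(n=10, c=3, k=1))
    print(pass_at_k(n=10, c=3, k=5))
```

The per-problem estimates are then averaged over the test set to give the reported score.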

Reasoning Gym Experiments

See reasoning_gym/README.md for setup, training, and evaluation of the Reasoning Gym experiments across ARC-1D, Word Sorting, Spell Backward, Matrix Manipulation, and Puzzle-24.

Acknowledgments

The Q programming experiments use the qqWen family of models released by Morgan Stanley. We thank them for making these models publicly available.
