Reinforcement Learning from Verifiable Rewards (RLVR) has driven remarkable advances in LLM reasoning, but suffers from a fundamental problem: exploration inefficiency. When a model cannot reliably produce correct rollouts — whether due to limited pretraining coverage, task complexity, or unfamiliar problem structure — it receives minimal learning signal, leading to slow or failed convergence.
CBRL addresses this by leveraging in-context learning to bootstrap exploration. A small bank of few-shot examples is dynamically injected into training prompts with decreasing probability. Early in training, these examples guide the model toward successful rollouts, providing the learning signal needed to acquire new capabilities. As training advances, the injection probability anneals to zero, and the model learns to perform independently — the bootstrapped behaviors persist without long-term dependence on in-context examples.
CBRL is algorithm-agnostic, yielding consistent gains with both GRPO and RLOO across all tested settings.
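To make the exploration problem concrete, the sketch below shows simplified forms of the GRPO (group-normalized) and RLOO (leave-one-out) advantage computations. It is an illustration only and is not taken from this repository or from verl: the function names are hypothetical, and real implementations differ in details such as the standard-deviation estimator and clipping. The point is that when every rollout in a group receives the same reward (for example, all incorrect), both estimators return zero advantage, so the policy receives no gradient signal.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Simplified GRPO-style advantage: reward standardized within the group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]


def rloo_advantages(rewards):
    """Simplified RLOO advantage: reward minus the mean of the other rollouts."""
    n = len(rewards)
    if n < 2:
        raise ValueError("RLOO needs at least two rollouts per prompt")
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]


# A group where no rollout is correct carries no learning signal.
all_wrong = [0.0, 0.0, 0.0, 0.0]
print(grpo_advantages(all_wrong))
print(rloo_advantages(all_wrong))

# A single correct rollout is enough to produce non-zero advantages.
one_right = [1.0, 0.0, 0.0, 0.0]
print(grpo_advantages(one_right))
print(rloo_advantages(one_right))
```

Under this view, injecting few-shot examples early in training raises the chance that at least one rollout in a group is correct, which is what turns a zero-advantage group into a useful one.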
We validate CBRL on two families of tasks:
- Reasoning Gym — 5 procedural reasoning tasks (ARC-1D, Word Sorting, Spell Backward, Matrix Manipulation, Puzzle-24) with Qwen and Llama models. CBRL improves accuracy on all 10 model-environment pairs, with gains from +1.3% to +22.3%.
- Q Programming — A domain-specific language for time-series databases, featuring right-to-left evaluation, implicit typing, and terse array-oriented syntax that diverges from conventions in pretraining corpora. Using qqWen-7B (Morgan Stanley), CBRL improves test-pass rate (27.3% → 43.0%) and Pass@1 (5.0% → 26.3%) over standard GRPO.
├── cbrl/ # CBRL training library (built on verl)
│ ├── trainers/ # GRPO trainer with few-shot injection
│ └── utils/ # RL dataset with example injection
├── evaluations/ # Evaluation scripts (pass@k, few-shot eval)
├── data_preprocess/ # Dataset preparation scripts
├── config/ # Hydra training configs
├── scripts/ # Training and evaluation scripts (SLURM + bash)
├── q_reward.py # Q program execution reward function
├── reasoning_gym/ # Reasoning Gym experiments (separate README)
└── legacy/ # Archived code
- Python 3.10+
- CUDA-compatible GPUs (tested on NVIDIA GH200)
- verl (RL training framework)
- KDB/Q interpreter (for Q programming experiments)
git clone <repository-url>
cd CBRL
# Install verl (we used v0.3.0.post2 with vLLM 0.8.5)
pip install verl
# Install dependencies
pip install torch==2.6.0 vllm==0.8.5 transformers pyarrow tqdm
# Install flash attention
pip install --no-build-isolation flash-attn==2.7.4.post1
Set these once before running any scripts. The data preparation scripts write their outputs into $DATA_DIR, and the training and evaluation scripts read from it.
# Required
export DATA_DIR=/path/to/data # All datasets and few-shot files go here
export CHECKPOINT_DIR=/path/to/checkpoints # Training checkpoints saved here
export HF_HOME=/path/to/huggingface_cache # HuggingFace model cache
# Q programming experiments
export QHOME=~/q # KDB/Q installation
export Q_INTERPRETER_PATH=$QHOME/l64arm/q # Path to Q binary
# SLURM users only
export SLURM_ACCOUNT=<your-account> # Your cluster allocation
export SLURM_PARTITION=<your-partition> # Your cluster partition
During RL training, CBRL prepends curated solved examples to the prompt before the model generates a response:
- Example selection — A configurable number of few-shot examples (default: 2) are sampled from a bank and prepended to the training prompt.
- Probability annealing — Injection probability starts high (e.g., 0.5) and decays linearly to 0.0 over training. Early on, a large share of prompts (about half at a starting probability of 0.5) include demonstrations; by the end, the model solves problems entirely on its own.
This schedule provides strong learning signal when the model needs it most, then removes the scaffolding so the model internalizes the reasoning patterns rather than relying on in-context examples.
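The two mechanisms above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in cbrl/: the function names, the `problem`/`solution` field names, and the demonstration template are hypothetical, and the schedule is assumed to be indexed by training step.

```python
import random


def injection_probability(step, total_steps, p_start=0.5, p_end=0.0):
    """Linearly anneal the injection probability from p_start to p_end."""
    if total_steps <= 1:
        return p_end
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    return p_start + (p_end - p_start) * frac


def maybe_inject(prompt, bank, step, total_steps,
                 num_examples=2, p_start=0.5, rng=random):
    """Prepend sampled few-shot examples with the annealed probability.

    `bank` is assumed to be a list of dicts with hypothetical keys
    "problem" and "solution".
    """
    p = injection_probability(step, total_steps, p_start=p_start)
    if not bank or rng.random() >= p:
        return prompt  # no injection: the model sees the bare prompt
    shots = rng.sample(bank, k=min(num_examples, len(bank)))
    demos = "\n\n".join(
        f"Problem: {s['problem']}\nSolution: {s['solution']}" for s in shots
    )
    return f"{demos}\n\n{prompt}"


bank = [
    {"problem": "example problem 1", "solution": "example solution 1"},
    {"problem": "example problem 2", "solution": "example solution 2"},
    {"problem": "example problem 3", "solution": "example solution 3"},
]
for step in (0, 50, 99):
    print(step, injection_probability(step, total_steps=100))
print(maybe_inject("Solve this task.", bank, step=99, total_steps=100))
```

Because the probability reaches exactly 0.0 at the final step, the last prompts are never augmented, which matches the goal of evaluating and deploying the model without in-context examples.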
Prepare the Q programming dataset (writes train/test parquets into $DATA_DIR/sft_python_q_problems/):
python3 -m data_preprocess.sft_python_q_problems \
--output_dir $DATA_DIR/sft_python_q_problems
Generate few-shot examples for CBRL (writes $DATA_DIR/qprog_fewshots.json):
python3 -m data_preprocess.make_qprog_fewshots
python3 -m data_preprocess.annotate_qprog_fewshots_tags \
--parquet_dir $DATA_DIR/sft_python_q_problems \
--fewshots_path $DATA_DIR/qprog_fewshots.json
Training uses GRPO via the verl framework. All scripts should be run from the repository root.
With SLURM:
Note: Update <your-account> and <your-partition> in scripts/submit_*.sbatch to match your cluster, or override them via sbatch --account=... --partition=....
# Baseline GRPO
sbatch --export=ALL,MODEL=morganstanley/qqWen-7B-pretrain,EXPERIMENT_NAME=baseline,FORMAT=raw \
scripts/submit_grpo.sbatch
# CBRL GRPO (with few-shot injection)
sbatch --export=ALL,MODEL=morganstanley/qqWen-7B-pretrain,EXPERIMENT_NAME=cbrl,FORMAT=raw,CBRL=true \
scripts/submit_grpo.sbatch
Without SLURM:
# Baseline GRPO
MODEL=morganstanley/qqWen-7B-pretrain EXPERIMENT_NAME=baseline FORMAT=raw \
bash scripts/run_grpo.sh
# CBRL GRPO (with few-shot injection)
MODEL=morganstanley/qqWen-7B-pretrain EXPERIMENT_NAME=cbrl FORMAT=raw CBRL=true \
FEWSHOT_JSON=$DATA_DIR/qprog_fewshots.json \
bash scripts/run_grpo.sh

| Parameter | Description | Default |
|---|---|---|
| MODEL | HuggingFace model path (required) | — |
| EXPERIMENT_NAME | W&B experiment name (required) | — |
| FORMAT | raw (pretrain) or chat (instruct) | chat |
| CBRL | Enable few-shot injection | false |
| CHECKPOINT_PREFIX | Checkpoint directory prefix | grpo |
| FEWSHOT_JSON | Path to few-shot JSON (required if CBRL=true) | — |
| NUM_GPUS | Number of GPUs (bash only) | 4 |

Evaluate a trained checkpoint (converts FSDP to HF format, then runs pass@k):
# SLURM
sbatch --export=ALL,CHECKPOINT_NAME=<name>,STEP=<step> scripts/submit_eval.sbatch
# Bash
CHECKPOINT_DIR=$CHECKPOINT_DIR/<name> STEP=512 bash scripts/run_eval.sh
Evaluate a HuggingFace model directly:
# SLURM
sbatch --export=ALL,MODEL=morganstanley/qqWen-7B-pretrain,EVAL_NAME=pretrain scripts/submit_eval_baseline.sbatch
# Bash
MODEL=morganstanley/qqWen-7B-pretrain EVAL_NAME=pretrain \
PARQUET_PATH=$DATA_DIR/sft_python_q_problems/test.parquet \
bash scripts/run_eval_baseline.sh
Add FEWSHOT=true and FEWSHOT_JSON=$DATA_DIR/qprog_fewshots.json to evaluate with few-shot prompting.
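For readers unfamiliar with the pass@k metric reported by the evaluation scripts, the sketch below shows the commonly used unbiased estimator: with n sampled solutions per problem of which c pass all tests, pass@k = 1 - C(n-c, k) / C(n, k). This is a reference illustration only; the function names are hypothetical, and the scripts in evaluations/ may compute the metric differently.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.

    n: number of sampled solutions, c: number that pass, k: budget (k <= n).
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k):
    """Average pass@k over problems; `results` is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical counts for three problems, 8 samples each.
results = [(8, 0), (8, 2), (8, 8)]
print(mean_pass_at_k(results, k=1))
print(mean_pass_at_k(results, k=4))
```

With k=1 the estimator reduces to the fraction of passing samples, c / n, averaged over problems.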
See reasoning_gym/README.md for setup, training, and evaluation of the Reasoning Gym experiments across ARC-1D, Word Sorting, Spell Backward, Matrix Manipulation, and Puzzle-24.
The Q programming experiments use the qqWen family of models released by Morgan Stanley. We thank them for making these models publicly available.