Online Experiential Learning for Language Models

This repository contains the implementation for our paper "Online Experiential Learning for Language Models".

The code is built on VeRL. We provide online experiential learning code for two environments: Frozen Lake and Sokoban.

On-Policy Context Distillation is our preceding work. We open-source its code at OPCD-Code, which includes mathematical reasoning, text-based game tasks for experiential knowledge distillation, and system prompt distillation. Off-policy context distillation is also implemented in that codebase. Feel free to refer to it if needed.

🚀 Environment Setup

If you use A100, H100 or H200:

bash run_docker.sh
cd /tmp ; git clone --depth 1 https://github.com/microsoft/LMOps.git
cd /tmp/LMOps/oel
bash setup.sh
bash ray_node_setup.sh

If you use B200:

bash run_docker_b200.sh
cd /tmp ; git clone --depth 1 https://github.com/microsoft/LMOps.git
cd /tmp/LMOps/oel
bash setup_b200.sh
bash ray_node_setup.sh
source .venv/bin/activate
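
If you are unsure which of the two setup paths applies to a machine, the choice above can be sketched as a small lookup on the GPU model name. This is only an illustrative sketch: `pick_setup` is a hypothetical helper that is not part of this repository, and it only reports which scripts the instructions above pair with each GPU family.

```shell
# Hypothetical helper (not part of the repository): map a GPU model name
# to the docker/setup script pair listed in the instructions above.
pick_setup() {
  case "$1" in
    *B200*) echo "run_docker_b200.sh setup_b200.sh" ;;
    *A100*|*H100*|*H200*) echo "run_docker.sh setup.sh" ;;
    *) echo "unknown" ;;  # GPU families not covered by this README
  esac
}

# If nvidia-smi is available, report the pair for the first GPU on this host.
if command -v nvidia-smi >/dev/null 2>&1; then
  gpu_name="$(nvidia-smi --query-gpu=name --format=csv,noheader | head -n 1)"
  echo "Detected ${gpu_name}: $(pick_setup "${gpu_name}")"
fi
```

The helper only prints script names. Run the scripts themselves in the order shown above.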

📖 Code Walkthrough

Main Entry Point: Ray Trainer

Rollout: Rollout and TextGame Rollout

Update Policy: Update and Reverse KL

📦 Usage

First, log in to your wandb account by exporting your project name and API key:

export WANDB_PROJECT=${YOUR_WANDB_PROJECT} ; export WANDB_API_KEY=${YOUR_WANDB_KEY}
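
Before launching a long job, it can help to confirm that both variables are actually set in the current shell. The check below is a minimal sketch. `check_wandb_env` is a hypothetical helper, not something the repository scripts provide or require.

```shell
# Hypothetical pre-flight check (not part of the repository):
# returns non-zero if either wandb variable is empty or unset.
check_wandb_env() {
  status=0
  [ -n "${WANDB_PROJECT:-}" ] || { echo "WANDB_PROJECT is not set" >&2; status=1; }
  [ -n "${WANDB_API_KEY:-}" ] || { echo "WANDB_API_KEY is not set" >&2; status=1; }
  return "$status"
}
```

For example, `check_wandb_env && bash scripts/textgame_generate_deploy.sh ...` would skip the launch when a variable is missing.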

Sokoban, Qwen3-4B-Instruct-2507 (non-thinking model)

Round 1

# 1. Experiential knowledge extraction.
# The 'CKPT' passed in actually represents different random seeds.
# You can split them into multiple jobs.
# We accumulate 100 steps (to show the saturation) and use 50 steps to consolidate.
# No training, just inference.
bash scripts/textgame_extract_inturn.sh "oel-sokoban-q3-4b-ins-ext-v4-selwp-round1,50,500,50,Qwen/Qwen3-4B-Instruct-2507,,v4,100,True,8192,Sokoban-v0,1024,5,True,1," ; python tools/make_exp_list.py "oel-sokoban-q3-4b-ins-ext-v4-selwp-round1,50,500,50,100,50"

# 2. Collection of user trajectories, to construct partial rollouts later.
# No training, just inference.
bash scripts/textgame_generate_deploy.sh --model Qwen/Qwen3-4B-Instruct-2507 --exp_name oel-sokoban-q3-4b-ins-round1-deploy --nnodes 1 --oel_round 1 --experience_max_length 8192 --textgame_name Sokoban-v0 --max_response_length 1024 --textgame_max_steps 5 --textgame_no_think True --total_training_steps 100

# 3. Experiential knowledge consolidation.
# Training.
bash scripts/textgame_consolidate.sh --model Qwen/Qwen3-4B-Instruct-2507 --exp_name oel-sokoban-q3-4b-ins-v4-selwp-lr1e-6-round1 --nnodes 2 --oel_round 1 --kl_loss_type full --kl_topk 256 --actor_lr 1e-6 --experience_max_length 8192 --textgame_name Sokoban-v0 --max_response_length 1024 --textgame_max_steps 5 --textgame_no_think True --deploy_save_dir /tmp/oel-sokoban-q3-4b-ins-round1-deploy/deploy_data --exp_path /tmp/oel-sokoban-q3-4b-ins-ext-v4-selwp-round1/experience_list.txt --total_training_steps 100 --save_freq 2

# (Optional) Evaluate checkpoints during consolidation.
# No training, just inference.
bash scripts/textgame_eval_inturn.sh "oel-sokoban-q3-4b-ins-v4-selwp-lr1e-6-round1,2,100,2,Qwen/Qwen3-4B-Instruct-2507,false,1024,Sokoban-v0,5,true"

Round 2

# 1. Experiential knowledge extraction.
bash scripts/textgame_extract_inturn.sh "oel-sokoban-q3-4b-ins-ext-v4-selwp-round2,50,500,50,oel-sokoban-q3-4b-ins-v4-selwp-lr1e-6-round1,100,v4,100,True,8192,Sokoban-v0,1024,5,True,2," ; python tools/make_exp_list.py "oel-sokoban-q3-4b-ins-ext-v4-selwp-round2,50,500,50,100,50"

# 2. Collection of user trajectories, to construct partial rollouts later.
bash scripts/textgame_generate_deploy.sh --resume_policy_name oel-sokoban-q3-4b-ins-v4-selwp-lr1e-6-round1 --resume_policy_ckpt 100 --exp_name oel-sokoban-q3-4b-ins-round2-deploy --nnodes 1 --oel_round 2 --experience_max_length 8192 --textgame_name Sokoban-v0 --max_response_length 1024 --textgame_max_steps 5 --textgame_no_think True --total_training_steps 100

# 3. Experiential knowledge consolidation.
bash scripts/textgame_consolidate.sh --resume_policy_name oel-sokoban-q3-4b-ins-v4-selwp-lr1e-6-round1 --resume_policy_ckpt 100 --exp_name oel-sokoban-q3-4b-ins-v4-selwp-lr1e-6-round2 --nnodes 2 --oel_round 2 --kl_loss_type full --kl_topk 256 --actor_lr 1e-6 --experience_max_length 8192 --textgame_name Sokoban-v0 --max_response_length 1024 --textgame_max_steps 5 --textgame_no_think True --deploy_save_dir /tmp/oel-sokoban-q3-4b-ins-round2-deploy/deploy_data --exp_path /tmp/oel-sokoban-q3-4b-ins-ext-v4-selwp-round2/experience_list.txt --total_training_steps 100 --save_freq 2

Round 3

# 1. Experiential knowledge extraction.
bash scripts/textgame_extract_inturn.sh "oel-sokoban-q3-4b-ins-ext-v4-selwp-round3,50,500,50,oel-sokoban-q3-4b-ins-v4-selwp-lr1e-6-round2,100,v4,100,True,8192,Sokoban-v0,1024,5,True,3," ; python tools/make_exp_list.py "oel-sokoban-q3-4b-ins-ext-v4-selwp-round3,50,500,50,100,50"

# 2. Collection of user trajectories, to construct partial rollouts later.
bash scripts/textgame_generate_deploy.sh --resume_policy_name oel-sokoban-q3-4b-ins-v4-selwp-lr1e-6-round2 --resume_policy_ckpt 100 --exp_name oel-sokoban-q3-4b-ins-round3-deploy --nnodes 1 --oel_round 3 --experience_max_length 8192 --textgame_name Sokoban-v0 --max_response_length 1024 --textgame_max_steps 5 --textgame_no_think True --total_training_steps 100

# 3. Experiential knowledge consolidation.
bash scripts/textgame_consolidate.sh --resume_policy_name oel-sokoban-q3-4b-ins-v4-selwp-lr1e-6-round2 --resume_policy_ckpt 100 --exp_name oel-sokoban-q3-4b-ins-v4-selwp-lr1e-6-round3 --nnodes 2 --oel_round 3 --kl_loss_type full --kl_topk 256 --actor_lr 1e-6 --experience_max_length 8192 --textgame_name Sokoban-v0 --max_response_length 1024 --textgame_max_steps 5 --textgame_no_think True --deploy_save_dir /tmp/oel-sokoban-q3-4b-ins-round3-deploy/deploy_data --exp_path /tmp/oel-sokoban-q3-4b-ins-ext-v4-selwp-round3/experience_list.txt --total_training_steps 100 --save_freq 2
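
Rounds 2 and 3 differ only in the round index and in the names derived from it: each round resumes from checkpoint 100 of the previous round's consolidated policy and reads that round's deploy data and experience list from `/tmp`. The dry-run sketch below prints steps 2 and 3 for a given round using that naming pattern. `print_round_cmds` is a hypothetical helper, not part of the repository. It assumes the order of named flags does not matter to the scripts, and it leaves out step 1 because the fields of the positional extraction argument are not documented here.

```shell
# Hypothetical dry-run helper (not part of the repository): print the
# deploy and consolidation commands for Sokoban round N (N >= 2).
# Names and flag values are copied from the Round 2 and Round 3 commands above.
print_round_cmds() {
  round="$1"
  prev=$((round - 1))
  policy_prefix="oel-sokoban-q3-4b-ins-v4-selwp-lr1e-6-round"
  deploy_name="oel-sokoban-q3-4b-ins-round${round}-deploy"
  extract_name="oel-sokoban-q3-4b-ins-ext-v4-selwp-round${round}"

  # Flags shared by the deploy and consolidation steps.
  common="--oel_round ${round} --experience_max_length 8192 --textgame_name Sokoban-v0 --max_response_length 1024 --textgame_max_steps 5 --textgame_no_think True --total_training_steps 100"
  # Resume from checkpoint 100 of the previous round's consolidated policy.
  resume="--resume_policy_name ${policy_prefix}${prev} --resume_policy_ckpt 100"

  # Step 2: collection of user trajectories.
  echo "bash scripts/textgame_generate_deploy.sh ${resume} --exp_name ${deploy_name} --nnodes 1 ${common}"
  # Step 3: consolidation (flag order differs from the README; assumed not to matter).
  echo "bash scripts/textgame_consolidate.sh ${resume} --exp_name ${policy_prefix}${round} --nnodes 2 --kl_loss_type full --kl_topk 256 --actor_lr 1e-6 ${common} --deploy_save_dir /tmp/${deploy_name}/deploy_data --exp_path /tmp/${extract_name}/experience_list.txt --save_freq 2"
}

# Print the two commands for round 3 without running them.
print_round_cmds 3
```

Because the helper only echoes the commands, you can compare its output against the lines above before running anything.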

Frozen Lake, Qwen3-1.7B, Qwen3-4B, Qwen3-8B (thinking model)

See usage_example.sh.

📄 Citation

If you find this work useful, please cite our paper:

@article{ye2026onlineexperientiallearninglanguage,
    title={Online Experiential Learning for Language Models}, 
    author={Tianzhu Ye and Li Dong and Qingxiu Dong and Xun Wu and Shaohan Huang and Furu Wei},
    year={2026},
    eprint={2603.16856},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2603.16856}, 
}
