Authors: Andre He, Daniel Fried, Sean Welleck
This repository contains the official implementation for the paper "Rewarding the Unlikely: Lifting GRPO Beyond Distribution Sharpening".
This codebase is based on veRL and is designed for RL training of Lean-based theorem-proving models. The Lean verifier is integrated into the online RL loop to provide rewards for model-generated proofs.
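In outline, the reward signal works like this: each model-generated proof is sent to the Lean verifier, and the binary pass/fail result becomes the reward. A minimal sketch of this idea, where `verify_proof` is a hypothetical stand-in for the actual verifier call in `verl/lean/verifier.py`:

```python
# Sketch of a verifier-backed reward, assuming a binary pass/fail check.
# `verify_proof` is a hypothetical stand-in for the real Lean verifier call.
from typing import Callable, List

def compute_rewards(
    proofs: List[str],
    verify_proof: Callable[[str], bool],
) -> List[float]:
    """Assign reward 1.0 to proofs the verifier accepts, 0.0 otherwise."""
    return [1.0 if verify_proof(p) else 0.0 for p in proofs]

# Usage with a stub verifier that accepts proofs ending in a closing tactic:
stub = lambda proof: proof.strip().endswith("rfl")
print(compute_rewards(["example : 1 = 1 := by rfl", "sorry"], stub))
# → [1.0, 0.0]
```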
Follow the installation and setup instructions provided in the veRL repository to initialize your environment.
This project uses a verifier adapted from DeepSeek-Prover-V1.5. To get started:
- Clone and install the DeepSeek-Prover-V1.5 repository.
- Either install it globally or update the path to the verifier in `verl/lean/verifier.py`.
- Install the Lean REPL, which is required to interact with the Lean environment during proof checking.
- To launch RL training with GRPO, run the script `examples/lean/grpo.sh`.
- The core training loop and the implementation of the unlikeliness reward are in `verl/trainer/ppo/ray_lean_trainer.py`.
- To evaluate trained checkpoints via sampling, run `examples/inference/run_sampling.sh`.
- To analyze results, compute pass rates, and generate performance plots, see `misc/paper_figures.ipynb`.
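For orientation, the core idea of the unlikeliness reward is to counteract GRPO's tendency to sharpen onto already-likely solutions by boosting the reward of correct proofs that the policy assigns low probability. The sketch below is illustrative only, not the implementation in `ray_lean_trainer.py`; the rank-based rule and the `boost` coefficient are assumptions:

```python
# Illustrative sketch of an unlikeliness-style reward: among correct samples
# in a group, boost those the policy considered least likely. The rank-based
# rule and the `boost` value are assumptions, not the paper's exact formula.
from typing import List

def unlikeliness_rewards(
    correct: List[bool],
    logprobs: List[float],
    boost: float = 0.5,
) -> List[float]:
    """Base reward 1.0 for correct samples; add `boost` to correct samples
    whose sequence log-probability falls below the median of the correct
    group, so rare-but-valid proofs are reinforced more strongly."""
    correct_lps = sorted(lp for ok, lp in zip(correct, logprobs) if ok)
    if not correct_lps:
        return [0.0] * len(correct)
    median = correct_lps[len(correct_lps) // 2]
    return [
        (1.0 + (boost if lp < median else 0.0)) if ok else 0.0
        for ok, lp in zip(correct, logprobs)
    ]

# The unlikely-but-correct second sample gets the boosted reward:
print(unlikeliness_rewards([True, True, False], [-5.0, -50.0, -10.0]))
# → [1.0, 1.5, 0.0]
```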
To train on your own collection of Lean theorems, you must first convert your dataset into a .parquet file. Examples of dataset formats and splits can be found in the data directory:
- `minif2f_train` and `minif2f_test`: standard splits of MiniF2F.
- `mff-lwb-goedel-28k.parquet`: a concatenation of `minif2f_train` with ~30k problems from Lean-Workbook that were successfully solved by Goedel-Prover, excluding problems held out for validation.
- `mff-lwb-10k-seen`: a similarly constructed dataset using theorems solved by InternLM2.5-StepProver.
These datasets do not include proofs found by prior work; they are only used to identify provable theorems.