Free Form R1 Training

This repository contains the code and resources for reinforcement-learning training of models on long-form, free-form generation tasks, accompanying our paper Semantically-Aware Rewards for Open-Ended R1 Training in Free-Form Generation.

TODOs:

  • Upload the dev dataset.
  • Upload sample model-generated answers.
  • Upload the vLLM response inference code.

[📖Paper] [🤗RewardBERTModel] [Training Curves]

🚀 Getting Started

To get started with this project, follow the steps below to clone the repository and set up your environment.

1. Clone the Repository

git clone https://github.com/zli12321/long_form_rl.git
cd long_form_rl/OpenRLHF

2. Install Dependencies

Install the necessary Python packages with pip: OpenRLHF in editable mode, plus the qa-metrics package.

pip install -e .
pip install qa-metrics
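
As an optional sanity check that both installs resolved, you can run a tiny import probe (a minimal sketch; it only assumes the packages above expose the import names openrlhf and qa_metrics):

# check_install.py -- hypothetical helper, not part of the repo
import importlib

for name in ("openrlhf", "qa_metrics"):
    module = importlib.import_module(name)       # raises ImportError if the install did not succeed
    print(f"{name} found at {module.__file__}")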

3. Train the Free-Form Reward Model

cd train_reward_bert
python reward_bert.py
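
The details of reward_bert.py live in the repository itself; purely as a hypothetical sketch of what fine-tuning a BERT-style scalar reward model can look like (the backbone, data schema, and hyperparameters below are illustrative assumptions, not the paper's settings):

# reward_bert_sketch.py -- hypothetical illustration; the repo's reward_bert.py may differ
import torch
from torch.utils.data import DataLoader, Dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

class PairDataset(Dataset):
    """(question, answer, human score) rows; this schema is an assumption."""
    def __init__(self, rows, tokenizer):
        self.rows, self.tokenizer = rows, tokenizer
    def __len__(self):
        return len(self.rows)
    def __getitem__(self, i):
        question, answer, score = self.rows[i]
        enc = self.tokenizer(question, answer, truncation=True, max_length=512,
                             padding="max_length", return_tensors="pt")
        features = {k: v.squeeze(0) for k, v in enc.items()}
        return features, torch.tensor(score, dtype=torch.float)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")        # placeholder backbone
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=1)                                # single scalar reward head
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

rows = [("What causes tides?", "Mostly the Moon's gravity.", 0.9)]    # toy example row
loader = DataLoader(PairDataset(rows, tokenizer), batch_size=8, shuffle=True)

model.train()
for features, target in loader:
    predicted = model(**features).logits.squeeze(-1)                  # predicted reward per example
    loss = torch.nn.functional.mse_loss(predicted, target)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

The single-logit head (num_labels=1) turns the classifier into a scalar scorer, which is one common way to build a semantic reward model; the repo's actual objective may differ.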

🏋️ Training

Once the setup is complete, you can begin training the model using the provided scripts.

1. Navigate to the Training Scripts

cd ../scripts/no-cot

2. Configure Your Training Run

Before launching the training, you must edit the grpo_preferenceBert.sh script to match your environment settings.

Open grpo_preferenceBert.sh and update the following variables:

  • working_dir: the local working directory for the training run
  • remote_rm_url: the address of the remote reward model used to score generations (see the client sketch after this list)
  • save_path: the directory where model checkpoints will be written
  • use_wandb: your Weights & Biases key (or flag) if you want run logging
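
In OpenRLHF-style setups, remote_rm_url typically points at an HTTP service that returns reward scores for sampled responses. The snippet below is only a hypothetical client-side illustration of that idea; the endpoint path, port, and JSON fields are assumptions, not this repository's actual API:

# query_reward.py -- hypothetical client, not part of the repo
import requests

REWARD_URL = "http://127.0.0.1:5000/get_reward"    # placeholder; substitute your remote_rm_url

payload = {"query": ["Question: What causes tides? Answer: Mostly the Moon's gravity."]}
response = requests.post(REWARD_URL, json=payload, timeout=30)
response.raise_for_status()
print(response.json())                             # e.g. {"rewards": [0.87]} under this assumed schema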

3. Run the Training Script

./grpo_preferenceBert.sh

📈 Evaluation

Evaluation procedures are currently under development and will be released soon.

The planned evaluation method involves using our provided template to prompt GPT-4o to generate scores twice for each output. The final score will be the average of the two generated scores.
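
Until that code is released, a rough sketch of the protocol might look like the following (the prompt template, score range, and parsing below are placeholders, not the paper's actual template):

# gpt4o_eval_sketch.py -- hypothetical, pending the official evaluation release
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def score_once(question: str, answer: str) -> float:
    """Ask GPT-4o for one numeric score; the prompt is a stand-in for the paper's template."""
    prompt = (
        "Rate the following answer from 1 to 10 and reply with only the number.\n"
        f"Question: {question}\nAnswer: {answer}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,
    )
    return float(response.choices[0].message.content.strip())

def final_score(question: str, answer: str) -> float:
    """Score twice and average, mirroring the protocol described above."""
    return (score_once(question, answer) + score_once(question, answer)) / 2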

📝 Notes

  • Please ensure all script paths and configurations are adjusted to fit your specific setup.
  • If you encounter any issues or have questions, please feel free to open an issue or submit a pull request on our GitHub repository. We welcome your contributions!

Citations

If you find our work helpful for your research, please consider citing our paper:

@misc{li2025semanticallyawarerewardsopenendedr1,
      title={Semantically-Aware Rewards for Open-Ended R1 Training in Free-Form Generation}, 
      author={Zongxia Li and Yapei Chang and Yuhang Zhou and Xiyang Wu and Zichao Liang and Yoo Yeon Sung and Jordan Lee Boyd-Graber},
      year={2025},
      eprint={2506.15068},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2506.15068}, 
}
