This repository contains the official implementation of FAST-GRPO (Fast-Slow Thinking Group Relative Policy Optimization), a reinforcement learning method that applies fast-slow thinking to both visual and textual reasoning.
```bash
# Clone the repository
git clone https://github.com/Mr-Loevan/FAST-GRPO.git
cd FAST-GRPO

# Create conda environment
conda create -n fast_grpo python=3.11
conda activate fast_grpo

# Install dependencies (refer to the EasyR1 installation guide)
pip install -r requirements.txt
pip install -e .
```

```bash
# Run training with default configuration
bash examples/train_fast_llm.sh
```

FAST-GRPO introduces three key innovations that work together to achieve fast-slow reasoning:
1. Thinking Reward Function

The Thinking Reward Function (`examples/reward_function/thinking_reward.py`) implements an adaptive difficulty-aware reward mechanism:

- Adaptive Difficulty: per-sample difficulty is estimated as
  ```python
  difficulty = (1 - pass_rate) * normalized_complexity
  ```
- Differentiated Rewards (a minimal sketch follows this list):
  - Easy problems (below the 80th difficulty percentile) with a correct answer: rewards concise solutions
  - Hard problems (above the 80th difficulty percentile) with an incorrect answer: rewards exploration effort
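To make the mechanism concrete, here is a minimal sketch of the difficulty-aware reward logic described above. The function name, the length-normalization constant, and the linear bonus shapes are illustrative assumptions, not the exact code in `thinking_reward.py`:

```python
import numpy as np

def thinking_reward(difficulties, correct, lengths, max_length=8192):
    """Illustrative difficulty-aware reward (assumed shape, not the repo's exact code).

    difficulties: per-sample difficulty = (1 - pass_rate) * normalized_complexity
    correct:      per-sample correctness flags
    lengths:      per-sample generated-token counts
    """
    difficulties = np.asarray(difficulties, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    lengths = np.asarray(lengths, dtype=float)

    threshold = np.percentile(difficulties, 80)  # 80th-percentile split from the README
    rewards = np.zeros_like(difficulties)

    # Easy + correct: reward conciseness (shorter solutions score higher).
    easy_correct = (difficulties < threshold) & correct
    rewards[easy_correct] = 1.0 - lengths[easy_correct] / max_length

    # Hard + incorrect: reward exploration (longer reasoning scores higher).
    hard_wrong = (difficulties > threshold) & ~correct
    rewards[hard_wrong] = lengths[hard_wrong] / max_length

    return rewards
```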
2. Dynamic KL Penalty

Implements group-based adaptive KL-divergence control for stable training:

```yaml
# Configuration in config.yaml
algorithm:
  kl_penalty: low_var_kl
  kl_coef: 1.0e-2
  kl_type: "group_accuracy_based"
  kl_min_coef: 0.001  # β_min
  kl_max_coef: 0.01   # β_max
```

- Group-based Adaptation: adjusts the KL coefficient based on group performance, as sketched below
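As a rough illustration of how a group-accuracy-based coefficient could interpolate between β_min and β_max, here is a minimal sketch. The linear interpolation rule, its direction, and the function name are assumptions; the paper and repo define the exact schedule:

```python
def adaptive_kl_coef(group_accuracy: float,
                     kl_min_coef: float = 0.001,
                     kl_max_coef: float = 0.01) -> float:
    """Interpolate the KL coefficient from group accuracy (illustrative rule only).

    Intuition: when a group already answers mostly correctly, a stronger KL
    penalty (closer to beta_max) keeps the policy near the reference model;
    when the group struggles, a weaker penalty permits more exploration.
    """
    group_accuracy = min(max(group_accuracy, 0.0), 1.0)
    return kl_min_coef + (kl_max_coef - kl_min_coef) * group_accuracy

# Example: a group that solves 7 of 8 rollouts gets a near-maximal coefficient.
print(adaptive_kl_coef(7 / 8))  # 0.0088...
```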
3. Slow2Fast Sampling

Progressive curriculum learning that gradually shifts the difficulty of the training samples (see the sketch after this list):

```yaml
# Configuration in config.yaml
algorithm:
  online_filtering: true
  filter_key: accuracy
  dynamic_filter_schedule:
    - epoch_ratio: 0.5
      filter_low: 0.3
      filter_high: 0.99
    - epoch_ratio: 1.0
      filter_low: 0.01
      filter_high: 0.7
```

- Phase 1 (0-50% of training): learn from medium-to-high-difficulty samples for slow thinking
- Phase 2 (50-100% of training): include easy samples for fast thinking
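Below is a minimal sketch of how such a schedule could drive online filtering, assuming each schedule entry applies up to its `epoch_ratio` and that filtering keeps a prompt when its group accuracy falls inside the window. The function names and the group-accuracy input are illustrative assumptions:

```python
# Schedule mirroring the config above: each entry applies up to its epoch_ratio.
SCHEDULE = [
    {"epoch_ratio": 0.5, "filter_low": 0.3, "filter_high": 0.99},
    {"epoch_ratio": 1.0, "filter_low": 0.01, "filter_high": 0.7},
]

def current_filter(progress: float, schedule=SCHEDULE):
    """Return the (low, high) accuracy window for the current training progress.

    progress: fraction of total training completed, in [0, 1].
    """
    for phase in schedule:
        if progress <= phase["epoch_ratio"]:
            return phase["filter_low"], phase["filter_high"]
    return schedule[-1]["filter_low"], schedule[-1]["filter_high"]

def keep_sample(group_accuracy: float, progress: float) -> bool:
    """Online filter: keep a prompt only if its group accuracy is in the window."""
    low, high = current_filter(progress)
    return low <= group_accuracy <= high

# Example: a prompt solved by 20% of rollouts is dropped early in training
# (window [0.3, 0.99]) but kept in the second half (window [0.01, 0.7]).
print(keep_sample(0.2, progress=0.25))  # False
print(keep_sample(0.2, progress=0.75))  # True
```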
```bash
# Use the provided script (recommended)
bash examples/train_fast_llm.sh
```

| Model | Base Model | Download |
|---|---|---|
| FAST-1.5B | DeepSeek-R1-Distill-Qwen-1.5B | ModelScope |
| FAST-3B | Qwen-2.5-VL-3B | ModelScope |
| FAST-7B | Qwen-2.5-VL-7B | Coming Soon |
| FAST-4B | Qwen-3-VL-4B | Coming Soon |
| Method | GSM8K (Acc) | GSM8K (Length) | MATH 500 (Acc) | MATH 500 (Length) | AIME 2024 (Acc) | AIME 2024 (Length) |
|---|---|---|---|---|---|---|
| FAST-1.5B | 86.8 | 851 | 85.8 | 2645 | 34.17 | 8003 |
Note: Acc denotes accuracy (%); Length denotes the number of generated tokens.
If you find this work useful, please cite our paper:

```bibtex
@inproceedings{xiao2025fastslow,
  title={Fast-Slow Thinking {GRPO} for Large Vision-Language Model Reasoning},
  author={Wenyi Xiao and Leilei Gan},
  booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
  year={2025},
  url={https://openreview.net/forum?id=MI1uT5rReV}
}
```

This project is licensed under the Apache 2.0 License.