
Think or Not? Selective Reasoning via Reinforcement Learning for Vision-Language Models


Updates

  • 2025-09-23: šŸ”„šŸ”„šŸ”„ Our paper is accepted at NeurIPS 2025!
  • 2025-06-16: šŸ”„ Our paper is accepted at the ICML 2025 EXAIT workshop.
  • 2025-05-23: Our paper is available on arXiv.
  • 2025-05-19: We release the next version of the code, models, and data sources.
  • 2025-03-28: We release the TON repo.

Introduction

Reinforcement Learning (RL) has proven to be an effective post-training strategy for enhancing reasoning in vision-language models (VLMs). Group Relative Policy Optimization (GRPO), a prominent recent method, encourages models to generate complete reasoning traces before answering, which increases token usage and computational cost. Inspired by the human thinking process, where people skip reasoning for easy questions but think carefully when needed, we explore how to enable VLMs to first decide when reasoning is necessary. To this end, we propose TON, a two-stage training strategy:

  1. A supervised fine-tuning (SFT) stage with a simple yet effective "thought dropout" operation, where reasoning traces are randomly replaced with empty thoughts. This introduces a think-or-not format that serves as a cold start for selective reasoning (a minimal sketch follows this list).
  2. A GRPO stage that lets the model freely explore when to think and when not to, while maximizing task-aware outcome rewards (an illustrative reward sketch appears after the results paragraph below).
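
A minimal sketch of thought dropout, assuming reasoning traces are wrapped in <think>...</think> tags and an empty thought is two newlines (the function name is hypothetical; see the repo's preprocessing scripts for the exact format):

```python
import random

def thought_dropout(completion: str, p: float = 0.5) -> str:
    """Randomly replace the reasoning trace inside <think>...</think>
    with an empty thought, leaving the answer untouched.
    Hypothetical helper: the tag format and the empty-thought token
    are assumptions; the released SFT data uses p = 0.5."""
    if random.random() < p:
        open_tag, close_tag = "<think>", "</think>"
        start = completion.find(open_tag)
        end = completion.find(close_tag)
        if start != -1 and end != -1:
            completion = completion[: start + len(open_tag)] + "\n\n" + completion[end:]
    return completion

completion = "<think>The scene has 3 cubes and 2 spheres, so 5 objects.</think><answer>5</answer>"
print(thought_dropout(completion, p=0.5))
# ~half the time this prints "<think>\n\n</think><answer>5</answer>"
```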

Experimental results show that TON reduces completion length by up to 90% compared with vanilla GRPO, without sacrificing performance and in some cases improving it. Further evaluations across diverse vision-language tasks, covering a range of reasoning difficulties with both 3B and 7B models, consistently show that the model progressively learns to bypass unnecessary reasoning steps as training advances. These findings shed light on the path toward human-like reasoning patterns in reinforcement learning approaches.
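
For intuition on the GRPO stage above, a task-aware outcome reward can be as simple as a verifiable answer check plus a small format bonus. The sketch below is illustrative only (the tags, weights, and regexes are assumptions; the repo's actual reward functions may differ):

```python
import re

def outcome_reward(completion: str, ground_truth: str) -> float:
    """Toy task-aware outcome reward: correctness of the final answer plus
    a small bonus for a well-formed think/answer layout."""
    # An empty <think></think> block is still well formed, so skipping
    # the thought is never penalized by the format check.
    well_formed = re.fullmatch(
        r"\s*<think>.*?</think>\s*<answer>.*?</answer>\s*",
        completion,
        flags=re.DOTALL,
    ) is not None
    match = re.search(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL)
    answer = match.group(1).strip() if match else ""
    accuracy = 1.0 if answer == ground_truth.strip() else 0.0
    return accuracy + (0.1 if well_formed else 0.0)

print(outcome_reward("<think>\n\n</think><answer>5</answer>", "5"))  # 1.1
```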


Resources

Models

We release our models on both Hugging Face and ModelScope.

| Model | Hugging Face | ModelScope |
| --- | --- | --- |
| šŸ¤— TON-3B-AITZ | TON-3B-AITZ | TON-3B-AITZ |
| šŸ¤— TON-3B-CLEVR | TON-3B-CLEVR | TON-3B-CLEVR |
| šŸ¤— TON-3B-Math | TON-3B-Math | TON-3B-Math |
| šŸ¤— TON-7B-Math | TON-7B-Math | TON-7B-Math |

Datasets

We release our training datasets on both Hugging Face and ModelScope.

| Dataset | Hugging Face | ModelScope |
| --- | --- | --- |
| šŸ¤— AITZ-SFT | AITZ-SFT | AITZ-SFT |
| šŸ¤— AITZ-GRPO | AITZ-GRPO | AITZ-GRPO |
| šŸ¤— Math-SFT | Math-SFT | Math-SFT |
| šŸ¤— Math-GRPO | Math-GRPO | Math-GRPO |
  • Supported evaluations:
  1. AITZ: mobile agent navigation problems
  2. GeoQA-Test: geometry reasoning

Training

The comparison of completion length and skip-think ratio during training between our TON and vanilla GRPO:

(Figures: completion-length and skip-think-ratio training curves, TON vs. vanilla GRPO.)
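
For reference, the two plotted quantities can be computed from a batch of sampled completions roughly as follows (a sketch assuming an empty <think></think> block marks a skipped thought; the repo's actual logging may differ):

```python
import re

def skip_think_ratio(completions: list[str]) -> float:
    """Fraction of completions whose <think> block is empty (whitespace
    only), i.e. where the model chose not to reason."""
    skipped = 0
    for text in completions:
        m = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
        if m is not None and m.group(1).strip() == "":
            skipped += 1
    return skipped / max(len(completions), 1)

def mean_completion_length(completions: list[str]) -> float:
    """Average completion length; characters here, tokens in the paper's plots."""
    return sum(len(text) for text in completions) / max(len(completions), 1)

batch = [
    "<think>\n\n</think><answer>5</answer>",
    "<think>Count the cubes, then the spheres.</think><answer>5</answer>",
]
print(skip_think_ratio(batch), mean_completion_length(batch))  # 0.5 ...
```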

1. Set up the environment:

conda create -n r1-v python=3.11

conda activate r1-v

bash setup.sh

2. Download the training datasets:

  1. Raw datasets: download and unzip the raw data of the three training datasets into the dataset folder, and the evaluation benchmark into the src/eval folder, using the following links.

| Dataset | Download link |
| --- | --- |
| AITZ | https://drive.usercontent.google.com/download?id=12xOV2m62fBUFLhMcWIsFiC6zwV7a2RhI&export=download&authuser=0 |
| GeoQA | https://huggingface.co/datasets/leonardPKU/GEOQA_R1V_Train_8K/viewer |
| CLEVR (item counting) | https://huggingface.co/datasets/leonardPKU/clevr_cogen_a_train |
| Super-CLEVR (item counting) | The official dataset link currently returns a 404 error 😭 |
| GeoQA-test | https://huggingface.co/datasets/Luckyjhg/Geo170K |
  • Download GeoQA-test:
cd src/eval
git lfs install
git clone https://huggingface.co/datasets/Luckyjhg/Geo170K
cd Geo170K
unzip images.zip
  2. Preprocessed SFT datasets: we release our SFT-stage datasets, built with a random thought-dropout ratio of 0.5, on ModelScope. Run the following commands to download them:
modelscope download --dataset wjqkoko/TON-AITZ-SFT --local_dir <your local path>
modelscope download --dataset wjqkoko/TON-Math-SFT --local_dir <your local path>

3. Prepare the SFT and GRPO training JSON datasets

  1. AITZ
  • Preprocess the train split (general domain) of AITZ:
python src/eval/aitz_evaluate/process_data.py
  • Convert the processed data to the SFT and GRPO formats of AITZ in the dataset folder:
python src/eval/aitz_evaluate/sft_grpo_data.py
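
For orientation, one SFT record after thought dropout might look roughly like the dict below. This is a purely illustrative sketch: the field names follow common LLaMA-Factory conventions, the example content is invented, and the authoritative schema is whatever sft_grpo_data.py emits.

```python
# Hypothetical shape of one AITZ SFT example after thought dropout.
# Field names and contents are assumptions, not the script's exact output.
sft_example = {
    "messages": [
        {"role": "user", "content": "<image>Where should I tap to open Settings?"},
        # With probability 0.5 the reasoning trace is kept; otherwise it is
        # replaced with an empty thought, as below:
        {"role": "assistant", "content": "<think>\n\n</think><answer>tap(920, 84)</answer>"},
    ],
    "images": ["dataset/aitz/screenshots/step_0.png"],
}
print(sft_example["messages"][1]["content"])
```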

TODO: CLEVR and GeoQA

4. Download the models

Supported models: Qwen2.5-VL-3B/7B

modelscope download --model Qwen/Qwen2.5-VL-7B-Instruct --local_dir <your local path>
modelscope download --model Qwen/Qwen2.5-VL-3B-Instruct --local_dir <your local path>

We also provide our models trained with TON on AITZ, CLEVR, and GeoQA:

modelscope download --model wjqkoko/TON-7B-Math --local_dir <your local path>
modelscope download --model wjqkoko/TON-3B-Math --local_dir <your local path>
modelscope download --model wjqkoko/TON-3B-AITZ --local_dir <your local path>
modelscope download --model wjqkoko/TON-3B-CLEVR --local_dir <your local path>

5. Train Qwen2.5-VL with our TON

  1. SFT stage: we implement the SFT stage with LLaMA-Factory. Thanks for the comprehensive tools. 😊

  2. GRPO stage: the training command is given below. You need to set:

  • your local model path (QWEN_PATH),
  • the dataset path (HF_DATASET),
  • and the output save path (OUTPUT_DIR).

We also support wandb for monitoring the training process; set the run name via RUN_NAME.

For Counting/GeoQA

sh src/scripts/run_grpo_vllm_qwen25vl.sh

For AITZ

sh src/scripts/run_grpo_vllm_qwen25vl_gui.sh

Note

  1. If you hit an out-of-memory (OOM) error, try reducing --num_generations.

Evaluation

  • AITZ


We currently provide evaluation code for the training datasets in src/scripts/llama_factory_test.sh.

First, download LLaMA-Factory and configure the environment following its README.

Then convert the AITZ raw data to the SFT format following the same steps as above, and edit the dataset name and path in dataset_format.json.

Generate predicted results on AITZ with your model:

sh src/scripts/llama_factory_test.sh

Then score the generated results:

python src/eval/aitz_evaluate/test_qwen25vl_aitz_from_json.py
  • GeoQA and Counting


To evaluate on GeoQA and counting, first transform the data format by running:

python src/eval/parquet_data.py
  1. Evaluate your model on math (GeoQA):
python src/eval/test_qwen25vl_geoqa.py
  2. Evaluate your model on counting:
python src/eval/test_qwen25vl_counting_superclevr.py

TON Team

Jiaqi Wang Ā· Qinghong Lin Ā· Binghui Xie Ā· Dongchi Huang Ā· Ming Hu Ā· Xiaojun Guo Ā· Qixun Wang Ā· Qiguang Chen Ā· James Cheng Ā· Mike Z. Shou

Acknowledgements

We sincerely thank DeepSeek, Open-R1, QwenVL, Open-R1-Multimodal, R1-V (our initial codebase). We sincerely thank Dongchi Huang for his invaluable guidance on the code and for providing essential computational resources. We also appreciate Binghui Xie's insightful discussion on topic selection and idea suggestions. Additionally, we are grateful to Qiguang Chen and Yuxuan Wan for their thoughtful and constructive feedback on this paper. Finally, we extend our gratitude to Xiaojun Guo and Qixun Wang for their valuable advice on visual reasoning and the GRPO series methods.

Citation

If you find this work useful, please cite us:

@misc{wang2025think,
    title={Think or Not? Selective Reasoning via Reinforcement Learning for Vision-Language Models},
    author={Jiaqi Wang and Kevin Qinghong Lin and James Cheng and Mike Zheng Shou},
    year={2025},
    eprint={2505.16854},
    archivePrefix={arXiv},
    primaryClass={cs.AI}
}
