Awesome-RL-for-Multimodal-Foundation-Models


This repository accompanies our survey paper:
Reinforcement Learning for Multimodal Foundation Models: A Survey


🔔 News

  • [2026-01-28] To better define the scope of the claim, we have renamed the title to "Reinforcement Learning for Multimodal Foundation Models: A Survey".
  • [2025-08-13] We released "Reinforcement Learning for Large Model: A Survey", the first comprehensive survey dedicated to the emerging "RL for large models" paradigm.
  • [2025-08-13] We reorganized the repository and aligned the classifications in the survey.
  • [2025-06-08] We created this repository to maintain a paper list on Awesome-Visual-Reinforcement-Learning. Everyone is welcome to push and update related work!

🤔 What is Reinforcement Learning for Multimodal Foundation Models?

Reinforcement Learning for Multimodal Foundation Models enables agents to learn decision-making policies directly from visual observations (e.g., images or videos), rather than structured state inputs. It lies at the intersection of reinforcement learning and computer vision, with applications in robotics, embodied AI, games, and interactive environments.
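The defining trait described above — a policy acting on raw pixels rather than a structured state vector — can be made concrete with a toy sketch. This is an illustrative example, not code from the survey; the stub environment, observation shape, and random policy are all assumptions standing in for a real environment and a learned network:

```python
import numpy as np

def random_policy(pixels: np.ndarray, n_actions: int) -> int:
    """Stand-in for a learned policy network mapping an HxWxC image to an action."""
    assert pixels.ndim == 3  # visual RL: the observation is an image, not a state vector
    return int(np.random.randint(n_actions))

class StubEnv:
    """Hypothetical environment that emits pixel observations (e.g., game frames)."""
    n_actions = 4

    def reset(self) -> np.ndarray:
        return np.zeros((64, 64, 3), dtype=np.uint8)

    def step(self, action: int):
        obs = np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8)
        reward, done = 1.0, False
        return obs, reward, done

# A minimal interaction loop: observe pixels, act, collect reward.
env = StubEnv()
obs = env.reset()
total = 0.0
for _ in range(10):
    action = random_policy(obs, env.n_actions)
    obs, reward, done = env.step(action)
    total += reward
print(total)  # 10.0
```

In practice the random policy is replaced by a vision encoder plus policy head trained with an RL objective (e.g., PPO or GRPO), which is exactly the design space the papers in this list explore.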

📌 Project Description

Awesome-Visual-Reinforcement-Learning is a curated list of papers, libraries, and resources on learning control policies from visual input. It aims to help researchers and practitioners navigate the fast-evolving Visual RL landscape — from perception and representation learning to policy learning and real-world applications.

We structure this collection along a trajectory of visual RL. The taxonomy chart below groups existing work by high-level domain (MLLMs, visual generation, unified models, and vision-language-action agents), and then by finer-grained tasks, illustrating representative papers for each branch:

Taxonomy chart

📚 Table of Contents

Libraries and tools

Benchmarks, environments, and datasets for Visual RL

MLLM

Multi-Agent

Multi-Modal Large Language Models with RL

Conventional RL-based Frameworks for MLLM

Definition: We refer to conventional RL-based MLLMs as approaches that apply reinforcement learning primarily to align a vision–language backbone with verifiable, task-level rewards, without explicitly modeling multi-step chain-of-thought reasoning.

Spatial & 3D Perception with RL for MLLMs

Definition: Perception-centric work applies RL to sharpen object detection, segmentation, and grounding without engaging in lengthy chain-of-thought reasoning.

Image Reasoning with RL for MLLMs

Think about Image
Think with Image

Definition: Thinking with Images elevates the picture to an active, external workspace: models iteratively generate, crop, highlight, sketch, or insert explicit visual annotations as tokens in their chain-of-thought, thereby aligning linguistic logic with grounded visual evidence.

Video Reasoning with RL for MLLMs

Visual Generation with RL

Definition: Studies RL agents that generate or manipulate visual content to achieve goals or enable creative visual tasks.

Image Generation

Image Editing

Video Generation

3D Generation

RL for Unified Model

Unified RL

Task Specific RL

Vision Language Action Models with RL

GUI Interaction

Visual Navigation

Visual Manipulation

Discussion: Why RL Works for Vision

RL versus SFT: Why Do We Need RL?

Others

Pretrain with RL

Representation Learning

Audio Question Answering with RL

RL for Medical Reasoning

Visual World Models with RL

Definition: Learn predictive models of environment dynamics from visual inputs to enable planning and long-horizon reasoning in RL.

Blog

Learning Course

Other Survey

Acknowledgements

This template is provided by Awesome-Video-Diffusion. Our approach also builds upon numerous contributions from prior resources, such as Awesome Visual RL.

Contribute!

🔥 This project is actively maintained, and we welcome your contributions. If you have any suggestions, such as missing papers or information, please feel free to open an issue or submit a pull request.

🤖 Try our Awesome-Paper-Agent. Just provide an arXiv URL, and it will automatically return formatted information, like this:

User:
https://arxiv.org/abs/2312.13108

GPT:
+ [AssistGUI: Task-Oriented Desktop Graphical User Interface Automation](https://arxiv.org/abs/2312.13108) (Dec. 2023)

  [![Star](https://img.shields.io/github/stars/showlab/assistgui.svg?style=social&label=Star)](https://github.com/showlab/assistgui)
  [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2312.13108)
  [![Website](https://img.shields.io/badge/Website-9cf)](https://showlab.github.io/assistgui/)

You can then easily copy and paste this information into your pull request.

⭐ If you find this repository useful, please give it a star.

⭐ Citation

If you have any suggestions (missing papers, new papers, or typos), please feel free to edit and submit a pull request. Even just suggesting paper titles is a great contribution — you can also open an issue or contact us via email (weijiawu96@gmail.com).

If you find our survey and this repository useful for your research, please consider citing our work:

@article{wu2025reinforcement,
  title={Reinforcement Learning in Vision: A Survey},
  author={Wu, Weijia and Gao, Chen and Chen, Joya and Lin, Kevin Qinghong and Meng, Qingwei and Zhang, Yiming and Qiu, Yuke and Zhou, Hong and Shou, Mike Zheng},
  journal={arXiv preprint arXiv:2508.08189},
  year={2025}
}

Star History

Star History Chart
