Awesome-RL-for-Multimodal-Foundation-Models


This repository accompanies our survey paper:
Reinforcement Learning for Multimodal Foundation Models: A Survey


🔔 News

  • [2026-01-28] To better define the scope of the claim, we have renamed the title to "Reinforcement Learning for Multimodal Foundation Models: A Survey".
  • [2025-08-13] We released "Reinforcement Learning for Large Model: A Survey", the first comprehensive survey dedicated to the emerging "RL for large models" paradigm.
  • [2025-08-13] We reorganized the repository and aligned the classifications in the survey.
  • [2025-06-08] We created this repository to maintain a paper list on Awesome-Visual-Reinforcement-Learning. Everyone is welcome to push and update related work!

🤔 What is Reinforcement Learning for Multimodal Foundation Models?

Reinforcement Learning for Multimodal Foundation Models enables agents to learn decision-making policies directly from visual observations (e.g., images or videos), rather than structured state inputs. It lies at the intersection of reinforcement learning and computer vision, with applications in robotics, embodied AI, games, and interactive environments.
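The defining trait described above — a policy acting on raw pixels rather than a structured state vector — can be made concrete with a toy sketch. This is an illustrative example, not code from the survey; the stub environment, observation shape, and random policy are all assumptions standing in for a real environment and a learned network:

```python
import numpy as np

def random_policy(pixels: np.ndarray, n_actions: int) -> int:
    """Stand-in for a learned policy network mapping an HxWxC image to an action."""
    assert pixels.ndim == 3  # visual RL: the observation is an image, not a state vector
    return int(np.random.randint(n_actions))

class StubEnv:
    """Hypothetical environment that emits pixel observations (e.g., game frames)."""
    n_actions = 4

    def reset(self) -> np.ndarray:
        return np.zeros((64, 64, 3), dtype=np.uint8)

    def step(self, action: int):
        obs = np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8)
        reward, done = 1.0, False
        return obs, reward, done

# A minimal interaction loop: observe pixels, act, collect reward.
env = StubEnv()
obs = env.reset()
total = 0.0
for _ in range(10):
    action = random_policy(obs, env.n_actions)
    obs, reward, done = env.step(action)
    total += reward
print(total)  # 10.0
```

In practice the random policy is replaced by a vision encoder plus policy head trained with an RL objective (e.g., PPO or GRPO), which is exactly the design space the papers in this list explore.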

📌 Project Description

Awesome-Visual-Reinforcement-Learning is a curated list of papers, libraries, and resources on learning control policies from visual input. It aims to help researchers and practitioners navigate the fast-evolving Visual RL landscape — from perception and representation learning to policy learning and real-world applications.

We structure this collection along a trajectory of visual RL. The taxonomy chart below groups existing work by high-level domain (MLLMs, visual generation, unified models, and vision-language-action agents), and then by finer-grained tasks, illustrating representative papers for each branch:

Taxonomy chart

📚 Table of Contents

Libraries and tools

Benchmarks, environments, and datasets for Visual RL

MLLM

Multi-Agent

Multi-Modal Large Language Models with RL

Conventional RL-based Frameworks for MLLM

Definition: We refer to conventional RL-based MLLMs as approaches that apply reinforcement learning primarily to align a vision–language backbone with verifiable, task-level rewards, without explicitly modeling multi-step chain-of-thought reasoning.

Spatial & 3D Perception with RL for MLLMs

Definition: Perception-centric work applies RL to sharpen object detection, segmentation, and grounding without engaging in lengthy chain-of-thought reasoning.

Image Reasoning with RL for MLLMs

Think about Image
Think with Image

Definition: Thinking with Images elevates the picture to an active, external workspace: models iteratively generate, crop, highlight, sketch, or insert explicit visual annotations as tokens in their chain-of-thought, thereby aligning linguistic logic with grounded visual evidence.

Video Reasoning with RL for MLLMs

Visual Generation with RL

Definition: Studies RL agents that generate or manipulate visual content to achieve goals or enable creative visual tasks.

Image Generation

Image Editing

Video Generation

3D Generation

RL for Unified Model

Unified RL

Task Specific RL

Vision Language Action Models with RL

GUI Interaction

Visual Navigation

Visual Manipulation

Discussion: Why RL Works for Vision

RL versus SFT: Why Do We Need RL?

Others

Pretrain with RL

Representation Learning

Audio Question Answering with RL

RL for Medical Reasoning

Visual World Models with RL

Definition: Learn predictive models of environment dynamics from visual inputs to enable planning and long-horizon reasoning in RL.

Blog

Learning Course

Other Survey

Acknowledgements

This template is provided by Awesome-Video-Diffusion. Our approach also builds upon numerous contributions from prior resources, such as Awesome Visual RL.

Contribute!

🔥 This project is actively maintained, and we welcome your contributions. If you have any suggestions, such as missing papers or information, please feel free to open an issue or submit a pull request.

🤖 Try our Awesome-Paper-Agent. Just provide an arXiv URL, and it will automatically return formatted information, like this:

User:
https://arxiv.org/abs/2312.13108

GPT:
+ [AssistGUI: Task-Oriented Desktop Graphical User Interface Automation](https://arxiv.org/abs/2312.13108) (Dec. 2023)

  [![Star](https://img.shields.io/github/stars/showlab/assistgui.svg?style=social&label=Star)](https://github.com/showlab/assistgui)
  [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2312.13108)
  [![Website](https://img.shields.io/badge/Website-9cf)](https://showlab.github.io/assistgui/)

You can then easily copy and paste this information into your pull request.

⭐ If you find this repository useful, please give it a star.

⭐ Citation

If you have any suggestions (missing papers, new papers, or typos), please feel free to edit and submit a pull request. Even just suggesting paper titles is a great contribution — you can also open an issue or contact us via email (weijiawu96@gmail.com).

If you find our survey and this repository useful for your research, please consider citing our work:

@article{wu2025reinforcement,
  title={Reinforcement Learning in Vision: A Survey},
  author={Wu, Weijia and Gao, Chen and Chen, Joya and Lin, Kevin Qinghong and Meng, Qingwei and Zhang, Yiming and Qiu, Yuke and Zhou, Hong and Shou, Mike Zheng},
  journal={arXiv preprint arXiv:2508.08189},
  year={2025}
}

Star History

Star History Chart
