✨✨✨ We are organizing a CVPR 2026 Workshop on Any-to-Any Multimodal Learning. Submissions are welcome!
Traditional generative models are typically designed for a fixed input–output modality pair (e.g., text-to-image or image-to-text). However, real-world multimodal intelligence requires the ability to flexibly generate across arbitrary modality combinations, including multi-input and multi-output settings.
This repository aims to systematize Any-to-Any Multimodal Intelligence: models that accept inputs from arbitrary modalities and produce outputs in arbitrary modalities within a unified framework.
A model/system is considered Any-to-Any if it satisfies at least one of the following:
- Supports arbitrary combinations of input modalities and output modalities within a single unified framework;
- Enables multi-input and/or multi-output generation without task-specific retraining;
- Relies on a modality-agnostic intermediate representation (e.g., shared latent space, discrete tokens, structured programs);
- Demonstrates compositional generalization to unseen modality mappings.
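One common realization of the third criterion — a modality-agnostic intermediate representation via discrete tokens — is a shared vocabulary in which each modality owns a disjoint id range. The sketch below is a minimal toy of that idea; all names, offsets, and vocabulary sizes are illustrative assumptions, not taken from any particular paper.

```python
# Hedged sketch: every modality is mapped into one shared discrete-token
# vocabulary, so a single sequence model can consume and emit arbitrary
# modality combinations. Offsets and sizes below are invented for illustration.
from dataclasses import dataclass

@dataclass
class ModalityCodec:
    """Maps a modality's local token ids to/from a reserved slice of a
    shared vocabulary."""
    name: str
    offset: int  # start of this modality's id range in the shared vocab
    size: int    # number of ids reserved for this modality

    def encode(self, local_ids):
        assert all(0 <= i < self.size for i in local_ids)
        return [self.offset + i for i in local_ids]

    def decode(self, shared_ids):
        return [i - self.offset for i in shared_ids]

# Disjoint id ranges per modality inside one vocabulary.
TEXT  = ModalityCodec("text",  offset=0,      size=50_000)
IMAGE = ModalityCodec("image", offset=50_000, size=8_192)
AUDIO = ModalityCodec("audio", offset=58_192, size=4_096)

def to_sequence(segments):
    """Interleave (codec, local_ids) segments into one shared-vocab sequence."""
    seq = []
    for codec, ids in segments:
        seq.extend(codec.encode(ids))
    return seq

# A text+image prompt lives in the same id space as any output modality,
# so one autoregressive model could in principle learn any mapping.
prompt = to_sequence([(TEXT, [5, 17]), (IMAGE, [3, 3, 9])])
```

Because inputs and outputs share one id space, "any-to-any" reduces to ordinary next-token prediction over interleaved sequences.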
- UniM: A Unified Any-to-Any Interleaved Multimodal Benchmark
  📄🎨🔊🎶🧊...
- WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs
- UniG2U-Bench: Do Unified Models Advance Multimodal Understanding?
- Uni-MMMU: A Massive Multi-discipline Multimodal Unified Benchmark
- RealUnify: Do Unified Models Truly Benefit from Unification? A Comprehensive Benchmark
- MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models
  📄🎨
- Interleaved Scene Graphs for Interleaved Text-and-Image Generation Assessment
  📄🎨
- OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation
  📄🎨
- A High-Quality Dataset and Reliable Evaluation for Interleaved Image-Text Generation
  📄🎨
- InterleavedBench: Holistic Evaluation for Interleaved Text-and-Image Generation
  📄🎨
- OmniBench: Towards The Future of Universal Omni-Language Models
  📄🎨🔊🎶
- Daily-Omni: Towards Audio-Visual Reasoning with Temporal Alignment across Modalities
  📄🎬🔊
- AV-Reasoner: Improving and Benchmarking Clue-Grounded Audio-Visual Counting for MLLMs
  📄🎬🔊
Any-to-Any generation refers to unified systems that can take inputs from multiple modalities (e.g., text/image/video/audio) and produce outputs in multiple modalities within a single framework.
- UniM: A Unified Any-to-Any Interleaved Multimodal Benchmark
  🏷️:Agentic System|📄🎨🔊🎶🧊...
- OmniGAIA: Towards Native Omni-Modal AI Agents
  🏷️:AR|📄🎨🎤🎬
- Beyond Language Modeling: An Exploration of Multimodal Pretraining
  🏷️:AR|📄🎨
- AR-Omni: A Unified Autoregressive Model for Any-to-Any Generation
  🏷️:llm|AR|📄🎨🎤
- STAR: STacked AutoRegressive Scheme for Unified Multimodal Learning
  🏷️:llm|diffusion|📄🎨
- Symbolic Representation for Any-to-Any Generative Tasks
  🏷️:llm|diffusion|📄🎬🎨🧊
- Easy, fast, and cheap omni-modality model serving for everyone
  🏷️:mllm|Talker|📄🎬🎨🔊
- AToken: A Unified Tokenizer for Vision
  🏷️:*Unified Vision tokenizer|🎬🎨🧊
- Uni-MoE-2.0-Omni: Scaling Language-Centric Omnimodal Large Model with Advanced MoE, Training and Data
  🏷️:*llm|moe|📄🎬🎨🔊🎤
- OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM
  🏷️:*llm|MMDiT|📄🎬🎨🔊🎤
- Ming-Omni: A Unified Multimodal Model for Perception and Generation
  🏷️:*Ling|flow|📄🎬🎨🔊🎤
- Qwen2.5-Omni Technical Report
  🏷️:*llm|flow|📄🎬🎨🔊🎶🎤
- Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities
  🏷️:*mllm|📄🎬🎨🎶🎤
- MiniCPM-o 2.6: A GPT-4o Level MLLM for Vision, Speech, and Multimodal Live Streaming on Your Phone
  🏷️:*mllm|📄🎬🎨🎤
- Baichuan-Omni-1.5 Technical Report
  🏷️:*mllm|📄🎬🎨🎤
- Show-o2: Improved Native Unified Multimodal Models
  🏷️:llm|flow|📄🎬🎨
- Baichuan-Omni-1.5 Technical Report
  🏷️:llm|audio decoder|📄🎬🎨🎤
- Show-o: One Single Transformer to Unify Multimodal Understanding and Generation
  🏷️:llm|diffusion|📄🎬🎨
- CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation
  🏷️:llm|diffusion|📄🎬🎨🔊
- Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action
  🏷️:transformer encoder-decoder|📄🎬🎨🔊🤖
- C3Net: Compound Conditioned ControlNet for Multimodal Content Generation
  🏷️:diffusion|📄🎨🔊
- Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners
  🏷️:diffusion|🎬🎨🔊
- ModaVerse: Efficiently Transforming Modalities with LLMs
  🏷️:llm|diffusion|📄🎬🎨🔊
- NExT-GPT: Any-to-Any Multimodal LLM
  🏷️:llm|diffusion|📄🎬🎨🔊
- AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling
  🏷️:llm|tokenizer|📄🎨🎶🎤
- Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks
  🏷️:transformer encoder-decoder|📄🎨
- 4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities
  🏷️:masked modeling|transformer encoder-decoder|📄🎨
- ComfyBench: Benchmarking LLM-based Agents in ComfyUI for Autonomously Designing Collaborative AI Systems
  🏷️:agent|📄🎬🎨
- X-VILA: Cross-Modality Alignment for Large Language Model
  🏷️:llm|diffusion|📄🎬🎨🔊
- C3LLM: Conditional Multimodal Content Generation Using Large Language Models
  🏷️:transformer encoder-decoder|📄🎬🔊
- M2UGen: Multi-modal Music Understanding and Generation with the Power of Large Language Models
  🏷️:llm|📄🎬🎨🔊🎶
- TEAL: Tokenize and Embed ALL for Multi-modal Large Language Models
  🏷️:mllm|📄🎬🎨🔊🎤
- MuMu-LLaMA: Multi-modal Music Understanding and Generation via Large Language Models
  🏷️:mllm|📄🎬🎨🔊🎶
- Visual Echoes: A Simple Unified Transformer for Audio-Visual Generation
  🏷️:transformer encoder-decoder|📄🎨🔊
- AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head
  🏷️:llm|📄🎬🔊🎤🎶
- HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face
  🏷️:llm|📄🎬🎨🔊🎤
- CoDi: Any-to-Any Generation via Composable Diffusion
  🏷️:diffusion|📄🎬🎨🔊
- 4M: Massively Multimodal Masked Modeling
  🏷️:masked modeling|transformer encoder-decoder|📄🎨
Any-to-X methods accept flexible inputs (potentially multi-modal, such as text + image + audio) but generate a single target modality. This setting is often practically useful (e.g., “any condition → text report”, “any condition → image synthesis”, “any condition → video generation”), and it highlights how systems fuse heterogeneous conditions and maintain faithfulness to each input. Compared to fully general Any-to-Any systems, Any-to-X typically has a simpler decoding interface, but still demands strong cross-modal alignment and robust conditioning mechanisms.
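The condition-fusion step described above can be sketched as a weighted pooling of per-modality condition embeddings into a single vector that one target-modality decoder consumes. Everything below is a hypothetical toy (names and the pooling rule are assumptions), not any cited system's implementation.

```python
# Hypothetical Any-to-X fusion sketch: a variable set of per-modality
# condition embeddings is pooled into one conditioning vector.

def fuse_conditions(embeddings, weights=None):
    """Weighted average of same-dimensional modality embeddings.

    embeddings: dict of modality name -> embedding vector (list of floats)
    weights:    optional per-modality reliability weights (default 1.0)
    """
    if not embeddings:
        raise ValueError("need at least one condition modality")
    dim = len(next(iter(embeddings.values())))
    fused, total = [0.0] * dim, 0.0
    for name, vec in embeddings.items():
        assert len(vec) == dim, "conditions must share one embedding space"
        w = 1.0 if weights is None else weights.get(name, 1.0)
        fused = [f + w * x for f, x in zip(fused, vec)]
        total += w
    return [f / total for f in fused]

# Any subset of modalities can condition the same decoder; dropping a
# modality simply removes its term from the average.
cond = fuse_conditions({"text": [1.0, 0.0], "audio": [0.0, 1.0]})
```

Real systems typically replace the average with cross-attention over condition tokens, but the interface — flexible input set, single fused conditioning signal — is the same.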
Any-to-Text focuses on producing textual outputs (captioning, explanation, dialogue, reasoning traces, instruction-following) from arbitrary visual/audio/3D/video inputs.
- OmniSIFT: Modality-Asymmetric Token Compression for Efficient Omni-modal Large Language Models
  🏷️:Qwen2.5-Omni|📄🎬🎨🔊
- AnyCap Project: A Unified Framework, Dataset, and Benchmark for Controllable Omni-modal Captioning
  🏷️:Qwen2.5-VL|📄🎬🎨🔊
- HumanOmniV2: From Understanding to Omni-Modal Reasoning with Context
  🏷️:Qwen2.5-Omni-7B-thinker|📄🎬🎤
- Omni-R1: Reinforcement Learning for Omnimodal Reasoning via Two-System Collaboration
  🏷️:Qwen2.5-Omni-7B|📄🎬🎨🔊🎤
- Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs
  🏷️:mllm|📄🎨🎤
- A Reason-then-Describe Instruction Interpreter for Controllable Video Generation
  🏷️:Qwen2.5-Omni-7B|📄🎬🎨🎥🏃🏻🔊🎤
- OmniZip: Audio-Guided Dynamic Token Compression for Fast Omnimodal Large Language Models
  🏷️:Qwen2.5-Omni-3/7B|📄🎬🔊🎤
- EchoingPixels: Cross-Modal Adaptive Token Reduction for Efficient Audio-Visual LLMs
  🏷️:Qwen2.5-Omni-3/7B|📄🎬🔊🎤
- Ola: Pushing the Frontiers of Omni-Modal Language Model
  🏷️:Qwen-2.5-7B|📄🎬🎨🔊🎤
- Daily-Omni: Towards Audio-Visual Reasoning with Temporal Alignment across Modalities
  🏷️:Qwen2.5-VL/Qwen2-Audio|📄🎬🎨🔊🎤
- Any2Caption: Interpreting Any Condition to Caption for Controllable Video Generation
  🏷️:Qwen2.5-VL|📄🎬🎨🎥🏃🏻
- InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
  🏷️:mllm|📄🎬🎨
- InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
  🏷️:mllm|📄🎬🎨
- Emu: Generative Pretraining in Multimodality
  🏷️:mllm|📄🎬🎨
- Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts
  🏷️:llm|moe|📄🎬🎨🔊🎤
- X-InstructBLIP: A Framework for Aligning Image, 3D, Audio, Video to LLMs and its Emergent Cross-modal Reasoning
  🏷️:llm|📄🎬🎨🔊🧊
- What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?
  🏷️:llm|📄🎬🎨
- Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
  🏷️:llm|📄🎬🎨🔊
- BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs
  🏷️:llm|📄🎨🔊
- AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model
  🏷️:llm|📄🎬🎨🔊
- X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages
  🏷️:llm|📄🎬🎨🔊
- ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities
  🏷️:llm|modality alignment|📄🎬🎨🔊
Any-to-Image methods generate images conditioned on diverse inputs beyond text, such as images, sketches, poses, layouts, audio cues, or multi-modal prompts.
- Ctrl-Adapter: An Efficient and Versatile Framework for Adapting Diverse Controls to Any Diffusion Model
- Any2AnyTryon: Leveraging Adaptive Position Embeddings for Versatile Virtual Clothing Tasks
Any-to-Video targets video generation from flexible conditions (text/image/video/audio/trajectory/layout).
- OmniWeaving: Towards Unified Video Generation with Free-form Composition and Reasoning
  Details: Text / First-frame / Key-frame / Video / Reference / Compositional Multi-image / Text-Image / Reasoning-augmented to video generation
- Unified multimodal video-audio joint generation framework
  Details: Seedance 2.0 significantly enhances its multimodal processing capabilities, supporting highly diverse and flexible mixed-modality inputs. Users can simultaneously provide up to nine images, three video clips, and three audio segments, along with natural-language instructions. This design lets the model draw on multiple reference sources within a single creative task, rather than being limited to a single image or text prompt.
- VideoPoet: A Large Language Model for Zero-Shot Video Generation
- VideoComposer: Compositional Video Synthesis with Motion Controllability
X-to-Any methods start from a fixed input modality but aim to generate multiple output modalities (e.g., text → image/video/audio; image → text/video/audio). This setting is useful for studying whether a model learns a shared multimodal representation that can be decoded into different modalities. Compared to Any-to-X, the emphasis is on multi-head decoding and output diversity, often requiring modality-specific decoders while sharing a common backbone or latent space.
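The shared-backbone, multi-head decoding pattern described above can be sketched as follows. The backbone and decoders are toy stand-ins invented for illustration, not a real model.

```python
# Hypothetical X-to-Any skeleton: one shared backbone encodes the fixed
# input modality into a latent z, and modality-specific heads decode z
# into each requested output modality.

def backbone(x):
    # Stand-in for a real encoder: summarize the input as one latent scalar.
    return sum(x) / len(x)

DECODERS = {
    "text":  lambda z: f"caption(z={z:.2f})",   # toy captioning head
    "image": lambda z: [z] * 4,                 # toy 2x2 "image"
    "audio": lambda z: [z * 0.5] * 8,           # toy waveform
}

def x_to_any(x, targets):
    """Decode one shared latent into every requested output modality."""
    z = backbone(x)
    return {t: DECODERS[t](z) for t in targets}

# One forward pass through the backbone, several decoder heads.
out = x_to_any([1.0, 3.0], targets=["text", "image"])
```

The design choice this illustrates: the encoder is run once and its latent is reused by every head, so adding an output modality means adding a decoder, not retraining the backbone.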
Text-to-Any expands classic text-to-image into text-conditioned generation across multiple modalities, such as video, audio, music, speech, and even structured outputs. Typical solutions include unified diffusion/flow backbones, discrete token modeling, or LLM-centered generation that routes to modality experts.
- Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers
  🏷️:Diffusion|🎬🎨🔊🎶🎤
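The "LLM routes to modality experts" pattern mentioned above can be sketched with a toy keyword router standing in for the LLM's routing decision; every expert name and heuristic here is invented for illustration.

```python
# Toy sketch of text-to-any via routing: parse the requested target
# modality from the instruction, then dispatch to a per-modality expert.

EXPERTS = {
    "image": lambda prompt: f"[image for: {prompt}]",
    "video": lambda prompt: f"[video for: {prompt}]",
    "audio": lambda prompt: f"[audio for: {prompt}]",
}

KEYWORDS = {
    "image": ("draw", "picture", "image", "photo"),
    "video": ("video", "clip", "animate"),
    "audio": ("sound", "audio", "music", "speech"),
}

def route(instruction):
    """Pick a target modality; a real system would let the LLM decide."""
    low = instruction.lower()
    for modality, words in KEYWORDS.items():
        if any(w in low for w in words):
            return modality
    return "image"  # arbitrary default target

def text_to_any(instruction):
    modality = route(instruction)
    return modality, EXPERTS[modality](instruction)

modality, output = text_to_any("animate a cat chasing a laser")
```

In LLM-centered systems the router is the language model itself (emitting a tool call or a special token), and the experts are full generative models rather than string formatters.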
Image-to-Any aims to generate other modalities from visual input, such as image → text (captioning/VQA), image → video (animation), image → audio (foley/sound), or image → 3D (reconstruction). The main technical challenge is learning mappings from static visual cues to modalities with missing dimensions (e.g., time, sound source, geometry), which often requires strong priors, world knowledge, or intermediate structured representations.
- The Platonic Representation Hypothesis
  Details: Neural networks, trained with different objectives on different data and modalities, are converging to a shared statistical model of reality in their representation spaces. Conventionally, different AI systems represent the world in different ways: a vision system might represent shapes and colors, while a language model focuses on syntax and semantics. In recent years, however, the architectures and objectives for modeling images, text, and many other signals have become remarkably alike. Are the internal representations in these systems also converging?
- LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
  🏷️:Binding modalities with languages|📄🎨🎬🔊
- Meta-Transformer: A Unified Framework for Multimodal Learning
  🏷️:Binding modalities with unified representations|📄🎨🔊🧊
- ImageBind: One Embedding Space To Bind Them All
  🏷️:Binding modalities with images|📄🎨🎬🔊
A multimodal variational autoencoder (multimodal VAE) is a deep generative model designed to learn a shared latent representation from multiple data modalities, such as images, text, audio, or video, within a unified probabilistic framework. Unlike standard VAEs that model a single data distribution, multimodal VAEs aim to model the joint distribution over multiple modalities. In a typical multimodal VAE, each modality has its own encoder, while a shared latent space is used to generate all modalities through modality-specific decoders. This shared latent representation enables the model to capture cross-modal correlations and supports joint generation, cross-modal translation, and missing-modality inference. See MAE.md.
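The closed-form posterior combination used by product-of-experts multimodal VAEs (MVAE-style models) can be written down exactly for Gaussian experts. The sketch below is a 1-D toy under that assumption; real models learn the per-modality encoders that produce each expert's mean and variance.

```python
# 1-D toy of the product-of-experts posterior: each observed modality
# contributes a Gaussian expert (mu, var) over the shared latent, and
# experts combine in closed form by precision weighting. A missing
# modality is simply absent from the expert list.

def product_of_gaussians(experts):
    """Combine Gaussian experts (mu, var) plus a unit prior N(0, 1)."""
    mus, variances = [0.0], [1.0]  # the prior acts as one more expert
    for mu, var in experts:
        mus.append(mu)
        variances.append(var)
    precisions = [1.0 / v for v in variances]
    total = sum(precisions)
    joint_mu = sum(p * m for p, m in zip(precisions, mus)) / total
    return joint_mu, 1.0 / total

# With both modalities observed the posterior tightens around the data;
# with one modality missing it falls back toward the prior.
both = product_of_gaussians([(2.0, 1.0), (2.0, 1.0)])
one  = product_of_gaussians([(2.0, 1.0)])
```

This is exactly what makes missing-modality inference natural in such models: dropping an encoder just removes one factor from the product, and the posterior degrades gracefully toward the prior.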
- Versatile Diffusion: Text, Images and Variations All in One Diffusion Model
- Multimodal Latent Language Modeling with Next-Token Diffusion
- Deep Generative Clustering with Multimodal Diffusion Variational Autoencoders
- MMVAE+: Enhancing the Generative Quality of Multimodal VAEs without Compromises
- Private-Shared Disentangled Multimodal VAE for Learning of Hybrid Latent Representations
- Variational Mixture-of-Experts Autoencoders for Multi-Modal Deep Generative Models
- Multimodal Generative Models for Scalable Weakly-Supervised Learning
Unified multimodal understanding and generation models: Advances, challenges, and opportunities
-
A Survey of Unified Multimodal Understanding and Generation: Advances and Challenges
-
On path to multimodal generalist: General-level and general-bench
-
MM-LLMs: Recent Advances in MultiModal Large Language Models
-
Brain-Conditional Multimodal Synthesis: A Survey and Taxonomy
-
Multimodal Foundation Models: From Specialists to General-Purpose Assistants
-
- general AI methods for Anything
  A curated list of general AI methods for Anything: AnyObject, AnyGeneration, AnyModel, AnyTask, etc.
