✨✨✨ We are organizing a CVPR 2026 Workshop on Any-to-Any Multimodal Learning. Submissions are welcome!
Traditional generative models are typically designed for a fixed input–output modality pair (e.g., text-to-image or image-to-text). However, real-world multimodal intelligence requires the ability to flexibly generate across arbitrary modality combinations, including multi-input and multi-output settings.
This repository aims to systematize Any-to-Any Multimodal Intelligence: models that accept inputs from arbitrary modalities and produce outputs in arbitrary modalities within a unified framework.
A model/system is considered Any-to-Any if it satisfies at least one of the following:
- Supports arbitrary combinations of input modalities and output modalities within a single unified framework;
- Enables multi-input and/or multi-output generation without task-specific retraining;
- Relies on a modality-agnostic intermediate representation (e.g., shared latent space, discrete tokens, structured programs);
- Demonstrates compositional generalization to unseen modality mappings.
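One common realization of the third criterion — a modality-agnostic intermediate representation via discrete tokens — is a shared vocabulary in which each modality owns a disjoint id range. The sketch below is a minimal toy of that idea; all names, offsets, and vocabulary sizes are illustrative assumptions, not taken from any particular paper.

```python
# Hedged sketch: every modality is mapped into one shared discrete-token
# vocabulary, so a single sequence model can consume and emit arbitrary
# modality combinations. Offsets and sizes below are invented for illustration.
from dataclasses import dataclass

@dataclass
class ModalityCodec:
    """Maps a modality's local token ids to/from a reserved slice of a
    shared vocabulary."""
    name: str
    offset: int  # start of this modality's id range in the shared vocab
    size: int    # number of ids reserved for this modality

    def encode(self, local_ids):
        assert all(0 <= i < self.size for i in local_ids)
        return [self.offset + i for i in local_ids]

    def decode(self, shared_ids):
        return [i - self.offset for i in shared_ids]

# Disjoint id ranges per modality inside one vocabulary.
TEXT  = ModalityCodec("text",  offset=0,      size=50_000)
IMAGE = ModalityCodec("image", offset=50_000, size=8_192)
AUDIO = ModalityCodec("audio", offset=58_192, size=4_096)

def to_sequence(segments):
    """Interleave (codec, local_ids) segments into one shared-vocab sequence."""
    seq = []
    for codec, ids in segments:
        seq.extend(codec.encode(ids))
    return seq

# A text+image prompt lives in the same id space as any output modality,
# so one autoregressive model could in principle learn any mapping.
prompt = to_sequence([(TEXT, [5, 17]), (IMAGE, [3, 3, 9])])
```

Because inputs and outputs share one id space, "any-to-any" reduces to ordinary next-token prediction over interleaved sequences.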
- UniM: A Unified Any-to-Any Interleaved Multimodal Benchmark
  📄🎨🔊🎶🧊...
- WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs
- UniG2U-Bench: Do Unified Models Advance Multimodal Understanding?
- Uni-MMMU: A Massive Multi-discipline Multimodal Unified Benchmark
- RealUnify: Do Unified Models Truly Benefit from Unification? A Comprehensive Benchmark
- MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models
  📄🎨
- Interleaved Scene Graphs for Interleaved Text-and-Image Generation Assessment
  📄🎨
- OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation
  📄🎨
- A High-Quality Dataset and Reliable Evaluation for Interleaved Image-Text Generation
  📄🎨
- InterleavedBench: Holistic Evaluation for Interleaved Text-and-Image Generation
  📄🎨
- OmniBench: Towards The Future of Universal Omni-Language Models
  📄🎨🔊🎶
- Daily-Omni: Towards Audio-Visual Reasoning with Temporal Alignment across Modalities
  📄🎬🔊
- AV-Reasoner: Improving and Benchmarking Clue-Grounded Audio-Visual Counting for MLLMs
  📄🎬🔊
Any-to-Any generation refers to unified systems that can take inputs from multiple modalities (e.g., text/image/video/audio) and produce outputs in multiple modalities within a single framework.
- UniM: A Unified Any-to-Any Interleaved Multimodal Benchmark
  🏷️:Agentic System|📄🎨🔊🎶🧊...
- OmniGAIA: Towards Native Omni-Modal AI Agents
  🏷️:AR|📄🎨🎤🎬
- Beyond Language Modeling: An Exploration of Multimodal Pretraining
  🏷️:AR|📄🎨
- AR-Omni: A Unified Autoregressive Model for Any-to-Any Generation
  🏷️:llm|AR|📄🎨🎤
- STAR: STacked AutoRegressive Scheme for Unified Multimodal Learning
  🏷️:llm|diffusion|📄🎨
- Symbolic Representation for Any-to-Any Generative Tasks
  🏷️:llm|diffusion|📄🎬🎨🧊
- Easy, fast, and cheap omni-modality model serving for everyone
  🏷️:mllm|Talker|📄🎬🎨🔊
- AToken: A Unified Tokenizer for Vision
  🏷️:*Unified Vision tokenizer|🎬🎨🧊
- Uni-MoE-2.0-Omni: Scaling Language-Centric Omnimodal Large Model with Advanced MoE, Training and Data
  🏷️:*llm|moe|📄🎬🎨🔊🎤
- OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM
  🏷️:*llm|MMDiT|📄🎬🎨🔊🎤
- Ming-Omni: A Unified Multimodal Model for Perception and Generation
  🏷️:*Ling|flow|📄🎬🎨🔊🎤
- Qwen2.5-Omni Technical Report
  🏷️:*llm|flow|📄🎬🎨🔊🎶🎤
- Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities
  🏷️:*mllm|📄🎬🎨🎶🎤
- MiniCPM-o 2.6: A GPT-4o Level MLLM for Vision, Speech, and Multimodal Live Streaming on Your Phone
  🏷️:*mllm|📄🎬🎨🎤
- Baichuan-Omni-1.5 Technical Report
  🏷️:*mllm|📄🎬🎨🎤
- Show-o2: Improved Native Unified Multimodal Models
  🏷️:llm|flow|📄🎬🎨
- Baichuan-Omni-1.5 Technical Report
  🏷️:llm|audio decoder|📄🎬🎨🎤
- Show-o: One Single Transformer to Unify Multimodal Understanding and Generation
  🏷️:llm|diffusion|📄🎬🎨
- CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation
  🏷️:llm|diffusion|📄🎬🎨🔊
- Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action
  🏷️:transformer encoder-decoder|📄🎬🎨🔊🤖
- C3Net: Compound Conditioned ControlNet for Multimodal Content Generation
  🏷️:diffusion|📄🎨🔊
- Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners
  🏷️:diffusion|🎬🎨🔊
- ModaVerse: Efficiently Transforming Modalities with LLMs
  🏷️:llm|diffusion|📄🎬🎨🔊
- NExT-GPT: Any-to-Any Multimodal LLM
  🏷️:llm|diffusion|📄🎬🎨🔊
- AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling
  🏷️:llm|tokenizer|📄🎨🎶🎤
- Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks
  🏷️:transformer encoder-decoder|📄🎨
- 4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities
  🏷️:masked modeling|transformer encoder-decoder|📄🎨
- ComfyBench: Benchmarking LLM-based Agents in ComfyUI for Autonomously Designing Collaborative AI Systems
  🏷️:agent|📄🎬🎨
- X-VILA: Cross-Modality Alignment for Large Language Model
  🏷️:llm|diffusion|📄🎬🎨🔊
- C3LLM: Conditional Multimodal Content Generation Using Large Language Models
  🏷️:transformer encoder-decoder|📄🎬🔊
- M2UGen: Multi-modal Music Understanding and Generation with the Power of Large Language Models
  🏷️:llm|📄🎬🎨🔊🎶
- TEAL: Tokenize and Embed ALL for Multi-modal Large Language Models
  🏷️:mllm|📄🎬🎨🔊🎤
- MuMu-LLaMA: Multi-modal Music Understanding and Generation via Large Language Models
  🏷️:mllm|📄🎬🎨🔊🎶
- Visual Echoes: A Simple Unified Transformer for Audio-Visual Generation
  🏷️:transformer encoder-decoder|📄🎨🔊
- AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head
  🏷️:llm|📄🎬🔊🎤🎶
- HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face
  🏷️:llm|📄🎬🎨🔊🎤
- CoDi: Any-to-Any Generation via Composable Diffusion
  🏷️:diffusion|📄🎬🎨🔊
- 4M: Massively Multimodal Masked Modeling
  🏷️:masked modeling|transformer encoder-decoder|📄🎨
Any-to-X methods accept flexible inputs (potentially multi-modal, such as text + image + audio) but generate a single target modality. This setting is often practically useful (e.g., “any condition → text report”, “any condition → image synthesis”, “any condition → video generation”), and it highlights how systems fuse heterogeneous conditions and maintain faithfulness to each input. Compared to fully general Any-to-Any systems, Any-to-X typically has a simpler decoding interface, but still demands strong cross-modal alignment and robust conditioning mechanisms.
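The condition-fusion step described above can be sketched as a weighted pooling of per-modality condition embeddings into a single vector that one target-modality decoder consumes. Everything below is a hypothetical toy (names and the pooling rule are assumptions), not any cited system's implementation.

```python
# Hypothetical Any-to-X fusion sketch: a variable set of per-modality
# condition embeddings is pooled into one conditioning vector.

def fuse_conditions(embeddings, weights=None):
    """Weighted average of same-dimensional modality embeddings.

    embeddings: dict of modality name -> embedding vector (list of floats)
    weights:    optional per-modality reliability weights (default 1.0)
    """
    if not embeddings:
        raise ValueError("need at least one condition modality")
    dim = len(next(iter(embeddings.values())))
    fused, total = [0.0] * dim, 0.0
    for name, vec in embeddings.items():
        assert len(vec) == dim, "conditions must share one embedding space"
        w = 1.0 if weights is None else weights.get(name, 1.0)
        fused = [f + w * x for f, x in zip(fused, vec)]
        total += w
    return [f / total for f in fused]

# Any subset of modalities can condition the same decoder; dropping a
# modality simply removes its term from the average.
cond = fuse_conditions({"text": [1.0, 0.0], "audio": [0.0, 1.0]})
```

Real systems typically replace the average with cross-attention over condition tokens, but the interface — flexible input set, single fused conditioning signal — is the same.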
Any-to-Text focuses on producing textual outputs (captioning, explanation, dialogue, reasoning traces, instruction-following) from arbitrary visual/audio/3D/video inputs.
- OmniSIFT: Modality-Asymmetric Token Compression for Efficient Omni-modal Large Language Models
  🏷️:Qwen2.5-Omni|📄🎬🎨🔊
- AnyCap Project: A Unified Framework, Dataset, and Benchmark for Controllable Omni-modal Captioning
  🏷️:Qwen2.5-VL|📄🎬🎨🔊
- HumanOmniV2: From Understanding to Omni-Modal Reasoning with Context
  🏷️:Qwen2.5-Omni-7B-thinker|📄🎬🎤
- Omni-R1: Reinforcement Learning for Omnimodal Reasoning via Two-System Collaboration
  🏷️:Qwen2.5-Omni-7B|📄🎬🎨🔊🎤
- Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs
  🏷️:mllm|📄🎨🎤
- A Reason-then-Describe Instruction Interpreter for Controllable Video Generation
  🏷️:Qwen2.5-Omni-7B|📄🎬🎨🎥🏃🏻🔊🎤
- OmniZip: Audio-Guided Dynamic Token Compression for Fast Omnimodal Large Language Models
  🏷️:Qwen2.5-Omni-3/7B|📄🎬🔊🎤
- EchoingPixels: Cross-Modal Adaptive Token Reduction for Efficient Audio-Visual LLMs
  🏷️:Qwen2.5-Omni-3/7B|📄🎬🔊🎤
- Ola: Pushing the Frontiers of Omni-Modal Language Model
  🏷️:Qwen-2.5-7B|📄🎬🎨🔊🎤
- Daily-Omni: Towards Audio-Visual Reasoning with Temporal Alignment across Modalities
  🏷️:Qwen2.5-VL/Qwen2-Audio|📄🎬🎨🔊🎤
- Any2Caption: Interpreting Any Condition to Caption for Controllable Video Generation
  🏷️:Qwen2.5-VL|📄🎬🎨🎥🏃🏻
- InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
  🏷️:mllm|📄🎬🎨
- InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
  🏷️:mllm|📄🎬🎨
- Emu: Generative Pretraining in Multimodality
  🏷️:mllm|📄🎬🎨
- Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts
  🏷️:llm|moe|📄🎬🎨🔊🎤
- X-InstructBLIP: A Framework for Aligning Image, 3D, Audio, Video to LLMs and its Emergent Cross-modal Reasoning
  🏷️:llm|📄🎬🎨🔊🧊
- What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?
  🏷️:llm|📄🎬🎨
- Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
  🏷️:llm|📄🎬🎨🔊
- BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs
  🏷️:llm|📄🎨🔊
- AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model
  🏷️:llm|📄🎬🎨🔊
- X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages
  🏷️:llm|📄🎬🎨🔊
- ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities
  🏷️:llm|modality alignment|📄🎬🎨🔊
Any-to-Image methods generate images conditioned on diverse inputs beyond text, such as images, sketches, poses, layouts, audio cues, or multi-modal prompts.
- Ctrl-Adapter: An Efficient and Versatile Framework for Adapting Diverse Controls to Any Diffusion Model
- Any2AnyTryon: Leveraging Adaptive Position Embeddings for Versatile Virtual Clothing Tasks
Any-to-Video targets video generation from flexible conditions (text/image/video/audio/trajectory/layout).
- OmniWeaving: Towards Unified Video Generation with Free-form Composition and Reasoning
  Details: Text / First-frame / Key-frame / Video / Reference / Compositional Multi-image / Text-Image / Reasoning-augmented to video generation
- Unified multimodal video-audio joint generation framework
  Details: Seedance 2.0 significantly enhances its multimodal processing capabilities, supporting highly diverse and flexible mixed-modality inputs. Users can simultaneously provide up to nine images, three video clips, and three audio segments, along with natural-language instructions. This design lets the model draw on multiple reference sources within a single creative task, rather than being limited to a single image or text prompt.
- VideoPoet: A Large Language Model for Zero-Shot Video Generation
- VideoComposer: Compositional Video Synthesis with Motion Controllability
X-to-Any methods start from a fixed input modality but aim to generate multiple output modalities (e.g., text → image/video/audio; image → text/video/audio). This setting is useful for studying whether a model learns a shared multimodal representation that can be decoded into different modalities. Compared to Any-to-X, the emphasis is on multi-head decoding and output diversity, often requiring modality-specific decoders while sharing a common backbone or latent space.
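The shared-backbone, multi-head decoding pattern described above can be sketched as follows. The backbone and decoders are toy stand-ins invented for illustration, not a real model.

```python
# Hypothetical X-to-Any skeleton: one shared backbone encodes the fixed
# input modality into a latent z, and modality-specific heads decode z
# into each requested output modality.

def backbone(x):
    # Stand-in for a real encoder: summarize the input as one latent scalar.
    return sum(x) / len(x)

DECODERS = {
    "text":  lambda z: f"caption(z={z:.2f})",   # toy captioning head
    "image": lambda z: [z] * 4,                 # toy 2x2 "image"
    "audio": lambda z: [z * 0.5] * 8,           # toy waveform
}

def x_to_any(x, targets):
    """Decode one shared latent into every requested output modality."""
    z = backbone(x)
    return {t: DECODERS[t](z) for t in targets}

# One forward pass through the backbone, several decoder heads.
out = x_to_any([1.0, 3.0], targets=["text", "image"])
```

The design choice this illustrates: the encoder is run once and its latent is reused by every head, so adding an output modality means adding a decoder, not retraining the backbone.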
Text-to-Any expands classic text-to-image into text-conditioned generation across multiple modalities, such as video, audio, music, speech, and even structured outputs. Typical solutions include unified diffusion/flow backbones, discrete token modeling, or LLM-centered generation that routes to modality experts.
- Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers
  🏷️:Diffusion|🎬🎨🔊🎶🎤
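The "LLM routes to modality experts" pattern mentioned above can be sketched with a toy keyword router standing in for the LLM's routing decision; every expert name and heuristic here is invented for illustration.

```python
# Toy sketch of text-to-any via routing: parse the requested target
# modality from the instruction, then dispatch to a per-modality expert.

EXPERTS = {
    "image": lambda prompt: f"[image for: {prompt}]",
    "video": lambda prompt: f"[video for: {prompt}]",
    "audio": lambda prompt: f"[audio for: {prompt}]",
}

KEYWORDS = {
    "image": ("draw", "picture", "image", "photo"),
    "video": ("video", "clip", "animate"),
    "audio": ("sound", "audio", "music", "speech"),
}

def route(instruction):
    """Pick a target modality; a real system would let the LLM decide."""
    low = instruction.lower()
    for modality, words in KEYWORDS.items():
        if any(w in low for w in words):
            return modality
    return "image"  # arbitrary default target

def text_to_any(instruction):
    modality = route(instruction)
    return modality, EXPERTS[modality](instruction)

modality, output = text_to_any("animate a cat chasing a laser")
```

In LLM-centered systems the router is the language model itself (emitting a tool call or a special token), and the experts are full generative models rather than string formatters.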
Image-to-Any aims to generate other modalities from visual input, such as image → text (captioning/VQA), image → video (animation), image → audio (foley/sound), or image → 3D (reconstruction). The main technical challenge is learning mappings from static visual cues to modalities with missing dimensions (e.g., time, sound source, geometry), which often requires strong priors, world knowledge, or intermediate structured representations.
- The Platonic Representation Hypothesis
  Details: Neural networks, trained with different objectives on different data and modalities, are converging to a shared statistical model of reality in their representation spaces. Conventionally, different AI systems represent the world in different ways: a vision system might represent shapes and colors, while a language model focuses on syntax and semantics. In recent years, however, the architectures and objectives for modeling images, text, and many other signals have become remarkably alike. Are the internal representations in these systems also converging?
- LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
  🏷️:Binding modalities with languages|📄🎨🎬🔊
- Meta-Transformer: A Unified Framework for Multimodal Learning
  🏷️:Binding modalities with unified representations|📄🎨🔊🧊
- ImageBind: One Embedding Space To Bind Them All
  🏷️:Binding modalities with images|📄🎨🎬🔊
A multimodal variational autoencoder (multimodal VAE) is a deep generative model designed to learn a shared latent representation from multiple data modalities, such as images, text, audio, or video, within a unified probabilistic framework. Unlike standard VAEs that model a single data distribution, multimodal VAEs aim to model the joint distribution over multiple modalities. In a typical multimodal VAE, each modality has its own encoder, while a shared latent space is used to generate all modalities through modality-specific decoders. This shared latent representation enables the model to capture cross-modal correlations and supports joint generation, cross-modal translation, and missing-modality inference. See MAE.md.
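The closed-form posterior combination used by product-of-experts multimodal VAEs (MVAE-style models) can be written down exactly for Gaussian experts. The sketch below is a 1-D toy under that assumption; real models learn the per-modality encoders that produce each expert's mean and variance.

```python
# 1-D toy of the product-of-experts posterior: each observed modality
# contributes a Gaussian expert (mu, var) over the shared latent, and
# experts combine in closed form by precision weighting. A missing
# modality is simply absent from the expert list.

def product_of_gaussians(experts):
    """Combine Gaussian experts (mu, var) plus a unit prior N(0, 1)."""
    mus, variances = [0.0], [1.0]  # the prior acts as one more expert
    for mu, var in experts:
        mus.append(mu)
        variances.append(var)
    precisions = [1.0 / v for v in variances]
    total = sum(precisions)
    joint_mu = sum(p * m for p, m in zip(precisions, mus)) / total
    return joint_mu, 1.0 / total

# With both modalities observed the posterior tightens around the data;
# with one modality missing it falls back toward the prior.
both = product_of_gaussians([(2.0, 1.0), (2.0, 1.0)])
one  = product_of_gaussians([(2.0, 1.0)])
```

This is exactly what makes missing-modality inference natural in such models: dropping an encoder just removes one factor from the product, and the posterior degrades gracefully toward the prior.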
- Versatile Diffusion: Text, Images and Variations All in One Diffusion Model
- Multimodal Latent Language Modeling with Next-Token Diffusion
- Deep Generative Clustering with Multimodal Diffusion Variational Autoencoders
- MMVAE+: Enhancing the Generative Quality of Multimodal VAEs without Compromises
- Private-Shared Disentangled Multimodal VAE for Learning of Hybrid Latent Representations
- Variational Mixture-of-Experts Autoencoders for Multi-Modal Deep Generative Models
- Multimodal Generative Models for Scalable Weakly-Supervised Learning
Unified multimodal understanding and generation models: Advances, challenges, and opportunities
-
A Survey of Unified Multimodal Understanding and Generation: Advances and Challenges
-
On path to multimodal generalist: General-level and general-bench
-
MM-LLMs: Recent Advances in MultiModal Large Language Models
-
Brain-Conditional Multimodal Synthesis: A Survey and Taxonomy
-
Multimodal Foundation Models: From Specialists to General-Purpose Assistants
-
- general AI methods for Anything
  A curated list of general AI methods for Anything: AnyObject, AnyGeneration, AnyModel, AnyTask, etc.
