Awesome-Any-to-Any-Generation

📣 News

✨✨✨ We are organizing a CVPR26 Workshop on Any-to-Any Multimodal Learning; submissions are welcome.

🎨 Introduction

Traditional generative models are typically designed for a fixed input–output modality pair (e.g., text-to-image or image-to-text). However, real-world multimodal intelligence requires the ability to flexibly generate across arbitrary modality combinations, including multi-input and multi-output settings.

This repository aims to systematize Any-to-Any Multimodal Intelligence, where models can accept inputs from arbitrary modalities and produce outputs in arbitrary modalities within a unified framework.


What qualifies as Any-to-Any Generation?

A model/system is considered Any-to-Any if it satisfies at least one of the following:

  1. Supports arbitrary combinations of input modalities and output modalities within a single unified framework;
  2. Enables multi-input and/or multi-output generation without task-specific retraining;
  3. Relies on a modality-agnostic intermediate representation (e.g., shared latent space, discrete tokens, structured programs), as illustrated by the sketch after this list;
  4. Demonstrates compositional generalization to unseen modality mappings.
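
To make criterion 3 concrete, here is a minimal sketch (our own illustration, not code from any paper collected below) of a modality-agnostic interface: every input modality is mapped into a shared token space, one backbone consumes the mixed token stream, and a modality-specific detokenizer produces the requested output. All class, function, and modality names are hypothetical placeholders.

```python
# Hypothetical sketch of an any-to-any interface built on a shared token space.
class AnyToAnySketch:
    def __init__(self, tokenizers: dict, detokenizers: dict, backbone):
        self.tokenizers = tokenizers      # e.g. {"text": ..., "image": ..., "audio": ...}
        self.detokenizers = detokenizers  # e.g. {"text": ..., "image": ..., "video": ...}
        self.backbone = backbone          # one sequence model over the joint vocabulary

    def generate(self, inputs: dict, target_modality: str):
        # 1) Map each input modality into the shared token space.
        tokens = []
        for modality, payload in inputs.items():
            tokens += self.tokenizers[modality](payload)
        # 2) A single backbone handles any combination of input tokens.
        latent = self.backbone(tokens)
        # 3) Decode into whichever output modality was requested.
        return self.detokenizers[target_modality](latent)
```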

📕 Table of Contents

🌷 Datasets

📃 Papers

Any-to-Any

Any-to-Any generation refers to unified systems that can take inputs from multiple modalities (e.g., text/image/video/audio) and produce outputs in multiple modalities within a single framework.

Any-to-X (output-centric)

Any-to-X methods accept flexible inputs (potentially multi-modal, such as text + image + audio) but generate a single target modality. This setting is often practically useful (e.g., “any condition → text report”, “any condition → image synthesis”, “any condition → video generation”), and it highlights how systems fuse heterogeneous conditions and maintain faithfulness to each input. Compared to fully general Any-to-Any systems, Any-to-X typically has a simpler decoding interface, but still demands strong cross-modal alignment and robust conditioning mechanisms.
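
As a rough illustration (our own sketch, not a method from any paper listed here), one common Any-to-X pattern encodes each available condition separately, fuses the embeddings into a joint condition sequence, and feeds a single decoder for the target modality. Every module name below is a hypothetical placeholder.

```python
# Hypothetical Any-to-X conditioning sketch: many condition encoders, one target decoder.
import torch
import torch.nn as nn

class AnyToXConditioner(nn.Module):
    def __init__(self, encoders: nn.ModuleDict, decoder: nn.Module, dim: int = 512):
        super().__init__()
        self.encoders = encoders  # one encoder per possible condition modality
        self.fuse = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = decoder    # generates the single target modality

    def forward(self, conditions: dict) -> torch.Tensor:
        # Encode whichever conditions are present (text, image, audio, ...).
        embs = [self.encoders[m](x) for m, x in conditions.items()]  # each: (B, T_m, dim)
        fused = self.fuse(torch.cat(embs, dim=1))                    # joint condition sequence
        return self.decoder(fused)                                   # e.g. an image or video
```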

Any-to-Text

Any-to-Text focuses on producing textual outputs (captioning, explanation, dialogue, reasoning traces, instruction-following) from arbitrary visual/audio/3D/video inputs.

Any-to-Image

Any-to-Image methods generate images conditioned on diverse inputs beyond text, such as images, sketches, poses, layouts, audio cues, or multi-modal prompts.

Any-to-Video

Any-to-Video targets video generation from flexible conditions (text/image/video/audio/trajectory/layout).

X-to-Any (input-centric)

X-to-Any methods start from a fixed input modality but aim to generate multiple output modalities (e.g., text → image/video/audio; image → text/video/audio). This setting is useful for studying whether a model learns a shared multimodal representation that can be decoded into different modalities. Compared to Any-to-X, the emphasis is on multi-head decoding and output diversity, often requiring modality-specific decoders while sharing a common backbone or latent space.
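
As a complement to the Any-to-X sketch above, here is a minimal, hypothetical illustration of the X-to-Any pattern described in this section: a single shared encoder whose representation is decoded by modality-specific heads. The module names are placeholders, not an actual implementation.

```python
# Hypothetical X-to-Any sketch: shared backbone, modality-specific decoder heads.
import torch
import torch.nn as nn

class XToAnySketch(nn.Module):
    def __init__(self, encoder: nn.Module, heads: nn.ModuleDict):
        super().__init__()
        self.encoder = encoder  # fixed input modality, e.g. a text or image encoder
        self.heads = heads      # e.g. {"text": ..., "image": ..., "audio": ...}

    def forward(self, x: torch.Tensor, target_modalities: list) -> dict:
        shared = self.encoder(x)  # shared multimodal representation
        # Decode the same latent into each requested output modality.
        return {m: self.heads[m](shared) for m in target_modalities}
```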

Text-to-Any

Text-to-Any expands classic text-to-image into text-conditioned generation across multiple modalities, such as video, audio, music, speech, and even structured outputs. Typical solutions include unified diffusion/flow backbones, discrete token modeling, or LLM-centered generation that routes to modality experts.
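
Below is a toy sketch of the "LLM routes to modality experts" recipe mentioned above. The planner and expert callables are hypothetical stand-ins, not a real API, and the dispatch logic is only one of several possible designs.

```python
# Hypothetical text-to-any dispatch: an LLM plans, modality experts generate.
def text_to_any(prompt: str, target: str, plan_with_llm, experts: dict):
    """plan_with_llm: turns the prompt into an expert-ready specification.
    experts: maps a modality name ("image", "video", "audio", ...) to a generator."""
    spec = plan_with_llm(prompt, target)  # LLM rewrites/structures the request
    if target not in experts:
        raise ValueError(f"no expert registered for modality: {target}")
    return experts[target](spec)          # dispatch to the modality expert
```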

Image-to-Any

Image-to-Any aims to generate other modalities from visual input, such as image → text (captioning/VQA), image → video (animation), image → audio (foley/sound), or image → 3D (reconstruction). The main technical challenge is learning mappings from static visual cues to modalities with missing dimensions (e.g., time, sound source, geometry), which often requires strong priors, world knowledge, or intermediate structured representations.

Any Alignment

Multimodal VAE

A multimodal variational autoencoder (multimodal VAE) is a deep generative model designed to learn a shared latent representation from multiple data modalities, such as images, text, audio, or video, within a unified probabilistic framework. Unlike standard VAEs that model a single data distribution, multimodal VAEs aim to model the joint distribution over multiple modalities. In a typical multimodal VAE, each modality has its own encoder, while a shared latent space is used to generate all modalities through modality-specific decoders. This shared latent representation enables the model to capture cross-modal correlations and supports joint generation, cross-modal translation, and missing-modality inference. See MAE.md.
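
For intuition, here is a minimal multimodal-VAE sketch (an assumption-laden illustration, not the implementation referenced in MAE.md) with per-modality Gaussian encoders, a product-of-experts posterior over the shared latent, and per-modality decoders. Combining only the observed experts is what enables missing-modality inference.

```python
# Hypothetical multimodal VAE sketch with a product-of-experts joint posterior.
import torch
import torch.nn as nn

class MultimodalVAESketch(nn.Module):
    def __init__(self, encoders: nn.ModuleDict, decoders: nn.ModuleDict):
        super().__init__()
        self.encoders = encoders  # each returns (mu, logvar) for its modality
        self.decoders = decoders  # each maps a latent sample back to its modality

    def posterior(self, inputs: dict):
        # Product of Gaussian experts over whichever modalities are observed.
        precisions, weighted_means = [], []
        for m, x in inputs.items():
            mu, logvar = self.encoders[m](x)
            prec = torch.exp(-logvar)
            precisions.append(prec)
            weighted_means.append(mu * prec)
        prec = torch.stack(precisions).sum(0) + 1.0      # + standard-normal prior expert
        mu = torch.stack(weighted_means).sum(0) / prec
        return mu, -torch.log(prec)                      # (mu, logvar) of the joint posterior

    def forward(self, inputs: dict, output_modalities: list) -> dict:
        mu, logvar = self.posterior(inputs)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        return {m: self.decoders[m](z) for m in output_modalities}
```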


🐱‍🚀 Miscellaneous

Workshop

Survey

Awesome GitHub Repos

Interesting Works

Tools

⭐️ Star History

