CodeBind: Decoupled Representation Learning for Multimodal Alignment

Abstract

Multimodal representation alignment is pivotal for large language models and robotics. Traditional methods are often hindered by cross-modal information discrepancies and data scarcity, leading to suboptimal alignment spaces that overlook modality-unique features. CodeBind optimizes multimodal representation spaces through a modality-shared-specific codebook design. By incrementally aligning target and bridging modalities, it bypasses the need for fully paired data. The framework decomposes features into shared components for semantic consistency and specific components for modality-unique details, then uses compositional vector quantization to align text, image, video, audio, depth, thermal, tactile, 3D point cloud, and EEG modalities.

Key Features

Decoupled representation learning

Embeddings are split into modality-shared and modality-specific components, reducing over-alignment while preserving fine-grained information.
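
As a rough illustration of the split described above, the sketch below passes one embedding through two linear heads, one producing a shared component and one producing a specific component. The function name `decouple`, the linear heads, the dimensions, and the random weights are all assumptions made for illustration; this page does not describe the projection layers CodeBind actually uses.

```python
import numpy as np

rng = np.random.default_rng(0)

def decouple(embedding, w_shared, w_specific):
    """Split one embedding into a shared and a specific component.

    Hypothetical helper: two independent linear heads stand in for
    whatever projection the real model uses.
    """
    shared = embedding @ w_shared      # meant to carry cross-modal semantics
    specific = embedding @ w_specific  # meant to carry modality-unique detail
    return shared, specific

dim, shared_dim, specific_dim = 8, 4, 4
w_shared = rng.standard_normal((dim, shared_dim))
w_specific = rng.standard_normal((dim, specific_dim))

audio_embedding = rng.standard_normal(dim)  # placeholder for an encoder output
shared, specific = decouple(audio_embedding, w_shared, w_specific)
print(shared.shape, specific.shape)  # (4,) (4,)
```

Keeping the two heads separate means that only the shared component needs to be pulled toward other modalities, which is one plausible way to limit over-alignment.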

Modality-shared-specific codebooks

A shared codebook supports cross-modal consistency, while specific codebooks keep modality-unique information from being suppressed.
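
The codebook design can be sketched with plain nearest-neighbor vector quantization: the shared component is looked up in one codebook used by every modality, and the specific component is looked up in a codebook owned by its modality. This is a minimal sketch under assumptions, not the authors' implementation: the codebook sizes, the Euclidean lookup, the `compose` helper, and combining the two quantized parts by concatenation are all illustrative choices that this page does not specify.

```python
import numpy as np

def quantize(vectors, codebook):
    """Map each row of `vectors` to its nearest codebook entry (Euclidean)."""
    # Pairwise squared distances, shape (n_vectors, n_codes).
    dists = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)
    return codebook[indices], indices

rng = np.random.default_rng(0)
code_dim = 4
shared_codebook = rng.standard_normal((16, code_dim))  # one for all modalities
specific_codebooks = {                                  # one per modality
    "audio": rng.standard_normal((16, code_dim)),
    "depth": rng.standard_normal((16, code_dim)),
}

def compose(shared_part, specific_part, modality):
    """Quantize both parts and combine them (concatenation is an assumption)."""
    q_shared, shared_idx = quantize(shared_part, shared_codebook)
    q_specific, specific_idx = quantize(specific_part, specific_codebooks[modality])
    composed = np.concatenate([q_shared, q_specific], axis=-1)
    return composed, shared_idx, specific_idx

shared_part = rng.standard_normal((3, code_dim))
specific_part = rng.standard_normal((3, code_dim))
composed, shared_idx, specific_idx = compose(shared_part, specific_part, "audio")
print(composed.shape)  # (3, 8)
```

Because the shared indices come from the same codebook for every modality, two inputs from different modalities can be compared through their shared codes, while the specific codes stay free to encode details that have no counterpart elsewhere.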

Scalable multimodal alignment

CodeBind aligns target modalities through bridging modalities (text and vision) and validates the design across nine modalities for cross-modal classification and retrieval.
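
This page does not state the training objective used to align a target modality with a bridging modality. A common choice for this kind of pairwise alignment, and the one used by ImageBind, is a symmetric InfoNCE contrastive loss, so the sketch below uses it as a stand-in. The function name, the temperature value, and the toy data are assumptions.

```python
import numpy as np

def info_nce(target, bridge, temperature=0.07):
    """Symmetric InfoNCE over a batch where row i of each input is a pair."""
    target = target / np.linalg.norm(target, axis=1, keepdims=True)
    bridge = bridge / np.linalg.norm(bridge, axis=1, keepdims=True)
    logits = target @ bridge.T / temperature  # cosine similarities, scaled

    def diagonal_cross_entropy(lg):
        # Numerically stable log-softmax over each row.
        lg = lg - lg.max(axis=1, keepdims=True)
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        # The correct match for row i is column i.
        return -np.mean(np.diag(log_probs))

    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))

rng = np.random.default_rng(0)
bridge = rng.standard_normal((8, 16))                  # e.g. image embeddings
aligned = bridge + 0.01 * rng.standard_normal((8, 16)) # well-aligned target
shuffled = bridge[::-1]                                # mismatched pairs

print(info_nce(aligned, bridge) < info_nce(shuffled, bridge))  # True
```

Only pairs of (target, bridge) samples are needed here, which is consistent with the claim that fully paired data across all modalities is not required.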

Framework

Main Results

Result overview: (a) Embedding visualization of CodeBind-IB vs. ImageBind in audio-video-text alignment. (b) Fine-grained image retrieval on ImageNet. (c) Linear probing of image embeddings for category-level and fine-grained classification. CodeBind improves both ImageBind and ViT-Lens across the image, video, depth, audio, thermal, tactile, EEG, and 3D modalities. It also preserves modality-specific information, which yields gains in intra-modal fine-grained image retrieval and linear probing.

CodeBind-IB integrated into ImageBind

	Image		Video		Depth		Audio					Thermal		Tactile			EEG
Method	IN1K	P365	K400	MSR-VTT	NYU-D	SUN-D	AudioSet	VGGS	ESC	Clotho	AudioCaps	LLVIP	FLIR_v2	TAG-M	TAG-H/S	TAG-R/S	IN-EEG
ImageBind	77.7	45.4	50.5	36.1	54.0	35.1	17.6	27.8	66.9	6.0/28.4	9.3/42.3	63.4	46.6	24.2	65.7	69.8	18.4
CodeBind	79.3	55.5	54.4	37.8	59.3	45.7	21.1	30.5	71.0	6.9/28.6	13.3/53.8	95.5	97.2	42.6	83.9	78.2	33.1

CodeBind-VL integrated into ViT-Lens

	Depth		Audio					Tactile			EEG	3D
Method	NYU-D	SUN-D	AudioSet	VGGS	ESC	Clotho	AudioCaps	TAG-M	TAG-H/S	TAG-R/S	IN-EEG	ModelNet40
ViT-Lens	68.5	52.2	26.7	31.7	75.9	8.1/31.2	14.4/54.9	65.8	74.7	63.8	41.8/42.7	70.6/94.4
CodeBind-VL	71.1	54.8	29.2	39.5	78.8	8.5/32.8	15.6/55.0	67.6	76.1	72.8	54.5/54.1	78.3/96.5

The two tables preserve the comparison settings for ImageBind and ViT-Lens respectively. For retrieval, Recall@1 is reported on MSR-VTT and ESC, and Recall@1/Recall@10 are reported on Clotho and AudioCaps. For classification, Acc@1 is reported on all other datasets, except AudioSet, where mAP is reported.
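
For readers unfamiliar with the metrics, the sketch below shows how Recall@K and Acc@1 are typically computed from a similarity or logit matrix. It assumes exactly one correct candidate per query, located at the same index as the query; benchmarks with several valid matches per query need a small extension. The function names and the toy matrix are illustrative, and mAP is not covered.

```python
import numpy as np

def recall_at_k(similarity, k):
    """Fraction of queries whose true match (same index) is in the top k."""
    # Candidate indices per query, ordered from most to least similar.
    top_k = np.argsort(-similarity, axis=1)[:, :k]
    targets = np.arange(similarity.shape[0])[:, None]
    return float((top_k == targets).any(axis=1).mean())

def top1_accuracy(logits, labels):
    """Fraction of samples whose highest-scoring class equals the label."""
    return float((logits.argmax(axis=1) == labels).mean())

# Rows are queries, columns are candidates (or classes).
similarity = np.array([
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.5, 0.7, 0.6],
])

print(recall_at_k(similarity, 1))  # 1 of 3 queries ranks its match first
print(recall_at_k(similarity, 2))  # 1.0
print(top1_accuracy(similarity, np.array([0, 2, 1])))  # 1.0
```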

Downstream applications

CodeBind enables zero-shot cross-modal object localization and any-modal-to-image generation by integrating diverse modalities into established vision-language and generative frameworks (e.g., GroundingDINO and Stable unCLIP) without additional training.

BibTeX

@article{chen2026codebind,
  title     = {CodeBind: Decoupled Representation Learning for Multimodal Alignment with Unified Compositional Codebook},
  author    = {Chen, Zeyu and Li, Jie and Han, Kai},
  journal   = {arXiv preprint arXiv:2605.18257},
  year      = {2026},
}

ACL 2026