TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives

Arizona State University; University of Maryland, Baltimore County

NeurIPS 2024


Dataset


We generate a synthetic dataset to counter the lack of compositional diversity in CC3M and CC12M by complementing the data with hard negative captions and corresponding negative images.
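To illustrate how such triplets (image, caption, plus a hard negative caption and negative image) can be consumed during training, below is a minimal NumPy sketch of an InfoNCE-style contrastive objective whose caption and image pools are enlarged with the hard negatives. This is an assumption-laden illustration, not the authors' exact loss; the function name `triplet_contrastive_loss` and the embedding shapes are hypothetical.

```python
import numpy as np

def cosine_sim(a, b):
    """Pairwise cosine similarity between rows of a (N, D) and b (M, D)."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def triplet_contrastive_loss(img, txt, neg_img, neg_txt, temperature=0.07):
    """Symmetric contrastive loss (sketch): each positive pair in the batch
    competes against in-batch negatives AND the synthetic hard negatives.

    img, txt, neg_img, neg_txt: (B, D) embedding arrays; row i of neg_txt /
    neg_img is the hard negative for pair i.
    """
    all_txt = np.concatenate([txt, neg_txt], axis=0)      # (2B, D) caption pool
    all_img = np.concatenate([img, neg_img], axis=0)      # (2B, D) image pool
    logits_i2t = cosine_sim(img, all_txt) / temperature   # (B, 2B)
    logits_t2i = cosine_sim(txt, all_img) / temperature   # (B, 2B)
    labels = np.arange(len(img))                          # positive at column i

    def xent(logits):
        logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
        p = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
        return -np.log(p[labels, labels]).mean()

    return 0.5 * (xent(logits_i2t) + xent(logits_t2i))
```

Appending the negatives to the similarity pool is one simple way to make each anchor discriminate against its own compositionally perturbed caption and image; the loss shrinks as matched pairs align and grows when they do not.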

Performance


Compositional evaluation of the methods on the SugarCrepe benchmark.




Zero-shot image-text retrieval and classification results.




Ablation on filtering high-quality image-text pairs from TripletData.




What's holding back CLIP models? Ablation with respect to frozen modality encoders.

Relevant Projects


ECLIPSE (CVPR'24)

A Resource-Efficient Text-to-Image Prior for Image Generations


WOUAF (CVPR'24)

Weight Modulation for User Attribution and Fingerprinting in T2I Models.