YangYongJin/APEX

APEX: text Adapter, visual Prompt, and adaptive Ensemble for cross(X)-modality

Improving Adaptability and Generalizability of Efficient Transfer Learning for Vision-Language Models

Official implementation of the paper "Improving Adaptability and Generalizability of Efficient Transfer Learning for Vision-Language Models".


Highlights

[Figure: main figure]

Abstract: Vision-Language Models (VLMs) like CLIP have demonstrated remarkable applicability across a variety of downstream tasks, including zero-shot image classification. Recently, prompts and adapters have gained significant attention as efficient transfer learning methods for adapting to downstream tasks. However, the roles of vision and text prompts, as well as adapters, in terms of generalization and transfer difficulty have been overlooked, limiting performance on unseen tasks. In this paper, we empirically analyze how VLMs behave when using vision and text prompts, adapters, and combinations of these components, an exploration that is novel to our study. We observe that utilizing vision prompts for class separability and text adapters for task adaptation is crucial for adaptability and generalizability. Moreover, to improve generalization across every domain, we propose an adaptive ensemble method that effectively combines the general knowledge of VLMs with task-specific knowledge according to transfer difficulty. In experiments on extensive benchmarks, our method consistently outperforms all baselines, particularly on unseen tasks, demonstrating the effectiveness of our proposed approach.
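
To make the ensemble idea in the abstract concrete, here is a minimal sketch of a convex combination of two sets of per-class scores: one from the zero-shot model (general knowledge) and one from the adapted model (task-specific knowledge). It is not taken from this repository. The names `ensemble_scores`, `predict`, and `alpha` are hypothetical, and how APEX actually sets the mixing weight from transfer difficulty is not shown here; `alpha` is simply passed in by hand.

```python
from typing import List


def ensemble_scores(zero_shot: List[float], adapted: List[float], alpha: float) -> List[float]:
    """Blend per-class scores as alpha * adapted + (1 - alpha) * zero_shot.

    alpha is a hypothetical mixing weight in [0, 1]; alpha = 0 keeps only the
    zero-shot scores and alpha = 1 keeps only the adapted scores.
    """
    if len(zero_shot) != len(adapted):
        raise ValueError("score lists must have the same number of classes")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [alpha * a + (1.0 - alpha) * z for z, a in zip(zero_shot, adapted)]


def predict(scores: List[float]) -> int:
    """Return the index of the highest-scoring class."""
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    # Made-up scores for three classes, for illustration only.
    zero_shot = [2.0, 1.0, 0.5]
    adapted = [0.5, 3.0, 1.0]
    for alpha in (0.0, 0.5, 1.0):
        blended = ensemble_scores(zero_shot, adapted, alpha)
        print(alpha, blended, predict(blended))
```

A small `alpha` leans on the zero-shot model and a large `alpha` leans on the adapted one; see the paper for how the weight is chosen in practice.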

☑️ Supported Methods

| Method | Paper | Configs | Training Scripts |
|---|---|---|---|
| APEX | preprint | link | link |
| MaPLe | CVPR 2023 | link | link |
| CoOp | IJCV 2022 | link | link |
| CoCoOp | CVPR 2022 | link | link |
| Deep Vision Prompting | - | link | link |
| Deep Language Prompting | - | link | link |
| Independent V-L Prompting | - | link | link |

Results

APEX in comparison with existing methods

The results below report accuracy on both base and novel classes across 11 recognition datasets. All numbers are averaged over 20 seeds and were reproduced by us.

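| Name | Base Acc. | Novel Acc. | HM | Epochs |
|---|---|---|---|---|
| CLIP | 69.34 | 74.22 | 71.70 | - |
| CLIP-Adapter | 83.23 | 70.13 | 75.64 | 50 |
| CoCoOp | 81.11 | 70.55 | 75.03 | 10 |
| MaPLe | 82.52 | 74.24 | 77.86 | 5 |
| PromptSRC | 84.36 | 75.37 | 79.39 | 20 |
| APEX (ours) | 83.99 | 76.76 | 80.04 | 15 |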
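
In base-to-novel benchmarks of this kind, HM conventionally denotes the harmonic mean of base and novel accuracy. Assuming that reading, the sketch below shows the computation; the helper name `harmonic_mean` is hypothetical and not part of this repository.

```python
def harmonic_mean(base_acc: float, novel_acc: float) -> float:
    """Harmonic mean of two accuracies: 2 * b * n / (b + n)."""
    if base_acc < 0 or novel_acc < 0:
        raise ValueError("accuracies must be non-negative")
    total = base_acc + novel_acc
    if total == 0:
        # Both accuracies are zero; define the harmonic mean as zero.
        return 0.0
    return 2.0 * base_acc * novel_acc / total


if __name__ == "__main__":
    # CLIP row of the table: Base 69.34, Novel 74.22.
    print(round(harmonic_mean(69.34, 74.22), 2))
```

Applied to the averaged Base and Novel columns, this matches the CLIP row but not every other row exactly, so the reported HM values may have been aggregated differently (for example, computed per dataset and then averaged). That explanation is an assumption, not something stated here.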

Installation

For installation and other package requirements, please follow the instructions detailed in INSTALL.md.

Data preparation

Please follow the instructions in DATASETS.md to prepare all datasets.

Training and Evaluation

Please refer to RUN.md for detailed instructions on training, evaluating, and reproducing the results using our pre-trained models.


Citation

If you use our work, please consider citing:

@article{yang2023improving,
  title={Improving Adaptability and Generalizability of Efficient Transfer Learning for Vision-Language Models},
  author={Yang, Yongjin and Ko, Jongwoo and Yun, Se-Young},
  journal={arXiv preprint arXiv:2311.15569},
  year={2023}
}

Acknowledgements

Our code is based on the CoOp/CoCoOp and MaPLe repositories. We thank the authors for releasing their code. If you use our model and code, please consider citing these works as well.
