Official implementation of the paper "Improving Adaptability and Generalizability of Efficient Transfer Learning for Vision-Language Models".
Abstract: Vision-Language Models (VLMs) like CLIP have demonstrated remarkable applicability across a variety of downstream tasks, including zero-shot image classification. Recently, the use of prompts or adapters for efficient transfer learning has gained significant attention for effectively adapting to downstream tasks. However, the roles of vision and text prompts, as well as adapters in terms of generalization and transfer difficulty, have been overlooked, limiting performance on unseen tasks. In this paper, we empirically analyze how VLMs behave when using vision and text prompts, adapters, and a combination of these components, marking a novel exploration by our study. Our observations find that utilizing vision prompts for class separability and text adapters for task adaptation is crucial for adaptability and generalizability. Moreover, to improve generalization across every domain, we propose an adaptive ensemble method that effectively combines the general knowledge of VLMs with task-specific knowledge according to transfer difficulty. Upon experimenting with extensive benchmarks, our method consistently outperforms all baselines, particularly on unseen tasks, demonstrating the effectiveness of our proposed approach.
| Method | Paper | Configs | Training Scripts |
|---|---|---|---|
| APEX | preprint | link | link |
| MaPLe | CVPR 2023 | link | link |
| CoOp | IJCV 2022 | link | link |
| Co-CoOp | CVPR 2022 | link | link |
| Deep Vision Prompting | - | link | link |
| Deep Language Prompting | - | link | link |
| Independent V-L Prompting | - | link | link |
The table below reports accuracy on base and novel classes across 11 recognition datasets. Results are averaged over 20 seeds, and all were reproduced by us.
| Name | Base Acc. | Novel Acc. | HM | Epochs |
|---|---|---|---|---|
| CLIP | 69.34 | 74.22 | 71.70 | - |
| CLIP-Adapter | 83.23 | 70.13 | 75.64 | 50 |
| Co-CoOp | 81.11 | 70.55 | 75.03 | 10 |
| MaPLe | 82.52 | 74.24 | 77.86 | 5 |
| PromptSRC | 84.36 | 75.37 | 79.39 | 20 |
| APEX (ours) | 83.99 | 76.76 | 80.04 | 15 |
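
The HM column can be related to the two accuracy columns with a short sketch, assuming HM denotes the harmonic mean of base and novel accuracy, as is common in base-to-novel evaluation. The function name `harmonic_mean` is a hypothetical helper for illustration, not part of this repository. Reported HM values may have been computed per dataset or per seed before averaging, so recomputing from the averaged columns need not reproduce every row exactly.

```python
def harmonic_mean(base_acc: float, novel_acc: float) -> float:
    """Harmonic mean of two accuracies given in percent.

    Hypothetical helper for illustration; assumes HM in the table
    means the harmonic mean of base and novel accuracy.
    """
    total = base_acc + novel_acc
    if total == 0:
        # Avoid division by zero when both accuracies are 0.
        return 0.0
    return 2 * base_acc * novel_acc / total


# Zero-shot CLIP row from the table above.
clip_hm = harmonic_mean(69.34, 74.22)
print(f"{clip_hm:.2f}")  # prints 71.70
```

The harmonic mean penalizes a large gap between the two accuracies, which is why a method with high base accuracy but low novel accuracy scores lower on HM than its arithmetic average would suggest.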
For installation and other package requirements, please follow the instructions in INSTALL.md.
Please follow the instructions in DATASETS.md to prepare all datasets.
Please refer to RUN.md for detailed instructions on training, evaluating, and reproducing the results using our pre-trained models.
If you use our work, please consider citing:
@article{yang2023improving,
title={Improving Adaptability and Generalizability of Efficient Transfer Learning for Vision-Language Models},
author={Yang, Yongjin and Ko, Jongwoo and Yun, Se-Young},
journal={arXiv preprint arXiv:2311.15569},
year={2023}
}

Our code is based on the Co-CoOp and CoOp repository and the MaPLe repository. We thank the authors for releasing their code. If you use our model and code, please consider citing these works as well.
