This folder contains the implementation of LLaVolta, an Efficient Large Language and Vision Assistant. If you find this work useful, please cite:
```bibtex
@inproceedings{chen2024efficient,
  title={Efficient large multi-modal models via visual context compression},
  author={Chen, Jieneng and Ye, Luoxin and He, Ju and Wang, Zhao-Yang and Khashabi, Daniel and Yuille, Alan},
  booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
  year={2024}
}
```
Note: the code was developed on Ubuntu 20.04/22.04 with CUDA 12.1. It builds on LLaVA, so the installation closely follows the original LLaVA repository:

- Clone this repository and navigate to the LLaVolta folder
```bash
git clone https://github.com/Beckschen/LLaVolta
cd LLaVolta
```
- Install the package
```bash
conda create -n llavolta python=3.10 -y
conda activate llavolta
pip install --upgrade pip
pip install -e .
```
- Install additional packages for training cases
pip install -e ".[train]"
pip install flash-attn --no-build-isolation --no-cache-dir
cd llava/eval
tar xvf table.tar
cd ../..- Download the training data for both pretraining and fine-tuning from the original LLaVA repository.
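Optionally, you can sanity-check the environment at this point; this snippet is only a suggestion, not part of the original setup:

```bash
# Optional check (not from the original instructions): confirm PyTorch was
# installed with CUDA support before launching training.
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
```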
- Download the training data for both pretraining and fine-tuning from the original LLaVA repository.
- Set the necessary path variables: `ROOT_DATA`, `ROOT_WEIGHT`, and `ROOT_LOG` (optional).
- Begin training using the provided scripts. We provide four examples: 4stage, heavy_compression, light_compression, and reproduce.
```bash
NAME=4stage # Option: {heavy-compression, light-compression, reproduce}
bash scripts/v1_5/train-$NAME.sh
```
To evaluate, run the scripts under `scripts/v1_5/eval/$NAME`, where `NAME` is the name of the checkpoint. We provide four examples: 4stage, heavy_compression, light_compression, and reproduce.

For all provided scripts, please first fill in the necessary path variables: `ROOT_DATA`, `ROOT_WEIGHT`, and `ROOT_LOG` (optional).
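As a reference, here is a minimal sketch of one way these variables might be set; the directory names are purely illustrative, and depending on the scripts you may either export them in your shell or edit the corresponding variables at the top of each script:

```bash
# Sketch only: the locations below are illustrative, not prescribed by the repo.
export ROOT_DATA=/path/to/llavolta/data      # datasets, e.g. $ROOT_DATA/eval/<benchmark>
export ROOT_WEIGHT=/path/to/llavolta/weights # model checkpoints
export ROOT_LOG=/path/to/llavolta/logs       # optional: training/evaluation logs
mkdir -p "$ROOT_DATA/eval" "$ROOT_WEIGHT" "$ROOT_LOG"
```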
- For VQAv2, download `test2015` and put it under `$ROOT_DATA/eval/vqav2` (see the download sketch below).
- Multi-GPU inference.
```bash
NAME=4stage # Option: {heavy-compression, light-compression, reproduce}
bash scripts/v1_5/eval/$NAME/vqav2.sh
```
- Submit the results to the evaluation server.
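A minimal download sketch for the VQAv2 step above. The URL is an assumption based on the public COCO image host (VQAv2 uses the COCO `test2015` images); verify it against the official VQAv2 download page before use.

```bash
# Sketch only: URL assumed from the public COCO download page, not from this repo.
mkdir -p $ROOT_DATA/eval/vqav2
wget -P $ROOT_DATA/eval/vqav2 http://images.cocodataset.org/zips/test2015.zip
unzip $ROOT_DATA/eval/vqav2/test2015.zip -d $ROOT_DATA/eval/vqav2  # -> $ROOT_DATA/eval/vqav2/test2015
```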
- For GQA, download the data and evaluation scripts following the official instructions and put them under `$ROOT_DATA/eval/gqa/data`. You may need to modify `eval.py` due to missing assets in the GQA v1.2 release.
- Multi-GPU inference.
```bash
NAME=4stage # Option: {heavy-compression, light-compression, reproduce}
bash scripts/v1_5/eval/$NAME/gqa.sh
```
- For VizWiz, download `test.json`, extract `test.zip` to `test`, and put them under `$ROOT_DATA/eval/vizwiz` (see the layout sketch below).
- Single-GPU inference.
```bash
NAME=4stage # Option: {heavy-compression, light-compression, reproduce}
bash scripts/v1_5/eval/$NAME/vizwiz.sh
```
- Submit the results in `$ROOT_DATA/eval/vizwiz/answers_upload` to the evaluation server.
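A minimal sketch of the expected VizWiz layout, using the file names from the step above. It assumes `test.zip` unpacks to a flat folder of images; check the archive and adjust the `-d` target if it already contains a top-level `test/` directory.

```bash
# Sketch only: assumes test.json and test.zip sit in the current directory.
mkdir -p $ROOT_DATA/eval/vizwiz
mv test.json $ROOT_DATA/eval/vizwiz/
unzip test.zip -d $ROOT_DATA/eval/vizwiz/test  # images should land in $ROOT_DATA/eval/vizwiz/test/
```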
- For ScienceQA, under `$ROOT_DATA/eval/scienceqa`, download `images`, `pid_splits.json`, and `problems.json` from the `data/scienceqa` folder of the ScienceQA repo.
- Single-GPU inference and evaluation.
```bash
NAME=4stage # Option: {heavy-compression, light-compression, reproduce}
bash scripts/v1_5/eval/$NAME/sqa.sh
```
- For TextVQA, download `TextVQA_0.5.1_val.json` and the images, and extract them to `$ROOT_DATA/eval/textvqa` (see the download sketch below).
- Single-GPU inference and evaluation.
```bash
NAME=4stage # Option: {heavy-compression, light-compression, reproduce}
bash scripts/v1_5/eval/$NAME/textvqa.sh
```
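A download sketch for the TextVQA step above. The URLs are assumptions taken from the upstream LLaVA evaluation guide; verify them against the official TextVQA site before use.

```bash
# Sketch only: URLs assumed from the upstream LLaVA evaluation guide, not from this repo.
mkdir -p $ROOT_DATA/eval/textvqa
wget -P $ROOT_DATA/eval/textvqa https://dl.fbaipublicfiles.com/textvqa/data/TextVQA_0.5.1_val.json
wget -P $ROOT_DATA/eval/textvqa https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip
unzip $ROOT_DATA/eval/textvqa/train_val_images.zip -d $ROOT_DATA/eval/textvqa
```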
- For POPE, download `coco` from the POPE repo and put it under `$ROOT_DATA/eval/pope`.
- Single-GPU inference and evaluation.
```bash
NAME=4stage # Option: {heavy-compression, light-compression, reproduce}
bash scripts/v1_5/eval/$NAME/pope.sh
```
- For MME, download the data following the official instructions.
- Download the images to `MME_Benchmark_release_version`.
- Put the official `eval_tool` and `MME_Benchmark_release_version` under `$ROOT_DATA/eval/MME`.
- Single-GPU inference and evaluation.
```bash
NAME=4stage # Option: {heavy-compression, light-compression, reproduce}
bash scripts/v1_5/eval/$NAME/mme.sh
```
- For MMBench, download `mmbench_dev_20230712.tsv` and put it under `$ROOT_DATA/eval/mmbench` (see the pre-flight sketch below).
- Single-GPU inference.
```bash
NAME=4stage # Option: {heavy-compression, light-compression, reproduce}
bash scripts/v1_5/eval/$NAME/mmbench.sh
```
- Submit the results in `$ROOT_DATA/eval/mmbench/answers_upload/mmbench_dev_20230712` to the evaluation server.
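A small pre-flight sketch for the MMBench step above (paths taken from the instructions): it checks that the benchmark tsv is in place before running and points at where the files to submit should appear afterwards.

```bash
# Sketch only: file and directory names come from the steps above.
TSV=$ROOT_DATA/eval/mmbench/mmbench_dev_20230712.tsv
[ -f "$TSV" ] || echo "missing $TSV -- download mmbench_dev_20230712.tsv first"
# After mmbench.sh finishes, the files to submit should appear under:
ls $ROOT_DATA/eval/mmbench/answers_upload/mmbench_dev_20230712
```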
- For MMBench-CN, download `mmbench_dev_cn_20231003.tsv` and put it under `$ROOT_DATA/eval/mmbench`.
- Single-GPU inference.
```bash
NAME=4stage # Option: {heavy-compression, light-compression, reproduce}
bash scripts/v1_5/eval/$NAME/mmbench_cn.sh
```
- Submit the results in `$ROOT_DATA/eval/mmbench/answers_upload/mmbench_dev_cn_20231003` to the evaluation server.
- For SEED-Bench, follow the official instructions to download the images and the videos. Put the images under `$ROOT_DATA/eval/seed_bench/SEED-Bench-image`. Note that we only use the image subset to evaluate LLaVolta.
- Multi-GPU inference and evaluation.
```bash
NAME=4stage # Option: {heavy-compression, light-compression, reproduce}
bash scripts/v1_5/eval/$NAME/seed.sh
```
- For LLaVA-Bench-in-the-Wild, extract the contents of `llava-bench-in-the-wild` to `$ROOT_DATA/eval/llava-bench-in-the-wild`.
- Single-GPU inference and evaluation.
```bash
NAME=4stage # Option: {heavy-compression, light-compression, reproduce}
bash scripts/v1_5/eval/$NAME/llavabench.sh
```
- For MM-Vet, extract `mm-vet.zip` to `$ROOT_DATA/eval/mmvet`.
- Single-GPU inference.
```bash
NAME=4stage # Option: {heavy-compression, light-compression, reproduce}
bash scripts/v1_5/eval/$NAME/mmvet.sh
```
- Evaluate the predictions in `$ROOT_DATA/eval/mmvet/results` using the official Jupyter notebook.
Luoxin Ye (@feiyu12138) is the primary contributor to the codebase. We have archived the project here to maintain a clean and organized code style.
