This is the PyTorch implementation of EMCL-Net.
```shell
conda create -n EMCL python=3.9
conda activate EMCL
pip install -r requirements.txt
pip install torch==1.8.1+cu102 torchvision==0.9.1+cu102 -f https://download.pytorch.org/whl/torch_stable.html
```

Download the pretrained CLIP checkpoint into `tvr/models`:

```shell
cd tvr/models
wget https://openaipublic.azureedge.net/clip/models/40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af/ViT-B-32.pt
# wget https://openaipublic.azureedge.net/clip/models/5806e77cd80f8b59890b7e101eabd078d9fb84e6937f9e85e4ecb61988df416f/ViT-B-16.pt
# wget https://openaipublic.azureedge.net/clip/models/b8cca3fd41ae0c99ba7e8951adf17d267cdb84cd88be6f7c2e0eca1737a03836/ViT-L-14.pt
```

For MSR-VTT, the official data and video links can be found in link.
For convenience, the splits and captions can be found in the sharing from CLIP4Clip:

```shell
wget https://github.com/ArrowLuo/CLIP4Clip/releases/download/v0.0/msrvtt_data.zip
```

The raw videos can be found in the sharing from Frozen in Time:

```shell
wget https://www.robots.ox.ac.uk/~maxbain/frozen-in-time/data/MSRVTT.zip
```

For MSVD, the official data and video links can be found in link.
For convenience, we share the processed dataset in link:
https://disk.pku.edu.cn:443/link/CC02BD15907BFFF63E5AAE4BF353A202

For LSMDC, the official data and video links can be found in link. Due to license restrictions, we cannot share this dataset.

For ActivityNet Captions, the official data and video links can be found in link. For convenience, we share the processed dataset in link:
https://disk.pku.edu.cn:443/link/83351ABDAEA4A17A5A139B799BB524AC

For DiDeMo, the official data and video links can be found in link. For convenience, we share the processed dataset in link:
https://disk.pku.edu.cn:443/link/BBF9F5990FC4D7FD5EA9777C32901E62

Compress the raw videos before training:

```shell
python preprocess/compress_video.py --input_root [raw_video_path] --output_root [compressed_video_path]
```

This script compresses each video to 3 fps with width 224 (or height 224). Modify the variables for your customization.
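As a sanity check on the compression settings, the sketch below illustrates the resize rule: the short side is scaled to 224 pixels while the long side keeps the aspect ratio. The function name and the even-number rounding (which video codecs typically require) are illustrative assumptions, not the repo's exact code.

```python
def target_size(width, height, short_side=224):
    """Scale the short side to `short_side`, keep aspect ratio on the long side."""
    if width >= height:
        new_h = short_side
        # round the long side to an even number of pixels (codec-friendly)
        new_w = int(round(width * short_side / height / 2) * 2)
    else:
        new_w = short_side
        new_h = int(round(height * short_side / width / 2) * 2)
    return new_w, new_h

print(target_size(1920, 1080))  # landscape: height becomes 224
print(target_size(720, 1280))   # portrait: width becomes 224
```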
Results on MSR-VTT:

| Protocol | T2V R@1 | T2V R@5 | T2V R@10 | Mean R |
|---|---|---|---|---|
| EMCL-Net (2 V100 GPUs) | 47.0 | 72.6 | 83.0 | 13.6 |
| EMCL-Net (8 V100 GPUs) | 48.2 | 74.7 | 83.6 | 13.1 |
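For reference, the table's metrics follow the standard retrieval definitions: R@K is the percentage of queries whose matching video ranks in the top K, and Mean R is the average rank. The sketch below computes them from a toy similarity matrix (the values are made up for illustration).

```python
def retrieval_metrics(sim):
    """Rows are text queries, columns are videos; query i matches video i."""
    ranks = []
    for i, row in enumerate(sim):
        # rank of the correct video in the descending score order
        order = sorted(range(len(row)), key=lambda j: -row[j])
        ranks.append(order.index(i) + 1)
    n = len(ranks)
    return {
        "R@1": 100 * sum(r <= 1 for r in ranks) / n,
        "R@5": 100 * sum(r <= 5 for r in ranks) / n,
        "MeanR": sum(ranks) / n,
    }

sim = [
    [0.9, 0.1, 0.2],  # query 0 ranks its video 1st
    [0.3, 0.2, 0.8],  # query 1 ranks its video 3rd
    [0.2, 0.6, 0.4],  # query 2 ranks its video 2nd
]
print(retrieval_metrics(sim))  # ranks [1, 3, 2] -> MeanR 2.0
```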
We recommend using more GPUs for better performance:
```shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python -m torch.distributed.launch \
--master_port 2502 \
--nproc_per_node=8 \
main_retrieval.py \
--do_train 1 \
--workers 8 \
--n_display 50 \
--epochs 5 \
--lr 1e-4 \
--coef_lr 1e-3 \
--batch_size 128 \
--batch_size_val 128 \
--anno_path data/MSR-VTT/anns \
--video_path ${DATA_PATH}/MSRVTT_Videos \
--datatype msrvtt \
--max_words 32 \
--max_frames 12 \
--video_framerate 1 \
--output_dir ${OUTPUT_PATH}
```

You can also use 2 V100 GPUs to reproduce the results in the paper:
```shell
CUDA_VISIBLE_DEVICES=0,1 \
python -m torch.distributed.launch \
--master_port 2502 \
--nproc_per_node=2 \
main_retrieval.py \
--do_train 1 \
--workers 8 \
--n_display 50 \
--epochs 5 \
--lr 1e-4 \
--coef_lr 1e-3 \
--batch_size 128 \
--batch_size_val 128 \
--anno_path data/MSR-VTT/anns \
--video_path ${DATA_PATH}/MSRVTT_Videos \
--datatype msrvtt \
--max_words 32 \
--max_frames 12 \
--video_framerate 1 \
--output_dir ${OUTPUT_PATH}
```

Train on LSMDC:

```shell
CUDA_VISIBLE_DEVICES=0,1,2,3 \
python -m torch.distributed.launch \
--master_port 2502 \
--nproc_per_node=4 \
main_retrieval.py \
--do_train 1 \
--workers 8 \
--n_display 10 \
--epochs 5 \
--lr 1e-4 \
--coef_lr 1e-3 \
--batch_size 128 \
--batch_size_val 128 \
--anno_path ${Anno_PATH} \
--video_path ${DATA_PATH} \
--datatype lsmdc \
--max_words 32 \
--max_frames 12 \
--video_framerate 1 \
--output_dir ${OUTPUT_PATH}
```

| Protocol | T2V R@1 | T2V R@5 | T2V R@10 | Median R | Mean R | V2T R@1 | V2T R@5 | V2T R@10 | Median R | Mean R |
|---|---|---|---|---|---|---|---|---|---|---|
| EMCL-Net | 42.1 | 71.3 | 81.1 | 2.0 | 17.6 | 54.3 | 81.3 | 88.1 | 1.0 | 5.6 |
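The training scripts cap clips at `--max_frames` frames sampled at `--video_framerate` fps. A plausible sampling rule is sketched below as an illustration (an assumption, not necessarily the repo's exact code): take one frame per period of the target frame rate, then uniformly subsample down to the cap.

```python
def sample_frame_indices(n_total, native_fps=30, target_fps=1, max_frames=12):
    # pick frames at the target fps (every `step`-th frame of the raw video)
    step = max(1, round(native_fps / target_fps))
    idx = list(range(0, n_total, step))
    if len(idx) > max_frames:
        # uniformly subsample the candidates down to max_frames
        stride = len(idx) / max_frames
        idx = [idx[int(i * stride)] for i in range(max_frames)]
    return idx

print(len(sample_frame_indices(900)))  # 30 s video at 30 fps -> capped at 12
print(len(sample_frame_indices(150)))  # 5 s video -> only 5 frames kept
```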
Train on MSVD:

```shell
CUDA_VISIBLE_DEVICES=0,1,2,3 \
python -m torch.distributed.launch \
--master_port 2502 \
--nproc_per_node=4 \
main_retrieval.py \
--do_train 1 \
--workers 8 \
--n_display 10 \
--epochs 20 \
--lr 1e-4 \
--coef_lr 1e-3 \
--batch_size 128 \
--batch_size_val 128 \
--anno_path ${Anno_PATH} \
--video_path ${DATA_PATH} \
--datatype msvd \
--max_words 32 \
--max_frames 12 \
--video_framerate 1 \
--output_dir ${OUTPUT_PATH}
```

Train on ActivityNet Captions:

```shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python -m torch.distributed.launch \
--master_port 2502 \
--nproc_per_node=8 \
main_retrieval.py \
--do_train 1 \
--workers 8 \
--n_display 10 \
--epochs 5 \
--lr 1e-4 \
--coef_lr 1e-3 \
--batch_size 128 \
--batch_size_val 128 \
--anno_path ${Anno_PATH} \
--video_path ${DATA_PATH} \
--datatype activity \
--max_words 64 \
--max_frames 64 \
--video_framerate 1 \
--output_dir ${OUTPUT_PATH}
```

| Protocol | T2V R@1 | T2V R@5 | T2V R@10 | Median R | Mean R | V2T R@1 | V2T R@5 | V2T R@10 | Median R | Mean R |
|---|---|---|---|---|---|---|---|---|---|---|
| EMCL-Net | 46.8 | 74.3 | 83.1 | 2.0 | 12.3 | 45.0 | 73.2 | 82.7 | 2.0 | 9.0 |
Train on DiDeMo:

```shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python -m torch.distributed.launch \
--master_port 2502 \
--nproc_per_node=8 \
main_retrieval.py \
--do_train 1 \
--workers 8 \
--n_display 10 \
--epochs 5 \
--lr 1e-4 \
--coef_lr 1e-3 \
--batch_size 128 \
--batch_size_val 128 \
--anno_path ${Anno_PATH} \
--video_path ${DATA_PATH} \
--datatype didemo \
--max_words 64 \
--max_frames 64 \
--video_framerate 1 \
--output_dir ${OUTPUT_PATH}
```
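All scripts pass both `--lr` and `--coef_lr`. In CLIP4Clip-style code bases the pretrained CLIP backbone is updated at `lr * coef_lr` while newly added modules use the full `lr`; the sketch below assumes EMCL-Net follows the same convention (the parameter-name prefix `clip.` is illustrative, not the repo's exact naming).

```python
def build_param_groups(named_params, lr=1e-4, coef_lr=1e-3):
    """Split parameters into a gently-tuned CLIP group and a full-lr group."""
    clip_params, new_params = [], []
    for name, p in named_params:
        (clip_params if name.startswith("clip.") else new_params).append(p)
    return [
        {"params": clip_params, "lr": lr * coef_lr},  # small lr for pretrained CLIP
        {"params": new_params, "lr": lr},             # full lr for new modules
    ]

groups = build_param_groups([("clip.visual.proj", "w1"), ("emcl.dict", "w2")])
print([g["lr"] for g in groups])  # CLIP group ~1e-7, new-module group 1e-4
```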