LAVILA Model Zoo

Multi-node Training

We use multi-node training on a Slurm cluster with submitit to produce the results and models in the paper. Please install submitit in your conda environment:

pip install submitit

Pre-training

Please refer to PRETRAIN.md.

Narrator

Visual Encoder	Text Decoder	METEOR	ROUGE-L	CIDEr	Pre-trained Vis. Encoder (md5)	checkpoint (md5)
TSF-B	GPT-2	0.282	0.517	0.833	download (dbcc4d)	download (68a71f)
TSF-L@HR	GPT-2 XL	0.298	0.539	0.977	download (5c69b8)	download (443263)
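The values in parentheses next to each download link can be used to sanity-check a downloaded file. The sketch below assumes that each value is the leading six hex digits of the file's full md5 digest and that GNU `md5sum` is available; `verify_md5_prefix` is a hypothetical helper name, not part of this codebase.

```shell
# Hypothetical helper: check that a file's md5 digest starts with a short prefix.
# Assumption: the parenthesized table values are the leading hex digits of the md5.
verify_md5_prefix() {
    vmp_file="$1"
    vmp_prefix="$2"
    # md5sum prints "<digest>  <filename>"; keep only the digest
    vmp_digest=$(md5sum "$vmp_file" | awk '{print $1}')
    case "$vmp_digest" in
        "$vmp_prefix"*)
            echo "OK: $vmp_file ($vmp_digest)"
            ;;
        *)
            echo "MISMATCH: $vmp_file has $vmp_digest, expected prefix $vmp_prefix" >&2
            return 1
            ;;
    esac
}
```

For example, `verify_md5_prefix "$CHECKPOINT" 68a71f` would check a downloaded TSF-B narrator checkpoint against the prefix listed in the table.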

Ego4D val split

torchrun --nproc_per_node=1 \
    eval_narrator.py \
    --caption-top-p 0.95 --caption-temperature 0.7 \
    --eval-freq 10000 \
    --resume $CHECKPOINT

Zero-shot

	Backbone	EK-100 MIR avg. mAP	EK-100 MIR avg. nDCG	Charades-Ego mAP^	EGTEA mean acc.	EgoMCQ intra-video acc.	checkpoint
Prev. SOTA^^	TSF-B	22.1/23.3	22.1/27.9	25.2	17.6	57.2	Epoch 1, best epoch
LAVILA	TSF-B	29.7/30.9	31.5/32.0	26.8	28.9	59.9	Epoch 1^, Epoch 5
LAVILA	TSF-L	35.0/36.1	34.2/34.6	28.9	34.1	63.1	Epoch 1^, Epoch 3

^ Note that the pre-trained checkpoint used to evaluate CharadesEgo is different from the one used to evaluate the other datasets. Specifically, we use the checkpoint at epoch 1 for zero-shot evaluation on CharadesEgo, and the checkpoint that achieves the best average mAP on EK-100 MIR for the other datasets, as is done in EgoVLP. Our guess is that, since CharadesEgo videos (captured by head-mounted mobile cameras) are visually different from Ego4D/EPIC-Kitchens videos (captured by professional action cameras, e.g., GoPro), pre-training on Ego4D videos for longer may lead to a domain discrepancy.

^^ We use the checkpoints released by EgoVLP and convert them to be compatible with this codebase. Also note that our reproduced numbers are better than the reported numbers, especially on EK-100 MIR, because we evaluate on raw videos directly (for more details, see Appendix F and Table 10 in our paper).

1. EK-100 MIR

python eval_zeroshot.py --dataset ek100_mir --root datasets/EK100/video_ht256px/ --clip-length 4 --resume $PATH

Increasing the number of frames per clip, e.g., --clip-length 16, is expected to improve performance.
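To compare clip lengths side by side, the command above can be wrapped in a small loop. This is a minimal sketch: `sweep_clip_length` and the `DRY_RUN` variable are hypothetical names introduced here, and the function only prints the commands unless `DRY_RUN=0` is set.

```shell
# Hypothetical helper: run the EK-100 MIR zero-shot evaluation for several clip lengths.
# Dry-run by default: prints one command per clip length instead of running it.
# Set DRY_RUN=0 to actually launch the evaluations.
sweep_clip_length() {
    scl_ckpt="$1"
    shift
    for scl_n in "$@"; do
        # Same command as above, with only --clip-length varied
        scl_cmd="python eval_zeroshot.py --dataset ek100_mir --root datasets/EK100/video_ht256px/ --clip-length $scl_n --resume $scl_ckpt"
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "$scl_cmd"
        else
            $scl_cmd   # relies on word splitting; assumes paths without spaces
        fi
    done
}
```

A call such as `sweep_clip_length "$CKPT" 4 16` prints the two commands, where `CKPT` is a hypothetical variable holding the checkpoint path written as $PATH above.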

2. EK-100 CLS

python eval_zeroshot.py --dataset ek100_cls --metadata-val datasets/EK100/epic-kitchens-100-annotations/EPIC_100_validation.csv --resume $PATH

3. Charades-Ego

python eval_zeroshot.py --dataset charades_ego --metadata-val datasets/CharadesEgo/CharadesEgo/CharadesEgo_v1_test_only1st.csv --root datasets/CharadesEgo/CharadesEgo_v1_480/ --clip-length 16 --sparse-sample --resume $PATH

4. EGTEA

python eval_zeroshot.py --dataset egtea --metadata-val datasets/EGTEA/test_split1.txt --root datasets/EGTEA/cropped_clips/ --clip-length 16 --clip-stride 2 --num-crops 3 --num-clips 10 --resume $PATH

5. EgoMCQ

python eval_zeroshot.py --dataset ego4d_mcq --metadata-val datasets/Ego4D/egomcq.json --root datasets/Ego4D/video_5min_chunks_288px/ --clip-length 4 --resume $PATH --use-half -j 4

Fine-tuned

EK-100 MIR

	Backbone	avg mAP	avg nDCG	Pretrain (md5)	Fine-tuned checkpoint	training log
LAVILA	TSF-B	50.5	65.0	download (d73a9c)	download	download
LAVILA	TSF-L	50.9	66.5	download (c89337)	download	download

Training and evaluating scripts

Multi-node training (Slurm)

# TimeSformer-Base
python run_with_submitit_finetune_retrieval.py \
    --pretrain-model $PATH \
    --use-checkpoint --nodes 4

# TimeSformer-Large
python run_with_submitit_finetune_retrieval.py \
    --pretrain-model $PATH \
    --batch-size 4 \
    --use-checkpoint --nodes 4

Single-machine training

torchrun --nproc_per_node=8 \
    main_finetune_retrieval.py \
    --output-dir $OUT_DIR \
    --pretrain-model $PATH \
    --use-checkpoint

Note that you might see a slight drop in performance when training on a single node compared to multiple nodes (everything else being the same) because of the smaller total batch size.
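The batch-size difference can be made concrete with a little arithmetic. The sketch below assumes that --batch-size is a per-GPU value and that every node has 8 GPUs (matching --nproc_per_node=8); the per-GPU value of 16 is purely illustrative and is not claimed to be the script default, and `total_batch_size` is a hypothetical helper.

```shell
# Total batch size = per-GPU batch size x GPUs per node x number of nodes.
# Assumption: --batch-size is per GPU; 16 is an illustrative value only.
total_batch_size() {
    echo $(( $1 * $2 * $3 ))
}

total_batch_size 16 8 4   # 4 nodes (multi-node setting)
total_batch_size 16 8 1   # 1 node (single-machine setting)
```

Under these assumptions, the single-machine run sees a total batch a quarter the size of the 4-node run, which is the gap the note above refers to.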

Evaluation

By default, evaluation runs every 5 epochs during fine-tuning (--eval-freq 5). To evaluate a checkpoint after fine-tuning, switch to --evaluate mode and pass the checkpoint path via --resume $FINETUNED_CHECKPOINT.

torchrun --nproc_per_node=1 \
    main_finetune_retrieval.py \
    --output-dir $OUT_DIR \
    --pretrain-model $PATH \
    --use-checkpoint \
    --evaluate \
    --resume $FINETUNED_CHECKPOINT

CharadesEgo

	Backbone	video mAP	Pretrain^ (md5)	Fine-tuned checkpoint	training log
LAVILA	TSF-B	33.7	download (02dbb9)	download	download
LAVILA	TSF-L	36.1	download (9a25de)	download	download

^ Note that the pre-trained checkpoint for fine-tuning CharadesEgo is different from the one for fine-tuning EK-100 or EGTEA, for the same reason as stated in the zero-shot section above.

Training and evaluating scripts

Multi-node training (Slurm)

# TimeSformer-Base
python run_with_submitit_finetune_retrieval.py \
    --dataset charades_ego \
    --metadata datasets/CharadesEgo/CharadesEgo/metadata_filtered_train.pkl \
    --metadata-val datasets/CharadesEgo/CharadesEgo/CharadesEgo_v1_test_only1st.csv \
    --root datasets/CharadesEgo/CharadesEgo_v1_480/ \
    --epochs 10 \
    --save-freq 1 --eval-freq 1 \
    --sparse-sample \
    --pretrain-model $PATH \
    --use-checkpoint --nodes 4

# TimeSformer-Large
python run_with_submitit_finetune_retrieval.py \
    --dataset charades_ego \
    --metadata datasets/CharadesEgo/CharadesEgo/metadata_filtered_train.pkl \
    --metadata-val datasets/CharadesEgo/CharadesEgo/CharadesEgo_v1_test_only1st.csv \
    --root datasets/CharadesEgo/CharadesEgo_v1_480/ \
    --epochs 10 \
    --save-freq 1 --eval-freq 1 \
    --sparse-sample \
    --pretrain-model $PATH \
    --batch-size 4 \
    --use-checkpoint --nodes 4

Evaluation

torchrun --nproc_per_node=1 \
    main_finetune_retrieval.py \
    --dataset charades_ego \
    --metadata datasets/CharadesEgo/CharadesEgo/metadata_filtered_train.pkl \
    --metadata-val datasets/CharadesEgo/CharadesEgo/CharadesEgo_v1_test_only1st.csv \
    --root datasets/CharadesEgo/CharadesEgo_v1_480/ \
    --output-dir $OUT_DIR \
    --sparse-sample \
    --pretrain-model $PATH \
    --evaluate \
    --resume $FINETUNED_CHECKPOINT

EK-100 CLS

	Backbone	V+N+A multi-head	Verb top-1	Noun top-1	Action top-1	Pretrain (md5)	Fine-tuned checkpoint	training log
LAVILA	TSF-B	no	67.7	56.7	46.2	download (d73a9c)	download	download
LAVILA	TSF-B	yes	69.0	58.4	46.9	download (d73a9c)	download	download
LAVILA	TSF-L	yes	72.0	62.9	51.0	download (c89337)	download	download

Training and evaluating scripts

Multi-node training (Slurm)

# TimeSformer-Base
python run_with_submitit_finetune_classification.py \
    --pretrain-model $PATH \
    --use-vn-classifier --num-classes 97 300 3806 \
    --use-sgd --wd 4e-5 --lr-multiplier-on-backbone 0.1 \
    --use-checkpoint --node 1

# TimeSformer-Large
python run_with_submitit_finetune_classification.py \
    --pretrain-model $PATH \
    --use-vn-classifier --num-classes 97 300 3806 \
    --use-sgd --wd 4e-5 --lr-multiplier-on-backbone 0.1 \
    --use-checkpoint --node 4

EGTEA

	Backbone	mean Acc.	Pretrain (md5)	Fine-tuned checkpoint	training log
LAVILA	TSF-B	70.12	download (d73a9c)	download	download
LAVILA	TSF-L	76.00	download (c89337)	download	download

Training and evaluating scripts

# TimeSformer-Base
python run_with_submitit_finetune_classification.py \
    --dataset egtea \
    --metadata-train datasets/EGTEA/train_split1.txt \
    --metadata-val datasets/EGTEA/test_split1.txt \
    --root datasets/EGTEA/cropped_clips/ \
    --pretrain-model $PATH \
    --num-classes 106 \
    --use-sgd --wd 4e-5 \
    --use-checkpoint --node 1

# TimeSformer-Large
python run_with_submitit_finetune_classification.py \
    --dataset egtea \
    --metadata-train datasets/EGTEA/train_split1.txt \
    --metadata-val datasets/EGTEA/test_split1.txt \
    --root datasets/EGTEA/cropped_clips/ \
    --pretrain-model $PATH \
    --num-classes 106 \
    --use-sgd --wd 4e-5 \
    --batch-size 4 \
    --use-checkpoint --node 4

Evaluation

torchrun --nproc_per_node=1 \
    main_finetune_classification.py \
    --dataset egtea \
    --metadata-train datasets/EGTEA/train_split1.txt \
    --metadata-val datasets/EGTEA/test_split1.txt \
    --root datasets/EGTEA/cropped_clips/ \
    --output-dir $OUT_DIR \
    --pretrain-model $PATH \
    --num-classes 106 \
    --use-sgd --wd 4e-5 \
    --evaluate \
    --resume $FINETUNED_CHECKPOINT \
    --num-crops 3 --num-clips 10 \
    --use-half