Evaluation Preprocessing
MSR-VTT
Download the bounding box annotations for MSR-VTT from here. This is a pickle file containing a dictionary; each element holds the video id, the caption, the subject of the caption, and a sequence of bounding boxes. These were generated using get_fg_obj.py.
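To sanity-check the download, the pickle can be inspected directly. This is a minimal sketch; the filename and the exact key names are assumptions, so print one element to see the real schema:

```python
import pickle

# Filename is an assumption; point this at the downloaded annotation file.
with open("msrvtt_bbox_annotations.pkl", "rb") as f:
    annotations = pickle.load(f)

# Each element holds the video id, caption, caption subject, and a sequence of
# bounding boxes; print one element to see the actual key names.
first_key = next(iter(annotations))
print(first_key, annotations[first_key])
```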
You can also download the MSR-VTT videos from this link. The StyleGAN-V repo is used to pre-process the dataset and convert the videos into frames.
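As a rough stand-in for that conversion step (not the StyleGAN-V script itself; defer to that repo for the exact layout it expects), frames can be extracted with ffmpeg. The paths below are assumptions:

```python
# Rough stand-in for the frame-conversion step; paths are assumptions.
import subprocess
from pathlib import Path

VIDEO_DIR = Path("msrvtt/videos")  # where the downloaded .mp4 files live
FRAME_DIR = Path("msrvtt/frames")  # one sub-directory of frames per video

for video in sorted(VIDEO_DIR.glob("*.mp4")):
    out_dir = FRAME_DIR / video.stem
    out_dir.mkdir(parents=True, exist_ok=True)
    # -q:v 2 keeps JPEG quality high; %06d zero-pads the frame index
    subprocess.run(
        ["ffmpeg", "-i", str(video), "-q:v", "2", str(out_dir / "%06d.jpg")],
        check=True,
    )
```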
Pre-processing
Our pre-processing pipeline is described here. We first extract the subject of the caption using spaCy. This subject is then fed into OWL-ViT to obtain bounding boxes. If a subject yields zero bounding boxes, we use the next caption from the dataset. If there is at least one bounding box, we linearly interpolate bounding boxes for the missing frames, as sketched below.
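The following is a minimal sketch of these steps, not the repo's get_fg_obj.py itself; the checkpoint name, detection threshold, and helper names are assumptions for illustration:

```python
# Minimal sketch of the MSR-VTT pre-processing steps; checkpoint, threshold,
# and helper names are assumptions, not the repo's actual implementation.
import numpy as np
import spacy
import torch
from PIL import Image
from transformers import OwlViTForObjectDetection, OwlViTProcessor

nlp = spacy.load("en_core_web_sm")
processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

def caption_subject(caption: str) -> str | None:
    """Return the first nominal subject spaCy finds in the caption, if any."""
    for token in nlp(caption):
        if token.dep_ in ("nsubj", "nsubjpass"):
            return token.text
    return None

def detect_boxes(frame: Image.Image, query: str, threshold: float = 0.1) -> torch.Tensor:
    """Run OWL-ViT on one frame with the caption subject as the text query."""
    inputs = processor(text=[[query]], images=frame, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    target_sizes = torch.tensor([frame.size[::-1]])  # (height, width)
    results = processor.post_process_object_detection(
        outputs, threshold=threshold, target_sizes=target_sizes
    )[0]
    return results["boxes"]  # (N, 4) xyxy boxes in pixel coordinates

def interpolate_boxes(detections: dict[int, np.ndarray], num_frames: int) -> np.ndarray:
    """Linearly interpolate each box coordinate over frames with no detection."""
    frames = sorted(detections)                       # frame indices with a detection
    known = np.stack([detections[i] for i in frames])
    out = np.empty((num_frames, 4))
    for c in range(4):  # x1, y1, x2, y2 interpolated independently
        out[:, c] = np.interp(np.arange(num_frames), frames, known[:, c])
    return out
```

If caption_subject returns None or the detector finds nothing on every frame, the pipeline falls back to the next caption, as described above.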
ssv2-ST
Similar pre-processing is done for this dataset, except that a larger OWL-ViT model is used and the first noun chunk is extracted instead of the subject. The larger model significantly slows down the pre-processing. Downloading the dataset is a bit involved; follow the instructions here. Download the dataset and run generate_ssv2_st.py.
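The two differences amount to swapping in a larger checkpoint and a different text-query extractor. A hedged sketch, where the large checkpoint name is an assumption and generate_ssv2_st.py remains the reference implementation:

```python
# Sketch of the two deviations from the MSR-VTT pipeline; the large checkpoint
# name is an assumption, and generate_ssv2_st.py is the reference.
import spacy
from transformers import OwlViTForObjectDetection, OwlViTProcessor

nlp = spacy.load("en_core_web_sm")
processor = OwlViTProcessor.from_pretrained("google/owlvit-large-patch14")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-large-patch14")

def first_noun_chunk(caption: str) -> str | None:
    """Use the first noun chunk as the OWL-ViT query instead of the subject."""
    chunks = list(nlp(caption).noun_chunks)
    return chunks[0].text if chunks else None
```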
Interactive Motion Control - IMC
We generate bounding boxes for this dataset using the generate_imc.py file. The prompts are in custom_prompts.csv and filtered_prompts.csv.
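The column layout of the prompt files is not documented here, so it is worth inspecting them before wiring anything up. A small sketch that assumes nothing about the column names:

```python
import csv

# Print each file's header before relying on any column name.
for path in ("custom_prompts.csv", "filtered_prompts.csv"):
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        print(path, reader.fieldnames)
```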