Evaluation Preprocessing
MSR-VTT
Download the bounding box annotations for MSR-VTT from here. This is a pickle file containing a dictionary; each element holds the video id, the caption, the subject of the caption, and a sequence of bounding boxes. These were generated using get_fg_obj.py.
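To sanity-check the download, the pickle can be inspected directly. This is a minimal sketch; the filename and the exact key names are assumptions, so print one element to see the real schema:

```python
import pickle

# Filename is an assumption; point this at the downloaded annotation file.
with open("msrvtt_bbox_annotations.pkl", "rb") as f:
    annotations = pickle.load(f)

# Each element holds the video id, caption, caption subject, and a sequence of
# bounding boxes; print one element to see the actual key names.
first_key = next(iter(annotations))
print(first_key, annotations[first_key])
```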
You can also download the MSR-VTT videos from this link. The StyleGAN-V repo is used to pre-process the dataset and convert the videos into frames.
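As a rough stand-in for that conversion step (not the StyleGAN-V script itself; defer to that repo for the exact layout it expects), frames can be extracted with ffmpeg. The paths below are assumptions:

```python
# Rough stand-in for the frame-conversion step; paths are assumptions.
import subprocess
from pathlib import Path

VIDEO_DIR = Path("msrvtt/videos")  # where the downloaded .mp4 files live
FRAME_DIR = Path("msrvtt/frames")  # one sub-directory of frames per video

for video in sorted(VIDEO_DIR.glob("*.mp4")):
    out_dir = FRAME_DIR / video.stem
    out_dir.mkdir(parents=True, exist_ok=True)
    # -q:v 2 keeps JPEG quality high; %06d zero-pads the frame index
    subprocess.run(
        ["ffmpeg", "-i", str(video), "-q:v", "2", str(out_dir / "%06d.jpg")],
        check=True,
    )
```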
Pre-processing
Our pre-processing pipeline is described here. We first extract the subject of the caption using spaCy. This subject is then fed into OWL-ViT to obtain bounding boxes. If a subject yields zero bounding boxes, we use the next caption from the dataset. If there is at least one bounding box, we linearly interpolate bounding boxes for the missing frames, as sketched below.
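The following is a minimal sketch of these steps, not the repo's get_fg_obj.py itself; the checkpoint name, detection threshold, and helper names are assumptions for illustration:

```python
# Minimal sketch of the MSR-VTT pre-processing steps; checkpoint, threshold,
# and helper names are assumptions, not the repo's actual implementation.
import numpy as np
import spacy
import torch
from PIL import Image
from transformers import OwlViTForObjectDetection, OwlViTProcessor

nlp = spacy.load("en_core_web_sm")
processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

def caption_subject(caption: str) -> str | None:
    """Return the first nominal subject spaCy finds in the caption, if any."""
    for token in nlp(caption):
        if token.dep_ in ("nsubj", "nsubjpass"):
            return token.text
    return None

def detect_boxes(frame: Image.Image, query: str, threshold: float = 0.1) -> torch.Tensor:
    """Run OWL-ViT on one frame with the caption subject as the text query."""
    inputs = processor(text=[[query]], images=frame, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    target_sizes = torch.tensor([frame.size[::-1]])  # (height, width)
    results = processor.post_process_object_detection(
        outputs, threshold=threshold, target_sizes=target_sizes
    )[0]
    return results["boxes"]  # (N, 4) xyxy boxes in pixel coordinates

def interpolate_boxes(detections: dict[int, np.ndarray], num_frames: int) -> np.ndarray:
    """Linearly interpolate each box coordinate over frames with no detection."""
    frames = sorted(detections)                       # frame indices with a detection
    known = np.stack([detections[i] for i in frames])
    out = np.empty((num_frames, 4))
    for c in range(4):  # x1, y1, x2, y2 interpolated independently
        out[:, c] = np.interp(np.arange(num_frames), frames, known[:, c])
    return out
```

If caption_subject returns None or the detector finds nothing on every frame, the pipeline falls back to the next caption, as described above.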
ssv2-ST
Similar pre-processing is done for this dataset, except that a larger OWL-ViT model is used and the first noun chunk is extracted instead of the subject. The larger model significantly slows down the pre-processing. Downloading the dataset is a bit involved; follow the instructions here. Download the dataset and run generate_ssv2_st.py.
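The two differences amount to swapping in a larger checkpoint and a different text-query extractor. A hedged sketch, where the large checkpoint name is an assumption and generate_ssv2_st.py remains the reference implementation:

```python
# Sketch of the two deviations from the MSR-VTT pipeline; the large checkpoint
# name is an assumption, and generate_ssv2_st.py is the reference.
import spacy
from transformers import OwlViTForObjectDetection, OwlViTProcessor

nlp = spacy.load("en_core_web_sm")
processor = OwlViTProcessor.from_pretrained("google/owlvit-large-patch14")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-large-patch14")

def first_noun_chunk(caption: str) -> str | None:
    """Use the first noun chunk as the OWL-ViT query instead of the subject."""
    chunks = list(nlp(caption).noun_chunks)
    return chunks[0].text if chunks else None
```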
Interactive Motion Control - IMC
We generate bounding boxes for this dataset using the generate_imc.py file. The prompts are in custom_prompts.csv and filtered_prompts.csv.
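The column layout of the prompt files is not documented here, so it is worth inspecting them before wiring anything up. A small sketch that assumes nothing about the column names:

```python
import csv

# Print each file's header before relying on any column name.
for path in ("custom_prompts.csv", "filtered_prompts.csv"):
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        print(path, reader.fieldnames)
```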