# Dataset Preparation


# Stage 2: Video-Language Alignment


## Pretraining

The public portion of the pretraining data we use is listed below; a hypothetical sketch of how these corpora can be wired into a training config follows the list:
- [CC3M images](https://github.com/google-research-datasets/conceptual-captions)
- [CC12M images](https://github.com/google-research-datasets/conceptual-12m)
- [SBU images](https://www.cs.rice.edu/~vo9/sbucaptions/)
- [VG images](https://visualgenome.org/api/v0/api_home.html)
- [COCO images](https://cocodataset.org/#download)
- [WebVid videos](https://github.com/m-bain/webvid)
- [InternVid videos](https://github.com/OpenGVLab/InternVideo/tree/main/Data/InternVid)
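
After downloading, each corpus is typically referenced by the training config through an annotation file and a media root. The mapping below is only a minimal sketch of that wiring; the dict name, annotation filenames, and directories are assumptions, so match them to your local layout and to the config files in this repo.

```python
# Hypothetical layout: map each corpus to (annotation JSON, media root).
# All names and paths below are assumptions; adjust them to wherever you
# downloaded the data and to the training configs you use.
pretrain_corpora = {
    "cc3m":      ("anno_pretrain/cc3m_train.json",      "images/cc3m"),
    "cc12m":     ("anno_pretrain/cc12m_train.json",     "images/cc12m"),
    "sbu":       ("anno_pretrain/sbu_train.json",       "images/sbu"),
    "vg":        ("anno_pretrain/vg_train.json",        "images/vg"),
    "coco":      ("anno_pretrain/coco_train.json",      "images/coco"),
    "webvid":    ("anno_pretrain/webvid_train.json",    "videos/webvid"),
    "internvid": ("anno_pretrain/internvid_train.json", "videos/internvid"),
}
```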

## Evaluation

For evaluation, we follow [VINDLU](https://github.com/klauscc/VindLU/) to prepare the datasets, but we **DO NOT** compress the videos and images: we load the original media directly, together with the same **JSON** annotation files provided by [VINDLU](https://drive.google.com/drive/folders/12bC7WotvwyTG4pVvYeU4iZzmBLP1-6d9).
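
As a quick sanity check after downloading, you can load one of these JSON files and verify that every referenced media file exists on disk. The snippet below is a minimal sketch; the paths and the `video`/`caption` field names are assumptions about the annotation format, so adjust them to the files you actually use.

```python
import json
import os

# Hypothetical paths: replace with your annotation file and uncompressed media root.
anno_path = "anno_downstream/msrvtt_ret_test.json"
media_root = "videos/msrvtt"

with open(anno_path, "r") as f:
    annos = json.load(f)  # assumed to be a list of {"video": ..., "caption": ...} entries

missing = [
    a["video"]
    for a in annos
    if not os.path.exists(os.path.join(media_root, a["video"]))
]
print(f"{len(annos)} annotations, {len(missing)} missing media files")
```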


### Video-Text Retrieval

- [MSRVTT videos](https://www.robots.ox.ac.uk/~maxbain/frozen-in-time/data/MSRVTT.zip)
- [MSVD videos](https://www.cs.utexas.edu/users/ml/clamp/videoDescription/)
- [ActivityNet videos](http://activity-net.org/download.html)
- [DiDeMo videos](https://github.com/LisaAnne/LocalizingMoments)


# Stage 3: VideoChat

## Pretraining

- [VideoChat-IT](https://huggingface.co/datasets/OpenGVLab/VideoChat2-IT)


## Evaluation

### MVBench

Please refer to [MVBench](https://github.com/OpenGVLab/Ask-Anything/tree/main/video_chat2) for data preparation and evaluation instructions.