# Dataset Preparation


# Stage 2: Video-Language Alignment


## Pretraining

The public portion of the pretraining data we use is listed below; a hypothetical sketch of how these corpora can be wired into a training config follows the list:
- [CC3M images](https://github.com/google-research-datasets/conceptual-captions)
- [CC12M images](https://github.com/google-research-datasets/conceptual-12m)
- [SBU images](https://www.cs.rice.edu/~vo9/sbucaptions/)
- [VG images](https://visualgenome.org/api/v0/api_home.html)
- [COCO images](https://cocodataset.org/#download)
- [WebVid videos](https://github.com/m-bain/webvid)
- [InternVid videos](https://github.com/OpenGVLab/InternVideo/tree/main/Data/InternVid)
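
After downloading, each corpus is typically referenced by the training config through an annotation file and a media root. The mapping below is only a minimal sketch of that wiring; the dict name, annotation filenames, and directories are assumptions, so match them to your local layout and to the config files in this repo.

```python
# Hypothetical layout: map each corpus to (annotation JSON, media root).
# All names and paths below are assumptions; adjust them to wherever you
# downloaded the data and to the training configs you use.
pretrain_corpora = {
    "cc3m":      ("anno_pretrain/cc3m_train.json",      "images/cc3m"),
    "cc12m":     ("anno_pretrain/cc12m_train.json",     "images/cc12m"),
    "sbu":       ("anno_pretrain/sbu_train.json",       "images/sbu"),
    "vg":        ("anno_pretrain/vg_train.json",        "images/vg"),
    "coco":      ("anno_pretrain/coco_train.json",      "images/coco"),
    "webvid":    ("anno_pretrain/webvid_train.json",    "videos/webvid"),
    "internvid": ("anno_pretrain/internvid_train.json", "videos/internvid"),
}
```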

## Evaluation

For evaluation, we follow [VINDLU](https://github.com/klauscc/VindLU/) to prepare the datasets, but we **DO NOT** compress the videos and images: we load the original media directly, together with the same **JSON** annotation files provided by [VINDLU](https://drive.google.com/drive/folders/12bC7WotvwyTG4pVvYeU4iZzmBLP1-6d9).
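
As a quick sanity check after downloading, you can load one of these JSON files and verify that every referenced media file exists on disk. The snippet below is a minimal sketch; the paths and the `video`/`caption` field names are assumptions about the annotation format, so adjust them to the files you actually use.

```python
import json
import os

# Hypothetical paths: replace with your annotation file and uncompressed media root.
anno_path = "anno_downstream/msrvtt_ret_test.json"
media_root = "videos/msrvtt"

with open(anno_path, "r") as f:
    annos = json.load(f)  # assumed to be a list of {"video": ..., "caption": ...} entries

missing = [
    a["video"]
    for a in annos
    if not os.path.exists(os.path.join(media_root, a["video"]))
]
print(f"{len(annos)} annotations, {len(missing)} missing media files")
```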


### Video-Text Retrieval

- [MSRVTT videos](https://www.robots.ox.ac.uk/~maxbain/frozen-in-time/data/MSRVTT.zip)
- [MSVD videos](https://www.cs.utexas.edu/users/ml/clamp/videoDescription/)
- [ActivityNet videos](http://activity-net.org/download.html)
- [DiDeMo videos](https://github.com/LisaAnne/LocalizingMoments)


# Stage 3: VideoChat

## Pretraining

- [VideoChat-IT](https://huggingface.co/datasets/OpenGVLab/VideoChat2-IT)


## Evaluation

### MVBench

Please refer to [MVBench](https://github.com/OpenGVLab/Ask-Anything/tree/main/video_chat2) for data preparation and evaluation instructions.