omni-research committed
Commit b672c01
1 Parent(s): 1eca601

Update README.md

Files changed (1)
  1. README.md +44 -51
README.md CHANGED
@@ -1,51 +1,44 @@
- ---
- license: llama2
- ---
-
- # Tarsier Model Card
- ## Model details
- **Model type:**
- Tarsier-7b is a member of the Tarsier family, an open-source family of large-scale video-language models designed to generate high-quality video descriptions while also offering strong general video understanding (Tarsier-34b achieves SOTA results on 6 open benchmarks). Base LLM: [liuhaotian/llava-v1.6-vicuna-7b](https://huggingface.co/liuhaotian/llava-v1.6-vicuna-7b)
-
- **Model date:**
- Tarsier-7b was trained in June 2024.
-
- **Paper or resources for more information:**
- - GitHub repo: https://github.com/bytedance/tarsier
- - Paper: https://arxiv.org/abs/2407.00634
-
- ## License
- lmsys/vicuna-7b-v1.5 license.
-
- **Where to send questions or comments about the model:**
- https://github.com/bytedance/tarsier/issues
-
- ## Intended use
- **Primary intended uses:**
- The primary use of Tarsier is research on large multimodal models, especially video description.
-
- **Primary intended users:**
- The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.
-
- ## Training dataset
- Tarsier adopts a two-stage training strategy.
- 1. Stage-1: Multi-task Pre-training
-
- In stage-1, we trained our model on:
- - 10M samples from diverse public datasets, covering tasks such as video captioning, video question answering, action recognition, multi-image understanding, and text generation.
- - 3.5M in-house samples, including 2.4M high-quality video-caption samples similar to WebVid and 1.1M videos with object-tracking annotations (videos from WebVid and HD-VILA processed with the object-tracking tool [DEVA](https://github.com/hkchengrex/Tracking-Anything-with-DEVA)).
- 2. Stage-2: Multi-grained Instruction Tuning
-
- In stage-2, we used 500K in-house instruction-tuning samples, including:
- - Movie clips featuring multiple shots, subjects, or events, for which annotators provided descriptions varying in length and detail, from brief motion summaries to comprehensive narratives of visual details.
- - A dataset rich in camera motions, including zooming, translating, panning, and rotating.
- - Video-aware creative writing, such as poems, dialogues, and speeches.
-
- ## Evaluation dataset
- - A challenging video description dataset: [DREAM-1K](https://huggingface.co/datasets/omni-research/DREAM-1K)
- - Multi-choice VQA: [MVBench](https://huggingface.co/datasets/OpenGVLab/MVBench), [NExT-QA](https://github.com/doc-doc/NExT-QA) and [EgoSchema](https://drive.google.com/drive/folders/1SS0VVz8rML1e5gWq7D7VtP1oxE2UtmhQ)
- - Open-ended VQA: [MSVD-QA](https://opendatalab.com/OpenDataLab/MSVD), [MSR-VTT-QA](https://opendatalab.com/OpenDataLab/MSR-VTT), [ActivityNet-QA](https://github.com/MILVLG/activitynet-qa) and [TGIF-QA](https://opendatalab.com/OpenDataLab/TGIF-QA)
- - Video Caption: [MSVD-Caption](https://opendatalab.com/OpenDataLab/MSVD), [MSRVTT-Caption](https://opendatalab.com/OpenDataLab/MSR-VTT), [VATEX](https://eric-xw.github.io/vatex-website/about.html)
-
- ## How to Use
- See https://github.com/bytedance/tarsier?tab=readme-ov-file#usage
 
+ ---
+ license: llama2
+ ---
+
+ # Tarsier Model Card
+ ## Model details
+ **Model type:**
+ Tarsier-7b is a member of the Tarsier family, an open-source family of large-scale video-language models designed to generate high-quality video descriptions while also offering strong general video understanding (Tarsier-34b achieves SOTA results on 6 open benchmarks). Base LLM: [liuhaotian/llava-v1.6-vicuna-7b](https://huggingface.co/liuhaotian/llava-v1.6-vicuna-7b)
+
+ **Model date:**
+ Tarsier-7b was trained in June 2024.
+
+ **Paper or resources for more information:**
+ - GitHub repo: https://github.com/bytedance/tarsier
+ - Paper: https://arxiv.org/abs/2407.00634
+
+ ## License
+ lmsys/vicuna-7b-v1.5 license.
+
+ **Where to send questions or comments about the model:**
+ https://github.com/bytedance/tarsier/issues
+
+ ## Intended use
+ **Primary intended uses:**
+ The primary use of Tarsier is research on large multimodal models, especially video description.
+
+ **Primary intended users:**
+ The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.
+
+ ## Training dataset
+ Tarsier adopts a two-stage training strategy.
+ - Stage-1: Multi-task Pre-training on 13M samples
+ - Stage-2: Multi-grained Instruction Tuning on 500K samples
+
+ In both stages, we freeze the ViT and train all parameters of the projection layer and the LLM.
+
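To make the freezing scheme above concrete, here is a minimal, hypothetical PyTorch sketch. It is not code from the Tarsier repository; `vision_tower`, `projector`, and `language_model` are illustrative placeholders for the ViT, projection layer, and LLM.

```python
import torch
from torch import nn

# Illustrative stand-in for a Tarsier-style model: a ViT encoder, a projection
# layer, and an LLM. Real module names and shapes differ; this only shows the
# freezing scheme described above.
class ToyVideoLLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.vision_tower = nn.Linear(768, 768)       # placeholder ViT
        self.projector = nn.Linear(768, 4096)         # placeholder projection layer
        self.language_model = nn.Linear(4096, 32000)  # placeholder LLM

    def forward(self, pixels):
        return self.language_model(self.projector(self.vision_tower(pixels)))

model = ToyVideoLLM()

# Freeze the ViT; keep the projection layer and LLM trainable (both stages).
for p in model.vision_tower.parameters():
    p.requires_grad = False

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-5)
print(sum(p.numel() for p in trainable), "trainable parameters")
```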
+ ## Evaluation dataset
+ - A challenging video description dataset: [DREAM-1K](https://huggingface.co/datasets/omni-research/DREAM-1K)
+ - Multi-choice VQA: [MVBench](https://huggingface.co/datasets/OpenGVLab/MVBench), [NExT-QA](https://github.com/doc-doc/NExT-QA) and [EgoSchema](https://drive.google.com/drive/folders/1SS0VVz8rML1e5gWq7D7VtP1oxE2UtmhQ)
+ - Open-ended VQA: [MSVD-QA](https://opendatalab.com/OpenDataLab/MSVD), [MSR-VTT-QA](https://opendatalab.com/OpenDataLab/MSR-VTT), [ActivityNet-QA](https://github.com/MILVLG/activitynet-qa) and [TGIF-QA](https://opendatalab.com/OpenDataLab/TGIF-QA)
+ - Video Caption: [MSVD-Caption](https://opendatalab.com/OpenDataLab/MSVD), [MSRVTT-Caption](https://opendatalab.com/OpenDataLab/MSR-VTT), [VATEX](https://eric-xw.github.io/vatex-website/about.html)
+
+ ## How to Use
+ See https://github.com/bytedance/tarsier?tab=readme-ov-file#usage
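For orientation only, the sketch below shows one plausible way to query the checkpoint. It is not taken from the repository; it assumes the released weights are compatible with the stock Hugging Face Transformers LLaVA classes and a vicuna-style chat prompt, which may not hold. The usage guide linked above is authoritative, and the real pipeline samples multiple frames per video, whereas this sketch feeds a single placeholder frame.

```python
# Hypothetical sketch only: assumes omni-research/Tarsier-7b loads with the
# stock Transformers LLaVA classes; follow the repo's usage guide for the
# supported pipeline (multi-frame sampling, prompt format, etc.).
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "omni-research/Tarsier-7b"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Single placeholder frame; the real pipeline extracts several frames per video.
frame = Image.new("RGB", (336, 336))
prompt = "USER: <image>\nDescribe the video in detail. ASSISTANT:"
inputs = processor(text=prompt, images=frame, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```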