# Kangaroo: A Powerful Video-Language Model Supporting Long-context Video Input

## Release
- [2024/07/17] 🔥 **Kangaroo** has been released. We have released the [blog](https://kangaroogroup.github.io/Kangaroo.github.io/) and the [model](https://huggingface.co/KangarooGroup/kangaroo); please check out the blog for details.
## Abstract
We introduce <strong>Kangaroo</strong>, a powerful Multimodal Large Language Model designed for long-context video understanding. Kangaroo shows remarkable performance across diverse video understanding tasks, including video captioning, QA, and conversation. Our key contributions can be summarized as follows:
<ol>
<li><strong>Long-context Video Input.</strong> To enhance the model's capability to comprehend longer videos, we extend the maximum number of input frames to 160. To this end, we aggregate multiple videos with variable frame counts and aspect ratios into one sample. We further design a spatial-temporal patchify module to improve training efficiency (a toy sketch follows this list).</li>
<li><strong>Strong Performance.</strong> We evaluate our model across various video understanding benchmarks. The results indicate that our model achieves state-of-the-art performance on the majority of comprehensive benchmarks and remains competitive on the others. Notably, it outperforms most larger open-source models with over 30B parameters, as well as some proprietary models, on certain benchmarks.</li>
<li><strong>Video Annotation System.</strong> We develop a data curation and automatic annotation system to generate captions for open-source and internal videos. The resulting large-scale dataset is used for video-text pre-training. For the video instruction tuning stage, we construct a video instruction tuning dataset based on public and internal datasets covering a variety of tasks.</li>
<li><strong>Bilingual Conversation.</strong> Our model supports Chinese, English, and bilingual conversation, in both single-round and multi-round paradigms.</li>
</ol>
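
As a concrete (and deliberately toy) illustration of item 1, the sketch below uniformly samples a clip down to at most 160 frames, splits it into spatial-temporal patches, and packs clips of different shapes into one sample. The patch sizes and the plain-reshape tokenization here are our assumptions for illustration, not Kangaroo's actual module:

```python
import torch

MAX_FRAMES = 160  # maximum number of input frames, per the abstract

def sample_frames(video: torch.Tensor, max_frames: int = MAX_FRAMES) -> torch.Tensor:
    """Uniformly sample at most `max_frames` frames from a (T, C, H, W) clip."""
    t = video.shape[0]
    if t <= max_frames:
        return video
    idx = torch.linspace(0, t - 1, max_frames).round().long()
    return video[idx]

def patchify(video: torch.Tensor, pt: int = 2, ph: int = 14, pw: int = 14) -> torch.Tensor:
    """Split a (T, C, H, W) clip into rows of flattened (pt x ph x pw) patches."""
    T, C, H, W = video.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    x = video.reshape(T // pt, pt, C, H // ph, ph, W // pw, pw)
    x = x.permute(0, 3, 5, 1, 4, 6, 2)        # (T/pt, H/ph, W/pw, pt, ph, pw, C)
    return x.reshape(-1, pt * ph * pw * C)    # one row per spatio-temporal patch

frames = sample_frames(torch.randn(320, 3, 224, 224))  # -> (160, 3, 224, 224)
tokens = patchify(frames)                              # -> (20480, 1176)

# Aggregating clips with variable frame counts / aspect ratios into one sample
# can then be plain concatenation of their patch rows, with recorded boundaries:
clips = [torch.randn(32, 3, 224, 224), torch.randn(64, 3, 112, 112)]
seqs = [patchify(c) for c in clips]
packed = torch.cat(seqs)                               # one packed sample
boundaries = torch.tensor([0] + [len(s) for s in seqs]).cumsum(0)
```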

## Quick Start

### Requirements
- python == 3.9
- torch == 2.1.0, torchvision == 0.16.0
- CUDA == 12.1 (for GPU)
- transformers == 4.41.0
- xformers == 0.0.23
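
A quick way to confirm your environment matches the pins above (an illustrative snippet, not part of the original README):

```python
import torch, torchvision, transformers, xformers

# Expected versions from the Requirements list above.
expected = {
    "torch": "2.1.0",
    "torchvision": "0.16.0",
    "transformers": "4.41.0",
    "xformers": "0.0.23",
}
installed = {
    "torch": torch.__version__,
    "torchvision": torchvision.__version__,
    "transformers": transformers.__version__,
    "xformers": xformers.__version__,
}
for name, want in expected.items():
    got = installed[name]
    status = "OK" if got.startswith(want) else f"expected {want}"
    print(f"{name}: {got} ({status})")
print("CUDA available:", torch.cuda.is_available(), "| CUDA:", torch.version.cuda)
```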

### Multi-round Chat with 🤗 Transformers
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# The loading lines are elided in the source; this reconstruction assumes the
# Hugging Face repo id linked above and typical half-precision settings.
model_path = "KangarooGroup/kangaroo"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,  # assumed dtype
    trust_remote_code=True,      # needed for the custom chat() method
)
model = model.to("cuda")
terminators = [tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids("<|eot_id|>")]

# Round 1
video_path = "path/to/video"
query = "Please describe this video"
out, history = model.chat(video_path=video_path,
                          query=query,
                          tokenizer=tokenizer,       # these three kwargs are elided
                          max_new_tokens=512,        # in the source and reconstructed
                          eos_token_id=terminators,  # as plausible values
                          do_sample=True,
                          temperature=0.6,
                          top_p=0.9,)
print('Assistant: ', out)

# Round 2: feed the returned history back in to continue the conversation
query = "What happened at the end of the video?"
out, history = model.chat(video_path=video_path,
                          query=query,
                          history=history,           # assumed; elided in the source
                          tokenizer=tokenizer,
                          max_new_tokens=512,
                          eos_token_id=terminators,
                          do_sample=True,
                          temperature=0.6,
                          top_p=0.9,)
print('Assistant: ', out)
```

## Citation

If you find it useful for your research, please cite related papers/blogs using this BibTeX:
```bibtex
@misc{kangaroogroup,
      title={Kangaroo: A Powerful Video-Language Model Supporting Long-context Video Input},
      url={https://kangaroogroup.github.io/Kangaroo.github.io/},
      author={Jiajun Liu and Yibing Wang and Hanghang Ma and Xiaoping Wu and Xiaoqi Ma and Jie Hu},
      month={July},
      year={2024}
}
```