---
license: apache-2.0
language:
- en
- zh
pipeline_tag: video-text-to-text
---
# Kangaroo: A Powerful Video-Language Model Supporting Long-context Video Input
**Kangaroo** has been released. Please check out our [paper](https://arxiv.org/pdf/2408.15542), [blog](https://kangaroogroup.github.io/Kangaroo.github.io/) and [GitHub repository](https://github.com/KangarooGroup/Kangaroo) for details.
## Abstract
We introduce <strong>Kangaroo</strong>, a powerful Multimodal Large Language Model designed for long-context video understanding. Kangaroo shows remarkable performance across diverse video understanding tasks, including video captioning, question answering, and conversation. Our key contributions are summarized as follows:
<ol>
<li><strong>Long-context Video Input.</strong> To enhance the model's capability to comprehend longer videos, we extend the maximum number of input frames per video to 160. To this end, we aggregate multiple videos with variable frame counts and aspect ratios into one sample. We further design a spatial-temporal patchify module to improve training efficiency.</li>
<li><strong>Strong Performance.</strong> We evaluate our model across various video understanding benchmarks. The results indicate that our model achieves state-of-the-art performance on the majority of comprehensive benchmarks and maintains a competitive level on the others. Notably, on certain benchmarks our model outperforms most larger open-source models with over 30B parameters, as well as some proprietary models.</li>
<li><strong>Video Annotation System.</strong> We develop a data curation and automatic annotation system to generate captions for open-source and internal videos. The resulting large-scale dataset is used for video-text pre-training. For the video instruction tuning stage, we construct a video instruction tuning dataset based on public and internal datasets covering a variety of tasks.</li>
<li><strong>Bilingual Conversation.</strong> Our model can converse in Chinese, in English, and bilingually, and supports both single-round and multi-round conversation paradigms.
</li>
</ol>
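To make the 160-frame limit concrete, here is a minimal sketch of capping a video at that many frames. The model card does not say how frames are selected, so evenly spaced sampling is an assumption made for illustration, and `sample_frame_indices` is a hypothetical helper, not part of the Kangaroo codebase.

```python
MAX_FRAMES = 160  # maximum number of input frames quoted in the abstract


def sample_frame_indices(total_frames: int, max_frames: int = MAX_FRAMES) -> list[int]:
    """Return at most `max_frames` evenly spaced frame indices.

    Assumption: uniform sampling. The actual Kangaroo preprocessing
    may choose frames differently.
    """
    if total_frames <= 0:
        return []
    if total_frames <= max_frames:
        # Short videos keep every frame.
        return list(range(total_frames))
    # Split the video into `max_frames` equal-width segments and
    # take the frame at the centre of each segment.
    step = total_frames / max_frames
    return [int(step * i + step / 2) for i in range(max_frames)]
```

A short clip keeps all of its frames, while a longer one is reduced to 160 indices spread across its full duration.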
## Quick Start
### Installation
See our [GitHub page](https://github.com/KangarooGroup/Kangaroo) for installation instructions.
### Multi-round Chat with 🤗 Transformers
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and the model; the model ships custom code,
# hence trust_remote_code=True.
tokenizer = AutoTokenizer.from_pretrained("KangarooGroup/kangaroo")
model = AutoModelForCausalLM.from_pretrained(
    "KangarooGroup/kangaroo",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
model = model.to("cuda")

# Stop generation at either the EOS token or the end-of-turn token.
terminators = [tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids("<|eot_id|>")]
video_path = "/path/to/video"

# Round 1: no history yet.
query = "Give a brief description of the video."
out, history = model.chat(
    video_path=video_path,
    query=query,
    tokenizer=tokenizer,
    max_new_tokens=512,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
print('Assistant: \n', out)

# Round 2: pass the history returned by round 1.
query = "What happened at the end of the video?"
out, history = model.chat(
    video_path=video_path,
    query=query,
    history=history,
    tokenizer=tokenizer,
    max_new_tokens=512,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
print('Assistant: \n', out)
```
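For more than two rounds, the same pattern can be wrapped in a loop. The sketch below assumes only the `model.chat(...)` call shape used in the example above, which returns `(out, history)`. `chat_rounds` is a hypothetical convenience helper, not something provided by the Kangaroo repository.

```python
def chat_rounds(model, tokenizer, video_path, queries, **gen_kwargs):
    """Ask `queries` in order about one video, threading history between rounds.

    Hypothetical helper: relies only on `model.chat` returning
    `(out, history)` as in the example above.
    """
    history = None
    answers = []
    for query in queries:
        kwargs = dict(video_path=video_path, query=query,
                      tokenizer=tokenizer, **gen_kwargs)
        if history is not None:
            # From the second round on, pass the previous round's history.
            kwargs["history"] = history
        out, history = model.chat(**kwargs)
        answers.append(out)
    return answers, history
```

With the objects from the example above, a call might look like `chat_rounds(model, tokenizer, video_path, ["Give a brief description of the video.", "What happened at the end of the video?"], max_new_tokens=512, eos_token_id=terminators, do_sample=True, temperature=0.6, top_p=0.9)`.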
## Citation
If you find Kangaroo useful for your research, please cite the related paper/blog using this BibTeX:
```bibtex
@misc{kangaroogroup,
title={Kangaroo: A Powerful Video-Language Model Supporting Long-context Video Input},
url={https://kangaroogroup.github.io/Kangaroo.github.io/},
author={Jiajun Liu and Yibing Wang and Hanghang Ma and Xiaoping Wu and Xiaoqi Ma and Jie Hu},
month={July},
year={2024}
}
```