File size: 4,077 Bytes

7301ccc
 
 
 
 
2d82477
388bd8b
 
 
8504fe2
388bd8b
7d6a6dd
 
 
 
 
 
 
 
 
388bd8b
7d6a6dd
 
88e47d9
a407a33
85d6546
7d6a6dd
0a3336d
 
 
 
3da7b6a
0a3336d
3da7b6a
0a3336d
 
 
 
 
c0ccd82
cc4b78f
0a3336d
7d6a6dd
9eeb018
0a3336d
 
 
 
 
 
 
 
cc4b78f
0a3336d
7d6a6dd
9eeb018
0a3336d
 
bd702ad
0a3336d
 
 
 
 
 
cc4b78f
0a3336d
 
7d6a6dd
388bd8b
 
 
 
7d6a6dd
388bd8b
 
 
 
 
2d82477

---
license: apache-2.0
language:
- en
- zh
pipeline_tag: video-text-to-text
---
# Kangaroo: A Powerful Video-Language Model Supporting Long-context Video Input

**Kangaroo** has been released. Please check out our [paper](https://arxiv.org/pdf/2408.15542), [blog](https://kangaroogroup.github.io/Kangaroo.github.io/) and [github](https://github.com/KangarooGroup/Kangaroo) for details.

## Abstract
We introduce <strong>Kangaroo</strong>, a powerful Multimodal Large Language Model designed for long-context video understanding. Our presented Kangaroo model shows remarkable performance across diverse video understanding tasks including video caption, QA and conversation. Generally, our key contributions in this work can be summarized as follows:
<ol>
    <li><strong>Long-context Video Input.</strong> To enhance the model's capability to comprehend longer videos, we extend the maximum frames of input videos to 160. To this end, we aggregate multiple videos with variable frame counts and aspect ratios into one sample. We further design a spatial-temporal pathify module to improve training efficiency.</li>
    <li><strong>Strong Performance.</strong> We evaluate our model across various video understanding benchmarks. The results indicate that our model achieves state-of-the-art performance on the majority of comprehensive benchmarks and maintain a competitive level in others. Notably, our model outperforms most larger open-source models with over 30B parameters and some proprietary models on certain benchmarks.</li>
    <li><strong>Video Annotation System.</strong> We develop a data curation and automatic annotation system to generate captions for open-source and internal videos. The generated large-scale dataset are utilized for video-text pre-training. For video instruction tuning stage, we construct a video instruciton tuning dataset based on public and internal datasets covering a variety of tasks.</li>
    <li><strong>Billingual Conversation.</strong> Our proposed model is equipped with the capability of Chinese, English and billingual conversations, and support single/multi-round conversation paradigms.
    </li>
</ol>

## Quick Start

### Installation
See our [github page](https://github.com/KangarooGroup/Kangaroo)

### Multi-round Chat with 🤗 Transformers
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("KangarooGroup/kangaroo")
model = AutoModelForCausalLM.from_pretrained(
    "KangarooGroup/kangaroo",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
model = model.to("cuda")
terminators = [tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids("<|eot_id|>")]

video_path = "/path/to/video"

# Round 1
query = "Give a brief description of the video."
out, history = model.chat(video_path=video_path,
                          query=query,
                          tokenizer=tokenizer,
                          max_new_tokens=512,
                          eos_token_id=terminators,
                          do_sample=True,
                          temperature=0.6,
                          top_p=0.9,)
print('Assitant: \n', out)

# Round 2
query = "What happend at the end of the video?"
out, history = model.chat(video_path=video_path,
                          query=query,
                          history=history,
                          tokenizer=tokenizer,
                          max_new_tokens=512,
                          eos_token_id=terminators,
                          do_sample=True,
                          temperature=0.6,
                          top_p=0.9,)
print('Assitant: \n', out)
```

## Citation

If you find it useful for your research , please cite related papers/blogs using this BibTeX:
```bibtex

@misc{kangaroogroup,
	title={Kangaroo: A Powerful Video-Language Model Supporting Long-context Video Input},
	url={https://kangaroogroup.github.io/Kangaroo.github.io/},
	author={Jiajun Liu and Yibing Wang and Hanghang Ma and Xiaoping Wu and Xiaoqi Ma and Jie Hu},
	month={July},
	year={2024}
}