base_model:
- lmms-lab/LongVA-7B
library_name: transformers
license: mit
---

<a href='https://arxiv.org/abs/'><img src='https://img.shields.io/badge/arXiv-paper-red'></a> <a href='https://ruili33.github.io/tpo_website.github.io/'><img src='https://img.shields.io/badge/project-TPO-blue'></a> <a href='https://huggingface.co/collections/ruili0/temporal-preference-optimization-67874b451f65db189fa35e10'><img src='https://img.shields.io/badge/huggingface-datasets-green'></a>
<a href='https://huggingface.co/collections/ruili0/temporal-preference-optimization-67874b451f65db189fa35e10'><img src='https://img.shields.io/badge/model-checkpoints-yellow'></a>
<a href='https://github.com/ruili33/TPO'><img src='https://img.shields.io/badge/github-repository-purple'></a>

<img src="cvpr_figure_TPO.png">

# LongVA-7B-TPO

LongVA-7B-TPO, introduced in the paper [Temporal Preference Optimization for Long-form Video Understanding](https://arxiv.org/abs), is built on LongVA-7B and optimized with temporal preference optimization (TPO). LongVA-7B-TPO establishes state-of-the-art performance across a range of video understanding benchmarks, with an average improvement of 2% over LongVA-7B.
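
Conceptually, TPO trains the model to prefer responses grounded in the relevant temporal segments of a video over responses produced from insufficient or misaligned temporal context. As a rough mental model only, the snippet below sketches a generic DPO-style preference loss over such chosen/rejected pairs; the function and tensor names are hypothetical and this is not the authors' implementation. See the paper and the [github repository](https://github.com/ruili33/TPO) for the actual objective and training recipe.

```python
import torch
import torch.nn.functional as F

def preference_loss(policy_chosen_logp, policy_rejected_logp,
                    ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Generic DPO-style loss (illustrative sketch, not the TPO codebase).

    Inputs are per-example summed log-probabilities with shape (batch,):
    the policy and a frozen reference model scored on a temporally grounded
    ("chosen") response and a contrastive ("rejected") response.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # Maximize the margin between chosen and rejected log-ratios.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Toy call with random log-probabilities, just to show the signature.
logps = [torch.randn(4) for _ in range(4)]
print(preference_loss(*logps))
```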

## Evaluation Results

| **Model** | **Size** | **LongVideoBench** | **MLVU** | **VideoMME (Short)** | **VideoMME (Medium)** | **VideoMME (Long)** | **VideoMME (Average)** |
|-----------|----------|--------------------|----------|----------------------|-----------------------|---------------------|------------------------|
| **LongLLaVA [1]** | 7B | - | 56.3 | 61.9/66.2 | 51.4/54.7 | 45.4/50.3 | 52.9/57.1 |
| **Video-CCAM [2]** | 14B | - | 63.1 | 62.2/66.0 | 50.6/56.3 | 46.7/49.9 | 53.2/57.4 |
| **LongVA-7B [3]** | 7B | 51.3 | 58.8 | 61.3/61.6 | 50.4/53.6 | 46.2/47.6 | 52.6/54.3 |
| **LongVA-TPO (ours)** | 7B | 54.2 | 61.7 | 63.1/66.6 | 54.8/55.3 | 47.4/47.9 | 55.1/56.6 |

## Get Started

Use the code below to get started with the model. For more information, please refer to our [github repository](https://github.com/ruili33/TPO).

```python
from longva.model.builder import load_pretrained_model
from longva.mm_utils import tokenizer_image_token, process_images
from longva.constants import IMAGE_TOKEN_INDEX
from PIL import Image
from decord import VideoReader, cpu
import torch
import numpy as np

# Fix the random seed for reproducible sampling.
torch.manual_seed(0)

model_path = "ruili0/LongVA-TPO"
image_path = "local_demo/assets/lmms-eval.png"
video_path = "local_demo/assets/dc_demo.mp4"
max_frames_num = 16  # you can raise this to several thousand frames, as long as your GPU memory can handle it :)
gen_kwargs = {"do_sample": True, "temperature": 0.5, "top_p": None, "num_beams": 1, "use_cache": True, "max_new_tokens": 1024}
# You can also set device_map to "auto" to accommodate more frames.
tokenizer, model, image_processor, _ = load_pretrained_model(model_path, None, "llava_qwen", device_map="cuda:0")


# Image input
prompt = "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<image>\nDescribe the image in details.<|im_end|>\n<|im_start|>assistant\n"
input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(model.device)
image = Image.open(image_path).convert("RGB")
images_tensor = process_images([image], image_processor, model.config).to(model.device, dtype=torch.float16)
with torch.inference_mode():
    output_ids = model.generate(input_ids, images=images_tensor, image_sizes=[image.size], modalities=["image"], **gen_kwargs)
outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(outputs)
print("-" * 50)

# Video input
prompt = "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<image>\nGive a detailed caption of the video as if I am blind.<|im_end|>\n<|im_start|>assistant\n"
input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(model.device)
vr = VideoReader(video_path, ctx=cpu(0))
total_frame_num = len(vr)
# Uniformly sample max_frames_num frame indices across the whole video.
uniform_sampled_frames = np.linspace(0, total_frame_num - 1, max_frames_num, dtype=int)
frame_idx = uniform_sampled_frames.tolist()
frames = vr.get_batch(frame_idx).asnumpy()
video_tensor = image_processor.preprocess(frames, return_tensors="pt")["pixel_values"].to(model.device, dtype=torch.float16)
with torch.inference_mode():
    output_ids = model.generate(input_ids, images=[video_tensor], modalities=["video"], **gen_kwargs)
outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(outputs)
```
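
If you prefer to have the checkpoint on disk before running the demo (for example, on a machine without Hub access at runtime), you can pre-download it with `huggingface_hub` and point `model_path` at the resulting directory. A minimal sketch, assuming `huggingface_hub` is installed:

```python
from huggingface_hub import snapshot_download

# Download the ruili0/LongVA-TPO checkpoint into the local HF cache
# (pass local_dir=... to choose a specific folder instead).
model_path = snapshot_download("ruili0/LongVA-TPO")
print(model_path)  # use this path with load_pretrained_model above
```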

## License

This project uses datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of those licenses, including but not limited to the OpenAI Terms of Use for the dataset and the license of the base language model (Qwen2 license). This project does not impose any additional constraints beyond those stipulated in the original licenses. Users are also responsible for ensuring that their use of the datasets and checkpoints complies with all applicable laws and regulations.

## Citation

**BibTeX:**

[More Information Needed]

**References:**

[1] Wang, X., Song, D., Chen, S., Zhang, C., & Wang, B. (2024). LongLLaVA: Scaling multi-modal LLMs to 1000 images efficiently via a hybrid architecture. arXiv preprint arXiv:2409.02889.

[2] Fei, J., Li, D., Deng, Z., Wang, Z., Liu, G., & Wang, H. (2024). Video-CCAM: Enhancing video-language understanding with causal cross-attention masks for short and long videos. arXiv preprint arXiv:2408.14023.

[3] Zhang, P., Zhang, K., Li, B., Zeng, G., Yang, J., Zhang, Y., ... & Liu, Z. (2024). Long context transfer from language to vision. arXiv preprint arXiv:2406.16852.