---
license: mit
library_name: transformers
pipeline_tag: image-text-to-text
---
![header](./assets/header.png) 

<p align="center">
   📃 <a href="https://arxiv.org/abs/2409.02889" target="_blank">Paper</a> • 🌐 <a href="" target="_blank">Demo</a> • 📃 <a href="https://github.com/FreedomIntelligence/LongLLaVA" target="_blank">LongLLaVA</a> 
</p>

![efficiency](./assets/singleGPU.png) 


## 🌈 Update

* **[2024.09.05]** LongLLaVA repo is published!🎉 The Code will

## Architecture

<details>
  <summary>Click to view the architecture image</summary>

  ![Architecture Image](./assets/arch.png)

</details>


## Results

<details>
  <summary>Click to view the Results</summary>

  - Main Results
      ![Main Results](./assets/result1.png) 
  - Diagnostic Results
      ![Diagnostic Results](./assets/diaresult.png)
  - Video-NIAH
      ![Video-NIAH](./assets/NIAH.png)

</details>



## Results reproduction


### Evaluation

- Preparation

Get the model inference code from [GitHub](https://github.com/FreedomIntelligence/LongLLaVA).

```bash
git clone https://github.com/FreedomIntelligence/LongLLaVA.git
```

- Environment Setup

```bash
cd LongLLaVA
pip install -r requirements.txt
```


- Command Line Interface

```bash
python cli.py --model_dir path-to-longllava
```


- Model Inference

```python
from cli import Chatbot  # cli.py comes from the cloned LongLLaVA repo

query = 'What does the picture show?'
image_paths = ['image_path1']  # image or video path
model_dir = 'path-to-longllava'  # same model directory as passed to --model_dir

bot = Chatbot(model_dir)
output = bot.inference(query, image_paths)
print(output)  # prints the model's output
```
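
When a query covers many images, the `image_paths` list can be built from a directory rather than typed by hand. The following is a minimal sketch, not part of the LongLLaVA repo: `collect_image_paths` and `IMAGE_EXTENSIONS` are hypothetical names, and it assumes that `bot.inference` accepts several paths in one list (the example above passes only one).

```python
from pathlib import Path

# Hypothetical helper for illustration; not provided by the LongLLaVA repo.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def collect_image_paths(directory, extensions=IMAGE_EXTENSIONS):
    """Return the sorted paths (as strings) of image files directly inside `directory`."""
    root = Path(directory)
    return sorted(
        str(p)
        for p in root.iterdir()
        if p.is_file() and p.suffix.lower() in extensions
    )


# Usage with the Chatbot from cli.py (assumption: multiple paths are accepted):
# image_paths = collect_image_paths("path-to-images")
# output = bot.inference("What do these pictures show?", image_paths)
```

Sorting keeps the image order deterministic, which matters if the query refers to images by position.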


## Acknowledgement

- [LLaVA](https://github.com/haotian-liu/LLaVA): Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.

## Citation

```
@misc{wang2024longllavascalingmultimodalllms,
      title={LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture}, 
      author={Xidong Wang and Dingjie Song and Shunian Chen and Chen Zhang and Benyou Wang},
      year={2024},
      eprint={2409.02889},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2409.02889}, 
}
```