---
license: mit
library_name: transformers
pipeline_tag: image-text-to-text
---
![header](./assets/header.png) 

<p align="center">
   📃 <a href="https://arxiv.org/abs/2409.02889" target="_blank">Paper</a> • 🌐 <a href="" target="_blank">Demo</a> • 📃 <a href="https://github.com/FreedomIntelligence/LongLLaVA" target="_blank">LongLLaVA</a> 
</p>

![efficiency](./assets/singleGPU.png) 


## 🌈 Update

* **[2024.09.05]** The LongLLaVA repo is published! 🎉

## Architecture

<details>
  <summary>Click to view the architecture image</summary>

  ![Architecture Image](./assets/arch.png)

</details>


## Results

<details>
  <summary>Click to view the Results</summary>

  - Main Results
      ![Main Results](./assets/result1.png) 
  - Diagnostic Results
      ![Diagnostic Results](./assets/diaresult.png)
  - Video-NIAH
      ![Video-NIAH](./assets/NIAH.png)

</details>



## Results reproduction


### Evaluation

- Preparation

Get the model inference code from [GitHub](https://github.com/FreedomIntelligence/LongLLaVA).

```bash
git clone https://github.com/FreedomIntelligence/LongLLaVA.git
cd LongLLaVA
```

- Environment Setup

```bash
pip install -r requirements.txt
```


- Command Line Interface

```bash
python cli.py --model_dir path-to-longllava
```


- Model Inference

```python
from cli import Chatbot

query = 'What does the picture show?'
image_paths = ['image_path1']  # image or video path(s)

# Replace 'path-to-longllava' with the path to the local model directory.
bot = Chatbot('path-to-longllava')
output = bot.inference(query, image_paths)
print(output) # Prints the output of the model
```
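Since `bot.inference` takes a list of paths, several images can be passed in one call. The sketch below shows one way to build that list from a folder. The helper names `collect_media_paths` and `run_inference` and the extension set are illustrative assumptions, not part of the LongLLaVA repo; only `Chatbot` and `inference(query, image_paths)` come from the example above.

```python
from pathlib import Path

# Assumption: this extension set is illustrative, not taken from the LongLLaVA repo.
MEDIA_EXTENSIONS = {".jpg", ".jpeg", ".png", ".mp4"}


def collect_media_paths(directory, extensions=MEDIA_EXTENSIONS):
    """Return the sorted paths of files in `directory` whose suffix is in `extensions`."""
    root = Path(directory)
    if not root.is_dir():
        raise NotADirectoryError(f"{directory} is not a directory")
    return sorted(
        str(p) for p in root.iterdir()
        if p.is_file() and p.suffix.lower() in extensions
    )


def run_inference(bot, query, directory):
    """Call bot.inference(query, image_paths) with every media file found in `directory`."""
    image_paths = collect_media_paths(directory)
    if not image_paths:
        raise FileNotFoundError(f"no image or video files found in {directory}")
    return bot.inference(query, image_paths)


# Hypothetical usage, assuming the repo's cli.py is importable:
#   from cli import Chatbot
#   bot = Chatbot('path-to-longllava')
#   print(run_inference(bot, 'What do these pictures show?', './my_images'))
```

Sorting the paths keeps the input order deterministic across runs, which may matter if the query refers to images by position.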

## TO DO

- [ ] Release Data Construction Code

## Acknowledgement

- [LLaVA](https://github.com/haotian-liu/LLaVA): Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.

## Citation

```
@misc{wang2024longllavascalingmultimodalllms,
      title={LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture}, 
      author={Xidong Wang and Dingjie Song and Shunian Chen and Chen Zhang and Benyou Wang},
      year={2024},
      eprint={2409.02889},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2409.02889}, 
}
```