---
library_name: transformers
license: apache-2.0
datasets:
- deepvk/LLaVA-Instruct-ru
- Lin-Chen/ShareGPT4V
- deepvk/GQA-ru
language:
- ru
- en
base_model: IlyaGusev/saiga_llama3_8b
pipeline_tag: image-text-to-text
---
# LLaVA-Saiga-8b
LLaVA-Saiga-8b is a Vision-Language Model (VLM) based on the [`IlyaGusev/saiga_llama3_8b`](https://huggingface.co/IlyaGusev/saiga_llama3_8b) model
and trained in the original LLaVA setup. The model is primarily adapted to Russian but can still work with English.
## Usage
The model can be used through the `transformers` API:
```python
import requests
from PIL import Image
from transformers import AutoProcessor, AutoTokenizer, LlavaForConditionalGeneration

# Load the model, the image/text processor, and the tokenizer from the Hub
model_name = "deepvk/llava-saiga-8b"
model = LlavaForConditionalGeneration.from_pretrained(model_name)
processor = AutoProcessor.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Download an example image
url = "https://www.ilankelman.org/stopsigns/australia.jpg"
img = Image.open(requests.get(url, stream=True).raw)

# The <image> tag marks where the image is placed in the prompt
messages = [
    {"role": "user", "content": "<image>\nОпиши картинку несколькими словами."}
]

# Render the chat template to a string and prepare the model inputs
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(images=[img], text=text, return_tensors="pt")

# Generate, then decode only the newly generated tokens (skip the prompt)
generate_ids = model.generate(**inputs, max_new_tokens=30)
answer = tokenizer.decode(generate_ids[0, inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(answer)
```
Use the `<image>` tag to mark where an image appears in the text, and follow the chat template for multi-turn conversations.
The model can chat without any images or work with multiple images in one conversation, but this behavior has not been tested.
The model format allows it to be used directly in popular frameworks.
For example, you can evaluate the model with [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval); see the Results section for details.
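As a minimal sketch of a multi-turn conversation, the message list for the chat template could be assembled as follows. The helper `build_messages` is hypothetical (it is not part of `transformers`), and the assistant reply in the example is a placeholder, not real model output.

```python
def build_messages(turns, image_in_first_turn=True):
    """Build a chat-template message list from (user_text, assistant_text) pairs.

    `turns` is a list of (user_text, assistant_text_or_None) tuples. Use None
    as the assistant text of the last turn when the model should answer next.
    """
    messages = []
    for i, (user_text, assistant_text) in enumerate(turns):
        # Put the <image> tag at the start of the first user turn,
        # mirroring the single-turn example above.
        if i == 0 and image_in_first_turn:
            user_text = "<image>\n" + user_text
        messages.append({"role": "user", "content": user_text})
        if assistant_text is not None:
            messages.append({"role": "assistant", "content": assistant_text})
    return messages


messages = build_messages([
    # The assistant reply below is a placeholder for a previous model answer.
    ("Опиши картинку несколькими словами.", "Знак «стоп» на улице."),
    ("Какого цвета знак?", None),
])
```

The resulting `messages` list can then be passed to `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)` in the same way as in the single-turn example.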
## Train
To train this model, we followed the original LLaVA pipeline and reused the [`haotian-liu/LLaVA`](https://github.com/haotian-liu/LLaVA) framework.
The model was trained in two stages:
1. The adapter was trained on pre-training data from [`ShareGPT4V`](https://github.com/InternLM/InternLM-XComposer/tree/main/projects/ShareGPT4V).
2. Instruction tuning, in which both the LLM and the adapter were trained. For this stage we used:
    * [`deepvk/LLaVA-Instruct-ru`](https://huggingface.co/datasets/deepvk/LLaVA-Instruct-ru) - our new dataset of VLM instructions in Russian.
    * [`deepvk/GQA-ru`](https://huggingface.co/datasets/deepvk/GQA-ru) - the training split of the popular GQA benchmark, translated into Russian. We used the post-prompt "Ответь одним словом. " ("Answer with one word.").
    * Instruction data from ShareGPT4V.
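To illustrate the GQA-ru post-prompt mentioned above, here is a minimal sketch of how a question might be combined with it. The function name `format_gqa_ru_question` is hypothetical, and the newline separator between the question and the post-prompt is an assumption; the exact formatting used in training is not stated here.

```python
# Post-prompt quoted from the training description (trailing space included).
POST_PROMPT = "Ответь одним словом. "


def format_gqa_ru_question(question, post_prompt=POST_PROMPT):
    """Append the post-prompt to a GQA-ru question.

    Assumption: the question and the post-prompt are joined by a newline.
    """
    return f"{question}\n{post_prompt}"


prompt = format_gqa_ru_question("Какого цвета машина?")
```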
The entire training process took 3-4 days on 8 x A100 80GB.
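The two-stage schedule described above can be sketched as a rule for which parameters are trainable in each stage. This is an illustration only, not the code used for training (which relied on the `haotian-liu/LLaVA` framework). The prefixes below are assumptions based on the submodule names of `LlavaForConditionalGeneration` and may differ between `transformers` versions; keeping the vision encoder frozen in both stages is also an assumption taken from the original LLaVA setup.

```python
# Assumed parameter-name prefixes that are trainable in each stage.
STAGE_TRAINABLE_PREFIXES = {
    1: ("multi_modal_projector.",),                    # stage 1: adapter only
    2: ("multi_modal_projector.", "language_model."),  # stage 2: adapter + LLM
}


def is_trainable(param_name, stage):
    """Return True if the named parameter would be updated in the given stage."""
    return param_name.startswith(STAGE_TRAINABLE_PREFIXES[stage])


def apply_stage(model, stage):
    """Set requires_grad on a PyTorch model according to the stage rule."""
    for name, param in model.named_parameters():
        param.requires_grad = is_trainable(name, stage)
```

With a loaded PyTorch model, `apply_stage(model, 1)` would leave only the adapter trainable, and `apply_stage(model, 2)` would additionally unfreeze the LLM.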
## Results
The model's performance was evaluated using the [`lmms-eval`](https://github.com/EvolvingLMMs-Lab/lmms-eval/tree/main) framework:
```bash
accelerate launch -m lmms_eval --model llava_hf --model_args pretrained="deepvk/llava-saiga-8b" \
--tasks gqa-ru,mmbench_ru_dev,gqa,mmbench_en_dev --batch_size 1 \
--log_samples --log_samples_suffix llava-saiga-8b --output_path ./logs/
```
| Model | GQA | GQA-ru | MMBench | MMBench-ru |
| ----------------------------------------------------------------------------------------------- |:---------:|:---------:|:---------:|:----------:|
| [`deepvk/llava-gemma-2b-lora`](https://huggingface.co/deepvk/llava-gemma-2b-lora) | 56.39 | 46.37 | 51.72 | 40.19 |
| [`Intel/llava-gemma-2b`](https://huggingface.co/Intel/llava-gemma-2b) | 59.80 | 0.20 | 39.40 | 28.30 |
| `deepvk/llava-saiga-8b` [this model] | 62.00 | **51.44** | 64.26 | **56.65** |
| [`llava-hf/llava-1.5-7b-hf`](https://huggingface.co/llava-hf/llava-1.5-7b-hf) | 61.31 | 28.39 | 62.97 | 52.25 |
| [`llava-hf/llava-v1.6-mistral-7b-hf`](https://huggingface.co/llava-hf/llava-v1.6-mistral-7b-hf) | **64.65** | 6.65 | **67.70** | 48.80 |
*Note*: for MMBench, we did not use the OpenAI API to find the answer option in the generated string. Therefore, the score is similar to Exact Match, as in the GQA benchmark.
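For reference, an Exact Match style score can be sketched as below. The normalization (lowercasing, trimming whitespace and a trailing period) is an assumption made for illustration; the normalization actually applied by `lmms-eval` may differ, and the function names are hypothetical.

```python
def normalize(text):
    """Assumed normalization: trim whitespace, lowercase, drop a trailing period."""
    return text.strip().lower().rstrip(".").strip()


def exact_match_accuracy(predictions, targets):
    """Percentage of predictions that equal their target after normalization."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not predictions:
        return 0.0
    hits = sum(normalize(p) == normalize(t) for p, t in zip(predictions, targets))
    return 100.0 * hits / len(predictions)


# Two of the three predictions match their targets after normalization.
score = exact_match_accuracy(["A", "b.", "C"], ["a", "B", "D"])
```

Under this kind of scoring, a correct answer phrased differently from the target (for example, a full sentence instead of an option letter) counts as wrong, which is why such scores can be lower than those obtained with an LLM-based answer matcher.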
## Citation
```
@misc{liu2023llava,
title={Visual Instruction Tuning},
author={Liu, Haotian and Li, Chunyuan and Wu, Qingyang and Lee, Yong Jae},
publisher={NeurIPS},
year={2023},
}
```
```
@misc{deepvk2024llava-saiga-8b,
title={LLaVA-Saiga-8b},
author={Belopolskih, Daniil and Spirin, Egor},
url={https://huggingface.co/deepvk/llava-saiga-8b},
publisher={Hugging Face},
year={2024},
}
```