---
library_name: transformers
license: apache-2.0
datasets:
- deepvk/LLaVA-Instruct-ru
- Lin-Chen/ShareGPT4V
- deepvk/GQA-ru
language:
- ru
- en
base_model: google/gemma-2b-it
pipeline_tag: image-text-to-text
---

# LLaVA-Gemma-2b-LORA

LLaVA-Gemma-2b-LORA is a Vision-Language Model (VLM) based on the [`google/gemma-2b-it`](https://huggingface.co/google/gemma-2b-it) model and trained in the original LLaVA setup using LoRA.
The model is primarily adapted to work with Russian, but remains capable of working in English.

## Usage

The model can be used directly via the `transformers` API:

```python
import requests
from PIL import Image

from transformers import AutoProcessor, AutoTokenizer, LlavaForConditionalGeneration

model_name = "deepvk/llava-gemma-2b-lora"

# Load the model, the image/text processor, and the tokenizer
model = LlavaForConditionalGeneration.from_pretrained(model_name)
processor = AutoProcessor.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Download an example image
url = "https://www.ilankelman.org/stopsigns/australia.jpg"
img = Image.open(requests.get(url, stream=True).raw)

# The <image> tag marks where the image is placed in the prompt.
# The Russian instruction means "Describe the picture in a few words."
messages = [
    {"role": "user", "content": "<image>\nОпиши картинку несколькими словами."}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = processor(images=[img], text=text, return_tensors="pt")
generate_ids = model.generate(**inputs, max_new_tokens=30)

# Decode only the newly generated tokens
answer = tokenizer.decode(generate_ids[0, inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(answer)
```

Use the `<image>` tag to point to an image in the text, and follow the chat template for a multi-turn conversation.
The model can also chat without any images or work with multiple images in a conversation, but this behavior has not been tested.

The model format allows it to be used directly in popular frameworks; for example, you can evaluate it with [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval), see the Results section for details.

## Train

To train this model, we follow the original LLaVA pipeline and reuse the [`haotian-liu/LLaVA`](https://github.com/haotian-liu/LLaVA) framework.

The model was trained in two stages:
1. The adapter was trained on the pre-training data from [`ShareGPT4V`](https://github.com/InternLM/InternLM-XComposer/tree/main/projects/ShareGPT4V).
2. Instruction tuning covered both the LLM and the adapter. For this stage we used:
    * [`deepvk/LLaVA-Instruct-ru`](https://huggingface.co/datasets/deepvk/LLaVA-Instruct-ru) — our new dataset of VLM instructions in Russian.
    * [`deepvk/GQA-ru`](https://huggingface.co/datasets/deepvk/GQA-ru) — the training split of the popular GQA benchmark translated into Russian; each question was suffixed with the post-prompt "Ответь одним словом." ("Answer in one word."), as illustrated in the sketch below.
    * Instruction data from ShareGPT4V.

The entire training process took 3 days on a single A100 40GB.
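Since GQA-ru training questions carried the post-prompt "Ответь одним словом." ("Answer in one word."), appending the same post-prompt at inference is a natural way to request short, single-word answers. Below is a minimal sketch of this pattern, reusing the API calls from the usage example above; the example question is hypothetical and the exact prompt formatting is an assumption, only the post-prompt text itself comes from the training setup.

```python
import requests
from PIL import Image

from transformers import AutoProcessor, AutoTokenizer, LlavaForConditionalGeneration

model_name = "deepvk/llava-gemma-2b-lora"
model = LlavaForConditionalGeneration.from_pretrained(model_name)
processor = AutoProcessor.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

url = "https://www.ilankelman.org/stopsigns/australia.jpg"
img = Image.open(requests.get(url, stream=True).raw)

# Hypothetical GQA-ru-style question ("What color is the sign?") followed by
# the training post-prompt "Ответь одним словом." ("Answer in one word.").
question = "Какого цвета знак?"
prompt = f"<image>\n{question} Ответь одним словом."

messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = processor(images=[img], text=text, return_tensors="pt")
# A small generation budget is enough for single-word answers
generate_ids = model.generate(**inputs, max_new_tokens=10)

answer = tokenizer.decode(generate_ids[0, inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(answer)
```

Keeping the answer short matters because GQA-style scoring is exact match against a reference word, so a long descriptive reply would count as incorrect even when it contains the right answer.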
## Results

The model's performance was evaluated using the [`lmms-eval`](https://github.com/EvolvingLMMs-Lab/lmms-eval/tree/main) framework:

```bash
accelerate launch -m lmms_eval --model llava_hf --model_args pretrained="deepvk/llava-gemma-2b-lora" \
  --tasks gqa-ru,mmbench_ru_dev,gqa,mmbench_en_dev --batch_size 1 \
  --log_samples --log_samples_suffix llava-gemma-2b-lora --output_path ./logs/
```

| Model | GQA | GQA-ru | MMBench | MMBench-ru |
| ----------------------------------------------------------------------------------------------- |:------------:|:------------:|:------------:|:------------:|
| `deepvk/llava-gemma-2b-lora` [this model]                                                         | 56.39        | 46.37        | 51.72        | 40.19        |
| [`Intel/llava-gemma-2b`](https://huggingface.co/Intel/llava-gemma-2b)                             | 59.80        | 0.20         | 39.40        | 28.30        |
| [`deepvk/llava-saiga-8b`](https://huggingface.co/deepvk/llava-saiga-8b)                           | 62.00        | **51.44**    | 64.26        | **56.65**    |
| [`llava-hf/llava-1.5-7b-hf`](https://huggingface.co/llava-hf/llava-1.5-7b-hf)                     | 61.31        | 28.39        | 62.97        | 52.25        |
| [`llava-hf/llava-v1.6-mistral-7b-hf`](https://huggingface.co/llava-hf/llava-v1.6-mistral-7b-hf)   | **64.65**    | 6.65         | **67.70**    | 48.80        |

*Note*: for MMBench, we did not use the OpenAI API to extract the answer choice from the generated string. Therefore, the score is effectively Exact Match, as in the GQA benchmark.

## Citation

```
@misc{liu2023llava,
    title={Visual Instruction Tuning},
    author={Liu, Haotian and Li, Chunyuan and Wu, Qingyang and Lee, Yong Jae},
    publisher={NeurIPS},
    year={2023},
}
```

```
@misc{deepvk2024llava-gemma-2b-lora,
    title={LLaVA-Gemma-2b-LORA},
    author={Belopolskih, Daniil and Spirin, Egor},
    url={https://huggingface.co/deepvk/llava-gemma-2b-lora},
    publisher={Hugging Face},
    year={2024},
}
```