---
language:
- en
tags:
- llava
- multimodal
- qwen
license: apache-2.0
pipeline_tag: image-text-to-text
---

# nanoLLaVA-1.5 - Improved sub 1B Vision-Language Model
## Description

nanoLLaVA-1.5 is a "small but mighty" 1B vision-language model designed to run efficiently on edge devices. It is an update of the v1.0 release, [qnguyen3/nanoLLaVA](https://huggingface.co/qnguyen3/nanoLLaVA).

- **Base LLM**: [Quyen-SE-v0.1](https://huggingface.co/vilm/Quyen-SE-v0.1) (Qwen1.5-0.5B)
- **Vision Encoder**: [google/siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384)

| Model | **VQA v2** | **TextVQA** | **ScienceQA** | **POPE** | **MMMU (Test)** | **MMMU (Eval)** | **GQA** | **MM-VET** |
|---------|--------|---------|-----------|------|-------------|-------------|------|--------|
| nanoLLaVA-1.0 | 70.84 | 46.71 | 58.97 | 84.1 | 28.6 | 30.4 | 54.79 | 23.9 |
| nanoLLaVA-1.5 | TBD | TBD | TBD | TBD | TBD | TBD | TBD | TBD |

## Training Data

The training data will be released later, as I am still writing a paper on it. Expect the final model to be much more powerful than the current one.

## Finetuning Code

Coming Soon!!!

## Usage

You can use the model with `transformers` via the following script:

```bash
pip install -U transformers accelerate flash_attn
```

```python
import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image
import warnings

# disable some warnings
transformers.logging.set_verbosity_error()
transformers.logging.disable_progress_bar()
warnings.filterwarnings('ignore')

# set device
torch.set_default_device('cuda')  # or 'cpu'

model_name = 'qnguyen3/nanoLLaVA-1.5'

# create model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map='auto',
    trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True)

# text prompt
prompt = 'Describe this image in detail'

messages = [
    {"role": "user", "content": f'