---
language:
- en
tags:
- llava
- multimodal
- qwen
license: apache-2.0
pipeline_tag: image-text-to-text
---

# nanoLLaVA-1.5 - Improved sub 1B Vision-Language Model
## Description

nanoLLaVA-1.5 is a "small but mighty" 1B vision-language model designed to run efficiently on edge devices. It is an update of the v1.0 release, [qnguyen3/nanoLLaVA](https://huggingface.co/qnguyen3/nanoLLaVA).

- **Base LLM**: [Quyen-SE-v0.1](https://huggingface.co/vilm/Quyen-SE-v0.1) (Qwen1.5-0.5B)
- **Vision Encoder**: [google/siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384)

| Model | **VQA v2** | **TextVQA** | **ScienceQA** | **POPE** | **MMMU (Test)** | **MMMU (Eval)** | **GQA** | **MM-VET** |
|---------|--------|---------|-----------|------|-------------|-------------|------|--------|
| nanoLLaVA-1.0 | 70.84 | 46.71 | 58.97 | 84.1 | 28.6 | 30.4 | 54.79 | 23.9 |
| nanoLLaVA-1.5 | TBD | TBD | TBD | TBD | TBD | TBD | TBD | TBD |

## Training Data

The training data will be released later, as I am still writing a paper on it. Expect the final model to be much more powerful than the current one.

## Finetuning Code

Coming Soon!!!

## Usage

You can use the model with `transformers` via the following script:

```bash
pip install -U transformers accelerate flash_attn
```

```python
import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image
import warnings

# disable some warnings
transformers.logging.set_verbosity_error()
transformers.logging.disable_progress_bar()
warnings.filterwarnings('ignore')

# set device
torch.set_default_device('cuda')  # or 'cpu'

model_name = 'qnguyen3/nanoLLaVA-1.5'

# create model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map='auto',
    trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True)

# text prompt
prompt = 'Describe this image in detail'

messages = [
    {"role": "user", "content": f'