---
language:
- en
tags:
- llava
- multimodal
- qwen
license: apache-2.0
---

# nanoLLaVA-1.5 - Improved Sub-1B Vision-Language Model

<p align="center">
  <img src="https://i.postimg.cc/d15k3YNG/nanollava.webp" alt="Logo" width="350">
</p>

## Description

nanoLLaVA-1.5 is a "small but mighty" sub-1B vision-language model designed to run efficiently on edge devices. It is an update of the v1.0 release, [qnguyen3/nanoLLaVA](https://huggingface.co/qnguyen3/nanoLLaVA).

- **Base LLM**: [Quyen-SE-v0.1](https://huggingface.co/vilm/Quyen-SE-v0.1) (Qwen1.5-0.5B)
- **Vision Encoder**: [google/siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384)

| Model | **VQA v2** | **TextVQA** | **ScienceQA** | **POPE** | **MMMU (Test)** | **MMMU (Eval)** | **GQA** | **MM-VET** |
|---------------|------------|-------------|---------------|----------|-----------------|-----------------|---------|------------|
| nanoLLaVA-1.0 | 70.84 | 46.71 | 58.97 | 84.1 | 28.6 | 30.4 | 54.79 | 23.9 |
| nanoLLaVA-1.5 | TBD | TBD | TBD | TBD | TBD | TBD | TBD | TBD |

## Training Data

The training data will be released later, as I am still writing a paper about it. Expect the final model to be much more powerful than the current one.

## Finetuning Code

Coming Soon!!!

## Usage

You can use the model with `transformers` via the following script:

```bash
pip install -U transformers accelerate flash_attn
```
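
Note that `flash_attn` compiles CUDA kernels and needs an NVIDIA GPU toolchain to build. If you plan to run on CPU only, you can try installing without it (an assumption here: that the remote modeling code falls back to a standard attention implementation when `flash_attn` is absent):

```bash
# CPU-only install (assumption: the model's remote code does not hard-require flash_attn)
pip install -U transformers accelerate
```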

```python
import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image
import warnings

# silence non-essential warnings
transformers.logging.set_verbosity_error()
transformers.logging.disable_progress_bar()
warnings.filterwarnings('ignore')

# set device
torch.set_default_device('cuda')  # or 'cpu'

model_name = 'qnguyen3/nanoLLaVA-1.5'

# create model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map='auto',
    trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True)

# text prompt
prompt = 'Describe this image in detail'

messages = [
    {"role": "user", "content": f'<image>\n{prompt}'}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

print(text)  # inspect the rendered ChatML prompt

# split the prompt on the <image> placeholder and splice in -200, the
# image token index (LLaVA convention) that marks where image features go
text_chunks = [tokenizer(chunk).input_ids for chunk in text.split('<image>')]
input_ids = torch.tensor(text_chunks[0] + [-200] + text_chunks[1], dtype=torch.long).unsqueeze(0)

# load and preprocess the image; sample images can be found in the images folder
image = Image.open('/path/to/image.png')
image_tensor = model.process_images([image], model.config).to(dtype=model.dtype)

# generate
output_ids = model.generate(
    input_ids,
    images=image_tensor,
    max_new_tokens=2048,
    use_cache=True)[0]

# decode only the newly generated tokens
print(tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip())
```
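
For interactive use, you can stream tokens to the console as they are generated. Here is a minimal sketch using `transformers`' `TextStreamer`, assuming the model's custom `generate` follows the standard `GenerationMixin` interface and accepts a `streamer` argument:

```python
from transformers import TextStreamer

# prints tokens to stdout as they are produced; skip_prompt hides the input prompt
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

model.generate(
    input_ids,
    images=image_tensor,
    max_new_tokens=2048,
    use_cache=True,
    streamer=streamer)
```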

## Prompt Format

The model follows the ChatML standard; note, however, that there is no `\n` at the end of `<|im_end|>`:

```
<|im_start|>system
Answer the question<|im_end|><|im_start|>user
<image>
What is the picture about?<|im_end|><|im_start|>assistant
```
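
For reference, the same prompt can be assembled by hand instead of via `apply_chat_template`. A minimal sketch (the `system` and `question` strings are placeholders; the trailing newline after `assistant` follows standard ChatML and is assumed here):

```python
system = 'Answer the question'
question = 'What is the picture about?'

# note: no newline after <|im_end|>
prompt = (
    f'<|im_start|>system\n{system}<|im_end|>'
    f'<|im_start|>user\n<image>\n{question}<|im_end|>'
    f'<|im_start|>assistant\n'
)
```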