Commit cfaf2a7 by fede97 (1 parent: 939ed4f)
readme
README.md CHANGED
@@ -26,7 +26,6 @@ datasets:
 LLaVA-MORE is a new family of Multimodal Large Language Models (MLLMs) that integrates recent language models with diverse visual backbones. This specific model, `LLaVA_MORE-gemma_2_9b-finetuning`, is fine-tuned on `LLaVA-Instruct-665K` using `gemma-2-9b-it` as the LLM backbone and a CLIP-based visual backbone. It is designed to evaluate multimodal reasoning, generation, and instruction following tasks.
 
 <div align="center">
-<img src="https://github.com/aimagelab/LLaVA-MORE/raw/main/images/image_no_back.png" width="200" height="200">
 <h1> 🔥 LLaVA-MORE 🔥
 
 A Comparative Study of LLMs and Visual Backbones <br>for Enhanced Visual Instruction Tuning
@@ -94,44 +93,6 @@ The models are trained on large-scale datasets that may contain societal biases,
 ### Recommendations
 Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. It is recommended to carefully evaluate the model's outputs for their specific use case and consider implementing additional safeguards or human oversight, especially in high-stakes scenarios. Understanding the limitations arising from the training data and model architecture is crucial.
 
-## How to Get Started with the Model
-Use the code below to get started with the model.
-
-```python
-from transformers import AutoProcessor, LlavaGemmaForCausalLM
-from PIL import Image
-import requests
-
-# Load model and processor
-model_id = "aimagelab/LLaVA_MORE-gemma_2_9b-finetuning" # This is the model card for this specific variant
-model = LlavaGemmaForCausalLM.from_pretrained(
-    model_id,
-    torch_dtype="auto",
-    device_map="auto"
-)
-processor = AutoProcessor.from_pretrained(model_id)
-
-# Prepare inputs
-image_url = "https://llava-vl.github.io/static/images/a-chat-with-llava.jpg" # Example image from LLaVA project
-raw_image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")
-
-prompt = "Describe the image in detail."
-messages = [
-    {"role": "user", "content": "<image>" + prompt},
-]
-text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
-inputs = processor(text=text, images=raw_image, return_tensors="pt")
-
-# Generate output
-output = model.generate(**inputs, max_new_tokens=256) # Increased max_new_tokens for potentially more detailed description
-generated_text = processor.decode(output[0], skip_special_tokens=True)
-
-print(f"User: {prompt}
-Assistant: {generated_text}")
-```
-
-If you encounter out-of-memory problems, consider loading the model weights in 8-bit (`load_in_8bit=True`) or 4-bit (`load_in_4bit=True`).
-
 ## Training Details
 
 ### Training Data
@@ -269,4 +230,4 @@ We are also happy users of the [lmms-eval](https://github.com/EvolvingLMMs-Lab/l
 Niels (Hugging Face Community Science Team)
 
 ## Model Card Contact
-AImageLab (via GitHub issues on the repository)
+AImageLab (via GitHub issues on the repository)
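The three hunk headers in this commit (`@@ -26,7 +26,6 @@`, `@@ -94,44 +93,6 @@`, `@@ -269,4 +230,4 @@`) follow the unified-diff convention `-old_start,old_len +new_start,new_len`. As a rough sketch for checking how many lines the commit removes from README.md overall, the headers can be parsed like this. The helper names `parse_hunk_header` and `net_line_change` are hypothetical and are not part of any Hub or Git tooling.

```python
import re

# Hunk headers copied from the commit shown above.
HEADERS = [
    "@@ -26,7 +26,6 @@ datasets:",
    "@@ -94,44 +93,6 @@",
    "@@ -269,4 +230,4 @@",
]

# Unified-diff hunk header: "@@ -old_start[,old_len] +new_start[,new_len] @@"
HUNK_RE = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def parse_hunk_header(header):
    """Return (old_start, old_len, new_start, new_len) for one hunk header."""
    match = HUNK_RE.match(header)
    if match is None:
        raise ValueError(f"not a hunk header: {header!r}")
    old_start, old_len, new_start, new_len = match.groups()
    # An omitted length means the hunk spans exactly one line.
    return (int(old_start), int(old_len or 1), int(new_start), int(new_len or 1))


def net_line_change(headers):
    """Sum of (new_len - old_len) over all hunks; negative means lines were removed."""
    total = 0
    for header in headers:
        _, old_len, _, new_len = parse_hunk_header(header)
        total += new_len - old_len
    return total


if __name__ == "__main__":
    for header in HEADERS:
        print(header, "->", parse_hunk_header(header))
    print("net line change:", net_line_change(HEADERS))
```

The net change agrees with the offset between the old and new start lines of the last hunk (269 versus 230), which is a quick consistency check on the diff.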