fede97 committed on Aug 2, 2025

Commit cfaf2a7

1 Parent(s): 939ed4f

readme

Files changed (1)

README.md +1 -40

@@ -26,7 +26,6 @@ datasets:
 LLaVA-MORE is a new family of Multimodal Large Language Models (MLLMs) that integrates recent language models with diverse visual backbones. This specific model, `LLaVA_MORE-gemma_2_9b-finetuning`, is fine-tuned on `LLaVA-Instruct-665K` using `gemma-2-9b-it` as the LLM backbone and a CLIP-based visual backbone. It is designed to evaluate multimodal reasoning, generation, and instruction following tasks.
 <div align="center">
-  <img src="https://github.com/aimagelab/LLaVA-MORE/raw/main/images/image_no_back.png" width="200" height="200">
   <h1> 🔥 LLaVA-MORE 🔥
  A Comparative Study of LLMs and Visual Backbones <br>for Enhanced Visual Instruction Tuning
@@ -94,44 +93,6 @@ The models are trained on large-scale datasets that may contain societal biases,
 ### Recommendations
 Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. It is recommended to carefully evaluate the model's outputs for their specific use case and consider implementing additional safeguards or human oversight, especially in high-stakes scenarios. Understanding the limitations arising from the training data and model architecture is crucial.
-## How to Get Started with the Model
-Use the code below to get started with the model.
-```python
-from transformers import AutoProcessor, LlavaGemmaForCausalLM
-from PIL import Image
-import requests
-# Load model and processor
-model_id = "aimagelab/LLaVA_MORE-gemma_2_9b-finetuning" # This is the model card for this specific variant
-model = LlavaGemmaForCausalLM.from_pretrained(
-    model_id,
-    torch_dtype="auto",
-    device_map="auto"
-)
-processor = AutoProcessor.from_pretrained(model_id)
-# Prepare inputs
-image_url = "https://llava-vl.github.io/static/images/a-chat-with-llava.jpg" # Example image from LLaVA project
-raw_image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")
-prompt = "Describe the image in detail."
-messages = [
-    {"role": "user", "content": "<image>" + prompt},
-]
-text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
-inputs = processor(text=text, images=raw_image, return_tensors="pt")
-# Generate output
-output = model.generate(**inputs, max_new_tokens=256) # Increased max_new_tokens for potentially more detailed description
-generated_text = processor.decode(output[0], skip_special_tokens=True)
-print(f"User: {prompt}\nAssistant: {generated_text}")
-```
-If you encounter out-of-memory problems, consider loading the model weights in 8-bit (`load_in_8bit=True`) or 4-bit (`load_in_4bit=True`).
 ## Training Details
 ### Training Data
@@ -269,4 +230,4 @@ We are also happy users of the [lmms-eval](https://github.com/EvolvingLMMs-Lab/l
 Niels (Hugging Face Community Science Team)
 ## Model Card Contact
-AImageLab (via GitHub issues on the repository)

+AImageLab (via GitHub issues on the repository)