fede97 committed on
Commit
cfaf2a7
·
1 Parent(s): 939ed4f
Files changed (1)
  1. README.md +1 -40
README.md CHANGED
@@ -26,7 +26,6 @@ datasets:
  LLaVA-MORE is a new family of Multimodal Large Language Models (MLLMs) that integrates recent language models with diverse visual backbones. This specific model, `LLaVA_MORE-gemma_2_9b-finetuning`, is fine-tuned on `LLaVA-Instruct-665K` using `gemma-2-9b-it` as the LLM backbone and a CLIP-based visual backbone. It is designed to evaluate multimodal reasoning, generation, and instruction following tasks.
 
  <div align="center">
- <img src="https://github.com/aimagelab/LLaVA-MORE/raw/main/images/image_no_back.png" width="200" height="200">
  <h1> 🔥 LLaVA-MORE 🔥
 
  A Comparative Study of LLMs and Visual Backbones <br>for Enhanced Visual Instruction Tuning
@@ -94,44 +93,6 @@ The models are trained on large-scale datasets that may contain societal biases,
  ### Recommendations
  Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. It is recommended to carefully evaluate the model's outputs for their specific use case and consider implementing additional safeguards or human oversight, especially in high-stakes scenarios. Understanding the limitations arising from the training data and model architecture is crucial.
 
- ## How to Get Started with the Model
- Use the code below to get started with the model.
-
- ```python
- from transformers import AutoProcessor, LlavaGemmaForCausalLM
- from PIL import Image
- import requests
-
- # Load model and processor
- model_id = "aimagelab/LLaVA_MORE-gemma_2_9b-finetuning" # This is the model card for this specific variant
- model = LlavaGemmaForCausalLM.from_pretrained(
-     model_id,
-     torch_dtype="auto",
-     device_map="auto"
- )
- processor = AutoProcessor.from_pretrained(model_id)
-
- # Prepare inputs
- image_url = "https://llava-vl.github.io/static/images/a-chat-with-llava.jpg" # Example image from LLaVA project
- raw_image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")
-
- prompt = "Describe the image in detail."
- messages = [
-     {"role": "user", "content": "<image>" + prompt},
- ]
- text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
- inputs = processor(text=text, images=raw_image, return_tensors="pt")
-
- # Generate output
- output = model.generate(**inputs, max_new_tokens=256) # Increased max_new_tokens for potentially more detailed description
- generated_text = processor.decode(output[0], skip_special_tokens=True)
-
- print(f"User: {prompt}
- Assistant: {generated_text}")
- ```
-
- If you encounter out-of-memory problems, consider loading the model weights in 8-bit (`load_in_8bit=True`) or 4-bit (`load_in_4bit=True`).
-
  ## Training Details
 
  ### Training Data
@@ -269,4 +230,4 @@ We are also happy users of the [lmms-eval](https://github.com/EvolvingLMMs-Lab/l
  Niels (Hugging Face Community Science Team)
 
  ## Model Card Contact
- AImageLab (via GitHub issues on the repository)
+ AImageLab (via GitHub issues on the repository)