llava-med-v1.5-mistral-7b-hf
This repository contains a drop-in, Hugging Face–compatible checkpoint converted from
https://huggingface.co/microsoft/llava-med-v1.5-mistral-7b.
You can load it with the exact same code you use for the original model—no extra conversion steps required.
✅ Clarification on Vocab Size Expansion & Weight Integrity
The original conversion code is available in this repository; refer to it for the relevant implementation details and code snippets.
You may wonder whether the vocab size mismatch (e.g., 32000 vs. 32064) breaks the original weights by introducing dimension mismatches in embed_tokens and lm_head.
The answer is no. Below are key clarifications:
1. Does expanding embed_tokens affect the attention mechanism?
No. The embed_tokens layer is simply an embedding lookup table (nn.Embedding) that maps token IDs to vectors.
The attention mechanism (e.g., nn.MultiheadAttention or LlamaAttention) does not operate on this lookup table directly; it only requires that the dimension of the input hidden states (hidden_size) stays consistent.
Thus, even if you add more tokens, as long as hidden_size remains unchanged, the weight shapes of the attention layers are not affected at all.
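As a sanity check, here is a minimal sketch using a toy LlamaConfig (tiny dimensions, not the actual 7B checkpoint) showing that resize_token_embeddings leaves every attention projection bit-identical:

```python
# Toy demonstration: resizing the vocabulary touches only embed_tokens / lm_head,
# while every attention projection keeps its exact weights.
import torch
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(vocab_size=32000, hidden_size=64, intermediate_size=128,
                     num_hidden_layers=2, num_attention_heads=4, num_key_value_heads=4)
model = LlamaForCausalLM(config)

# Snapshot the attention weights before resizing.
attn_before = {n: p.clone() for n, p in model.named_parameters() if "self_attn" in n}

model.resize_token_embeddings(32000 + 2)  # e.g. add <image> and <pad>

# Every attention weight is bit-identical after the resize.
for n, p in model.named_parameters():
    if "self_attn" in n:
        assert torch.equal(p, attn_before[n]), n
print("attention untouched; new embedding shape:", model.get_input_embeddings().weight.shape)
```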
2. Which layers actually change after expansion?
| Module | Weight Shape After Expansion | Affects Attention Calculation? |
|---|---|---|
| embed_tokens | [vocab_size+2, hidden_size] | ❌ No |
| lm_head | [vocab_size+2, hidden_size] | ❌ No |
| All attention layers | Unchanged | ❌ No (completely unchanged) |
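To confirm the table on the converted checkpoint itself, you can inspect the stable get_input_embeddings / get_output_embeddings accessors (this assumes `model` has been loaded as in the Quick Start section further down):

```python
# Assumes `model` is the LlavaForConditionalGeneration instance from the Quick Start.
emb = model.get_input_embeddings().weight    # embed_tokens: [expanded_vocab, hidden_size]
head = model.get_output_embeddings().weight  # lm_head:      [expanded_vocab, hidden_size]
print(tuple(emb.shape), tuple(head.shape))

# Any attention projection keeps its original shape, e.g. q_proj.
for name, param in model.named_parameters():
    if "self_attn.q_proj.weight" in name:
        print(name, tuple(param.shape))  # depends only on hidden_size, not vocab_size
        break
```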
Analogy for easier understanding
Think of the model as a dictionary:
- embed_tokens is the "new word list": You add two new words, but the length of each word’s explanation (hidden_size) stays the same.
- Attention is the "reading rule": It only focuses on how the vectors of each word in a sentence interact with each other, not how many words are in the dictionary.
The only "change" lies in input distribution
- The original model never encountered embeddings for tokens like `<image>` or `<pad>`.
- Now these tokens are initialized, and their new vectors appear during the first forward pass.
- This is a data-level change, not damage to the model's parameters, as the sketch below verifies.
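Here is a minimal sketch, again on a toy config rather than the real checkpoint, verifying that the resize copies the original embedding rows verbatim and only appends freshly initialized rows for the new tokens:

```python
# Toy demonstration: resize_token_embeddings preserves all original rows;
# only the appended rows receive new values.
import torch
from transformers import LlamaConfig, LlamaForCausalLM

model = LlamaForCausalLM(LlamaConfig(vocab_size=32000, hidden_size=64,
                                     intermediate_size=128, num_hidden_layers=1,
                                     num_attention_heads=4, num_key_value_heads=4))
old_rows = model.get_input_embeddings().weight.clone()
model.resize_token_embeddings(32000 + 2)
new_emb = model.get_input_embeddings().weight

assert torch.equal(new_emb[:32000], old_rows)          # original vocab untouched
print("freshly initialized rows:", new_emb[32000:].shape)  # the two new tokens
```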
Summary
Expanding the vocab only changes the "dictionary size", not the "reading rules". The weight shape and calculation logic of the attention mechanism remain completely unchanged. The output layer (lm_head) changes in shape but not in functional logic.
3. What exactly changes in the output layer (lm_head)?
The line below resizes both embed_tokens and the weight matrix of lm_head (the language modeling head):

```python
model.resize_token_embeddings(config.text_config.vocab_size + 2, pad_shape)
```
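For context, the second positional argument of resize_token_embeddings is pad_to_multiple_of. Assuming pad_shape is 64 (the value used in the upstream LLaVA→HF conversion scripts; not verified against this repo), the two new tokens plus padding explain why the final vocab is 32064 rather than 32002:

```python
# Hedged sketch of the vocab-size arithmetic; pad_shape = 64 is an assumption.
pad_shape = 64                # assumed padding multiple (pad_to_multiple_of)
new_num_tokens = 32000 + 2    # original Mistral vocab + <image> and <pad>
padded = ((new_num_tokens + pad_shape - 1) // pad_shape) * pad_shape  # ceil to multiple of 64
print(padded)                 # 32064
```

Padding to a multiple of 64 is a common efficiency choice (tensor-core friendly matrix sizes); the extra padded IDs are not part of the tokenizer's vocabulary.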
Quick Start
```python
from transformers import LlavaForConditionalGeneration, AutoProcessor
from PIL import Image
import torch

model_path = "chaoyinshe/llava-med-v1.5-mistral-7b-hf"

model = LlavaForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # requires flash-attn (FA2)
    device_map="auto",                        # multi-GPU ready
)
processor = AutoProcessor.from_pretrained(model_path)

# Example inference
image = Image.open("chest_xray.png")  # replace with your own image
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is the main finding in this chest X-ray?"},
        ],
    }
]
prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(
    images=[image], text=prompt, return_tensors="pt"
).to(model.device, torch.bfloat16)

with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(out[0], skip_special_tokens=True))
```
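If flash-attn is not available in your environment, PyTorch's built-in SDPA implementation is a drop-in alternative; only the attn_implementation flag changes, everything else in the Quick Start stays the same:

```python
# Fallback when the flash-attn package is not installed: use PyTorch SDPA attention.
model = LlavaForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",
    device_map="auto",
)
```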
✅ Training Screenshot
(Note: The image below is for illustrative purposes only — actual training metrics may vary.) 🤗