llava-med-v1.5-mistral-7b-hf
This repository contains a drop-in, Hugging Face–compatible checkpoint converted from
https://huggingface.co/microsoft/llava-med-v1.5-mistral-7b.
You can load it with the exact same code you use for the original model—no extra conversion steps required.
✅ Clarification on Vocab Size Expansion & Weight Integrity
The original conversion code is available in this repository; refer to it for the relevant implementation details and code snippets.
You may wonder whether the vocab size mismatch (e.g., 32000 vs. 32064) breaks the original weights by introducing dimension mismatches in embed_tokens and lm_head.
The answer is no. Below are key clarifications:
1. Does expanding embed_tokens affect the attention mechanism?
No. The embed_tokens layer is simply an embedding lookup table (nn.Embedding) that maps token IDs to vectors.
The attention mechanism (e.g., nn.MultiheadAttention or LlamaAttention) does not operate on this lookup table directly; it only requires that the dimension of the input hidden states (hidden_size) stays consistent.
Thus, even if you add more tokens, as long as hidden_size remains unchanged, the weight shapes of the attention layers are not affected at all.
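As a sanity check, here is a minimal sketch using a toy LlamaConfig (tiny dimensions, not the actual 7B checkpoint) showing that resize_token_embeddings leaves every attention projection bit-identical:

```python
# Toy demonstration: resizing the vocabulary touches only embed_tokens / lm_head,
# while every attention projection keeps its exact weights.
import torch
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(vocab_size=32000, hidden_size=64, intermediate_size=128,
                     num_hidden_layers=2, num_attention_heads=4, num_key_value_heads=4)
model = LlamaForCausalLM(config)

# Snapshot the attention weights before resizing.
attn_before = {n: p.clone() for n, p in model.named_parameters() if "self_attn" in n}

model.resize_token_embeddings(32000 + 2)  # e.g. add <image> and <pad>

# Every attention weight is bit-identical after the resize.
for n, p in model.named_parameters():
    if "self_attn" in n:
        assert torch.equal(p, attn_before[n]), n
print("attention untouched; new embedding shape:", model.get_input_embeddings().weight.shape)
```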
2. Which layers actually change after expansion?
| Module | Weight Shape After Expansion | Affects Attention Calculation? |
|---|---|---|
| embed_tokens | [vocab_size+2, hidden_size] | ❌ No |
| lm_head | [vocab_size+2, hidden_size] | ❌ No |
| All attention layers | Unchanged | ❌ No (completely unchanged) |
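To confirm the table on the converted checkpoint itself, you can inspect the stable get_input_embeddings / get_output_embeddings accessors (this assumes `model` has been loaded as in the Quick Start section further down):

```python
# Assumes `model` is the LlavaForConditionalGeneration instance from the Quick Start.
emb = model.get_input_embeddings().weight    # embed_tokens: [expanded_vocab, hidden_size]
head = model.get_output_embeddings().weight  # lm_head:      [expanded_vocab, hidden_size]
print(tuple(emb.shape), tuple(head.shape))

# Any attention projection keeps its original shape, e.g. q_proj.
for name, param in model.named_parameters():
    if "self_attn.q_proj.weight" in name:
        print(name, tuple(param.shape))  # depends only on hidden_size, not vocab_size
        break
```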
Analogy for easier understanding
Think of the model as a dictionary:
- embed_tokens is the "new word list": You add two new words, but the length of each word’s explanation (hidden_size) stays the same.
- Attention is the "reading rule": It only focuses on how the vectors of each word in a sentence interact with each other, not how many words are in the dictionary.
The only "change" lies in input distribution
- The original model never encountered embeddings for tokens like `<image>` or `<pad>`.
- Now these tokens are initialized, and their new vectors appear during the first forward pass.
- This is a data-level change, not damage to the model's parameters, as the sketch below verifies.
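Here is a minimal sketch, again on a toy config rather than the real checkpoint, verifying that the resize copies the original embedding rows verbatim and only appends freshly initialized rows for the new tokens:

```python
# Toy demonstration: resize_token_embeddings preserves all original rows;
# only the appended rows receive new values.
import torch
from transformers import LlamaConfig, LlamaForCausalLM

model = LlamaForCausalLM(LlamaConfig(vocab_size=32000, hidden_size=64,
                                     intermediate_size=128, num_hidden_layers=1,
                                     num_attention_heads=4, num_key_value_heads=4))
old_rows = model.get_input_embeddings().weight.clone()
model.resize_token_embeddings(32000 + 2)
new_emb = model.get_input_embeddings().weight

assert torch.equal(new_emb[:32000], old_rows)          # original vocab untouched
print("freshly initialized rows:", new_emb[32000:].shape)  # the two new tokens
```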
Summary
Expanding the vocab only changes the "dictionary size", not the "reading rules". The weight shape and calculation logic of the attention mechanism remain completely unchanged. The output layer (lm_head) changes in shape but not in functional logic.
3. What exactly changes in the output layer (lm_head)?
The line below resizes both embed_tokens and the weight matrix of lm_head (the language modeling head):

```python
model.resize_token_embeddings(config.text_config.vocab_size + 2, pad_shape)
```
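For context, the second positional argument of resize_token_embeddings is pad_to_multiple_of. Assuming pad_shape is 64 (the value used in the upstream LLaVA→HF conversion scripts; not verified against this repo), the two new tokens plus padding explain why the final vocab is 32064 rather than 32002:

```python
# Hedged sketch of the vocab-size arithmetic; pad_shape = 64 is an assumption.
pad_shape = 64                # assumed padding multiple (pad_to_multiple_of)
new_num_tokens = 32000 + 2    # original Mistral vocab + <image> and <pad>
padded = ((new_num_tokens + pad_shape - 1) // pad_shape) * pad_shape  # ceil to multiple of 64
print(padded)                 # 32064
```

Padding to a multiple of 64 is a common efficiency choice (tensor-core friendly matrix sizes); the extra padded IDs are not part of the tokenizer's vocabulary.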
Quick Start
```python
from transformers import LlavaForConditionalGeneration, AutoProcessor
from PIL import Image
import torch

model_path = "chaoyinshe/llava-med-v1.5-mistral-7b-hf"

model = LlavaForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # requires flash-attn (FA2)
    device_map="auto",                        # multi-GPU ready
)
processor = AutoProcessor.from_pretrained(model_path)

# Example inference
image = Image.open("chest_xray.png")  # replace with your own image
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is the main finding in this chest X-ray?"},
        ],
    }
]
prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(
    images=[image], text=prompt, return_tensors="pt"
).to(model.device, torch.bfloat16)

with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(out[0], skip_special_tokens=True))
```
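If flash-attn is not available in your environment, PyTorch's built-in SDPA implementation is a drop-in alternative; only the attn_implementation flag changes, everything else in the Quick Start stays the same:

```python
# Fallback when the flash-attn package is not installed: use PyTorch SDPA attention.
model = LlavaForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",
    device_map="auto",
)
```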
✅ Training Screenshot
(Note: The image below is for illustrative purposes only — actual training metrics may vary.) 🤗