Converted from mistral-community/pixtral-12b using BitsAndBytes NF4 (4-bit) quantization, without double quantization. Loading the model requires the bitsandbytes package.
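
For reference, the settings described above would correspond roughly to the following quantization config. This is a minimal sketch, not the exact script used for the conversion; the compute dtype is not stated in this card, so it is left at the library default here.

```python
from transformers import BitsAndBytesConfig

# NF4 4-bit quantization with double quantization turned off,
# matching the description above. Other options are left at their
# defaults, which is an assumption rather than a documented choice.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=False,
)
```

Passing a config like this as `quantization_config` to `from_pretrained` on the original mistral-community/pixtral-12b checkpoint should quantize the weights on load. The checkpoint in this repo is already quantized, so the usage example below does not need it.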

Example usage for image captioning:

from transformers import LlavaForConditionalGeneration, AutoProcessor, BitsAndBytesConfig
from PIL import Image
import time

# Load model
model_id = "SeanScripts/pixtral-12b-nf4"
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    use_safetensors=True,
    device_map="cuda:0"
)
# Load processor (handles both image preprocessing and tokenization)
processor = AutoProcessor.from_pretrained(model_id)

# Caption a local image
IMAGES = [Image.open("test.png").convert("RGB")]
# The prompt contains one [IMG] placeholder for the single image above
PROMPT = "<s>[INST]Caption this image:\n[IMG][/INST]"

inputs = processor(images=IMAGES, text=PROMPT, return_tensors="pt").to(model.device)
prompt_tokens = len(inputs['input_ids'][0])
print(f"Prompt tokens: {prompt_tokens}")

t0 = time.time()
generate_ids = model.generate(**inputs, max_new_tokens=512)
t1 = time.time()
total_time = t1 - t0
generated_tokens = len(generate_ids[0]) - prompt_tokens
tokens_per_second = generated_tokens / total_time
print(f"Generated {generated_tokens} tokens in {total_time:.3f} s ({tokens_per_second:.3f} tok/s)")

output = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(output)

On an RTX 4090, this runs at about 10 to 12 tok/s (without flash attention) and uses about 10 GB of VRAM. The captions seem pretty good, though I haven't tested very many.
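
As a rough sanity check on the VRAM figure, the weight storage can be estimated by hand. The sketch below assumes about 12 billion quantized parameters (taken from the "12b" in the model name), a quantization block size of 64, and one 32-bit scale per block, which is what 4-bit blockwise quantization without double quantization typically stores. None of these numbers were measured on this checkpoint.

```python
def nf4_weight_bytes(n_params: float, block_size: int = 64, scale_bits: int = 32) -> float:
    """Approximate bytes needed to store n_params weights in blockwise 4-bit form."""
    # Each weight takes a 4-bit code, plus its share of one scale per block.
    bits_per_param = 4 + scale_bits / block_size
    return n_params * bits_per_param / 8


approx_params = 12e9  # assumption: rounded from the model name, not measured
weights_gb = nf4_weight_bytes(approx_params) / 1e9
print(f"~{weights_gb:.2f} GB for quantized weights")
```

Under these assumptions the quantized weights alone come to roughly 6.75 GB. The rest of the observed usage would plausibly come from layers kept in higher precision, activations for the image tokens, the KV cache, and CUDA overhead.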

You can get a set of ComfyUI custom nodes for running this model here: https://github.com/SeanScripts/ComfyUI-PixtralLlamaVision
