metadata

license: apache-2.0
language:
  - en
base_model:
  - meta-llama/Llama-3.2-11B-Vision-Instruct

Introduction

This model is derived from Xkev/Llama-3.2V-11B-cot. This repository only quantizes that model to the NF4 (4-bit NormalFloat) format using the bitsandbytes library; the weights are otherwise unchanged. All credit goes to the original repository.
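
For readers who want to reproduce a similar checkpoint, the following is a minimal sketch of how an NF4 quantization with bitsandbytes is typically done through `transformers`. The exact options used for this repository are not documented here, so the compute dtype, the double-quantization flag, the output directory, and the helper name `quantize_and_save` are all assumptions chosen for illustration.

```python
# Sketch only: the settings actually used for this repository are not stated.
NF4_SETTINGS = {
    "load_in_4bit": True,
    "bnb_4bit_quant_type": "nf4",          # 4-bit NormalFloat
    "bnb_4bit_compute_dtype": "bfloat16",  # assumption: dtype used for matmuls
    "bnb_4bit_use_double_quant": True,     # assumption: also quantize the scales
}


def quantize_and_save(source_id="Xkev/Llama-3.2V-11B-cot",
                      output_dir="Llama-3.2V-11B-cot-nf4"):
    """Load the original model with on-the-fly NF4 quantization and save it.

    Hypothetical helper. Needs a CUDA GPU plus the transformers and
    bitsandbytes packages; imports are local so that defining it stays cheap.
    """
    from transformers import (
        AutoProcessor,
        BitsAndBytesConfig,
        MllamaForConditionalGeneration,
    )

    quant_config = BitsAndBytesConfig(**NF4_SETTINGS)
    model = MllamaForConditionalGeneration.from_pretrained(
        source_id,
        quantization_config=quant_config,
        device_map="cuda:0",
    )
    processor = AutoProcessor.from_pretrained(source_id)

    # Write the quantized weights and the processor files side by side.
    model.save_pretrained(output_dir)
    processor.save_pretrained(output_dir)
    return output_dir
```

Calling `quantize_and_save()` downloads the full-precision weights once. The saved directory can then be loaded without passing a quantization config, as in the Usage section.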

Usage

from transformers import MllamaForConditionalGeneration, AutoProcessor
from PIL import Image
import time

# Load the pre-quantized NF4 model (requires bitsandbytes and a CUDA GPU)
model_id = "zhangsongbo365/Llama-3.2V-11B-cot-nf4"
model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    use_safetensors=True,
    device_map="cuda:0",
    trust_remote_code=True
)

# Load the processor (tokenizer + image preprocessor)
processor = AutoProcessor.from_pretrained(model_id)

# Caption a local image
IMAGE = Image.open("1.png").convert("RGB")  # change to your actual image path
PROMPT = """<|begin_of_text|><|start_header_id|>user<|end_header_id|>
Caption this image:
<|image|><|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""

inputs = processor(IMAGE, PROMPT, return_tensors="pt").to(model.device)
prompt_tokens = len(inputs['input_ids'][0])
print(f"Prompt tokens: {prompt_tokens}")

t0 = time.time()
generate_ids = model.generate(**inputs, max_new_tokens=256)
t1 = time.time()
total_time = t1 - t0
generated_tokens = len(generate_ids[0]) - prompt_tokens
tokens_per_second = generated_tokens / total_time
print(f"Generated {generated_tokens} tokens in {total_time:.3f} s ({tokens_per_second:.3f} tok/s)")

# Decode only the newly generated tokens and drop the end-of-turn marker
output = processor.decode(generate_ids[0][prompt_tokens:]).replace('<|eot_id|>', '')
print(output)
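
As an alternative to writing the special tokens by hand, the prompt can usually be built with the processor's chat template. The sketch below assumes this repository's processor ships the same chat template as the base Llama 3.2 Vision Instruct model; `build_messages` and `build_inputs` are hypothetical helper names. Note that the template places the image before the text, which differs slightly from the hand-written prompt above.

```python
def build_messages(question):
    """Return a single-turn chat: one image slot followed by the text question."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": question},
            ],
        }
    ]


def build_inputs(processor, image, question, device):
    """Render the chat template and tokenize it together with the image.

    Assumes `processor` is the AutoProcessor loaded in the Usage section.
    """
    messages = build_messages(question)
    prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
    # The rendered template already starts with <|begin_of_text|>,
    # so the tokenizer is told not to add special tokens a second time.
    return processor(
        image,
        prompt,
        add_special_tokens=False,
        return_tensors="pt",
    ).to(device)
```

With the objects from the Usage section, `inputs = build_inputs(processor, IMAGE, "Caption this image:", model.device)` can replace the manual `PROMPT` string before calling `model.generate`.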