SmolVLM2-256M-Married-Qwen3-0.6B

This project is built upon SmolVLM2-256M and Qwen3-0.6B. The vision encoder is initialized with SigLIP2 weights, and a Connector module aligns its output with Qwen3-0.6B, leveraging Qwen3's capabilities to support Chinese descriptions.
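In SmolVLM/Idefics3-style models, the Connector is typically a pixel-shuffle token-merging step followed by a linear projection into the language model's embedding space. Below is a minimal sketch of that idea; the hidden sizes (768 for SigLIP2, 1024 for Qwen3-0.6B) are assumed typical values, not read from this checkpoint.

import torch
import torch.nn as nn

class Connector(nn.Module):
    """Sketch of a vision-language connector: merge neighbouring patch tokens
    via pixel shuffle, then project them into the LLM embedding space."""

    def __init__(self, vision_dim=768, llm_dim=1024, scale=2):
        super().__init__()
        self.scale = scale
        self.proj = nn.Linear(vision_dim * scale * scale, llm_dim)

    def forward(self, x):  # x: (B, N, vision_dim), with N = side * side patch tokens
        b, n, d = x.shape
        side = int(n ** 0.5)
        x = x.view(b, side, side, d)
        # group each scale x scale block of patches into a single, wider token
        x = x.view(b, side // self.scale, self.scale, side // self.scale, self.scale, d)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, (side // self.scale) ** 2, -1)
        return self.proj(x)  # (B, N / scale^2, llm_dim)

With these assumed sizes, 576 patch tokens of width 768 (a 24x24 grid) would be merged into 144 tokens of width 1024 before being handed to Qwen3.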

For training, the model is trained on the Objects365 dataset (1.7M+ images). Each image is annotated with one caption and three short Q&A pairs, to strengthen the model's visual understanding and cross-modal dialogue abilities.
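The exact annotation schema is not published here, but a single training sample presumably pairs an image with one caption and three short Q&A turns. A hypothetical example (field names and text are illustrative assumptions, not the project's actual schema):

sample = {
    "image": "objects365_v1_00045989.jpg",
    "caption": "几个人正坐在厨房里准备食物。",
    "qa": [
        {"question": "图中有几个人?", "answer": "三个人。"},
        {"question": "他们在做什么?", "answer": "在准备食物。"},
        {"question": "他们在什么地方?", "answer": "在厨房里。"},
    ],
}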

In terms of training strategy, the project jointly trains the last four layers of the SigLIP2 encoder, the Connector, and DoRA adapters on the Qwen3 backbone (Qwen3-DoRA).
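A rough outline of that freezing scheme is sketched below, assuming the checkpoint loads as an Idefics3ForConditionalGeneration (as in the usage snippet further down). The module names, the target regex, and the DoRA hyperparameters are assumptions for illustration, not the project's actual training code.

import torch
from peft import LoraConfig, get_peft_model
from transformers import Idefics3ForConditionalGeneration

model = Idefics3ForConditionalGeneration.from_pretrained(
    "TalkUHulk/SmolVLM2-256M-Married-Qwen3-0.6B", torch_dtype=torch.bfloat16
)

# DoRA (weight-decomposed LoRA) adapters on the Qwen3 attention projections.
# target_modules is a regex; the q/k/v/o_proj names follow transformers' naming.
dora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    use_dora=True,
    target_modules=r".*text_model.*\.(q_proj|k_proj|v_proj|o_proj)",
)

n_vision_layers = model.config.vision_config.num_hidden_layers
model = get_peft_model(model, dora_cfg)  # freezes everything except the DoRA adapters

# Re-enable gradients for the Connector and the last 4 SigLIP2 encoder layers.
last4 = {f"vision_model.encoder.layers.{i}." for i in range(n_vision_layers - 4, n_vision_layers)}
for name, param in model.named_parameters():
    if "connector" in name or any(tag in name for tag in last4):
        param.requires_grad = True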

It should be noted that this project primarily aims to reproduce and integrate existing techniques within the open-source community, exploring the composition and adaptation of lightweight multimodal models, and does not propose new methods or innovations.

Resources

Tensorboard

[TensorBoard screenshots]

How to get started

import re, json, json_repair
import sys
import cv2
import numpy as np
import torch
from PIL import Image
from transformers import AutoProcessor, Idefics3ForConditionalGeneration

# the custom processor ships with the model repo; point sys.path at a local clone
sys.path.append("TalkUHulk/SmolVLM2-256M-Married-Qwen3-0.6B")
from processor import SmolVLMQwen3Processor

def parse_box_content(text):
    # extract the JSON payload wrapped in <box>...</box> from the generated text
    box_match = re.search(r'<box>(.*?)</box>', text, re.DOTALL)
    if not box_match:
        return None

    box_content = box_match.group(1).strip()
    try:
        # repair minor JSON formatting issues before parsing
        box_data = json.loads(json_repair.repair_json(box_content))
        return box_data
    except json.JSONDecodeError as e:
        print(f"JSON parse error: {e}")
        return None

def resize_with_padding(image, target_size=384):
    # keep the aspect ratio, then pad the bottom/right with black to a square canvas
    h, w = image.shape[:2]

    scale = min(target_size / h, target_size / w)
    new_h, new_w = int(h * scale), int(w * scale)

    resized = cv2.resize(image, (new_w, new_h), interpolation=cv2.INTER_AREA)

    canvas = np.zeros((target_size, target_size, 3), dtype=np.uint8)
    canvas[0:new_h, 0:new_w] = resized

    return canvas



# register the custom processor class so AutoProcessor can resolve this repo
AutoProcessor.register("TalkUHulk/SmolVLM2-256M-Married-Qwen3-0.6B", SmolVLMQwen3Processor)
processor = AutoProcessor.from_pretrained("TalkUHulk/SmolVLM2-256M-Married-Qwen3-0.6B")

model = Idefics3ForConditionalGeneration.from_pretrained(
    "TalkUHulk/SmolVLM2-256M-Married-Qwen3-0.6B",
    torch_dtype=torch.bfloat16,
).to('cuda')


# load the test image, pad it to a 512x512 square, and convert BGR -> RGB for PIL
bgr = cv2.imread("./objects365_v1_00045989.jpg")
h, w, _ = bgr.shape
bgr_x512 = resize_with_padding(bgr, 512)
image_pil = Image.fromarray(cv2.cvtColor(bgr_x512, cv2.COLOR_BGR2RGB))

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "简短回复问题."},
            {"type": "image"},
            {"type": "text", "text": "请描述这张图片的内容,并检测其中的人"}
        ]
    }
]

text = processor.apply_chat_template(messages, add_generation_prompt=False).strip()
inputs = processor(text=text, images=image_pil, return_tensors="pt")
inputs = inputs.to('cuda')
generation_args = {
    "input_ids": inputs.input_ids,
    "pixel_values": inputs.pixel_values,
    "attention_mask": inputs.attention_mask,
    "num_return_sequences": 1,
    "no_repeat_ngram_size": 2,
    "max_new_tokens": 1024,
    "min_new_tokens": 16,   
    # "do_sample": False,
    # "temperature": 0.5,
}
output = model.generate(**generation_args)

generated_text = processor.decode(output[0], skip_special_tokens=True).strip()

# Due to the limited model size, the precision of the detection boxes is relatively low.
# Box coordinates are normalized to a 0-1000 range, so map them back to the original image size.
boxes = parse_box_content(generated_text)

for item in boxes or []:
    if "box" not in item:
        continue
    box = item["box"]
    cv2.rectangle(
        bgr,
        (int(box[0] / 1000 * w), int(box[1] / 1000 * h)),
        (int(box[2] / 1000 * w), int(box[3] / 1000 * h)),
        (0, 0, 255), 2)

cv2.imwrite("visual.jpg", bgr)
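
For reference, parse_box_content assumes detections are wrapped in a <box>...</box> block whose JSON items carry a "box" key with coordinates on a 0-1000 scale (implied by the /1000 scaling above). A hypothetical response, with illustrative wording and an assumed "label" key, might look like:

# 图中有两个人正在交谈。<box>[{"label": "人", "box": [112, 86, 430, 968]}, {"label": "人", "box": [510, 120, 880, 990]}]</box>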

Demos

[Demo images]
