SmolVLM2-256M-Married-Qwen3-0.6B
This project is built upon SmolVLM2-256M and Qwen3-0.6B, with the vision encoder initialized from SigLIP2 weights. A Connector module aligns the vision features with Qwen3-0.6B, leveraging Qwen3's capabilities to support Chinese descriptions.
The model is trained on the Objects365 dataset (1.7M+ images). Each image is annotated with one caption and three short Q&A pairs to strengthen the model's visual understanding and cross-modal dialogue abilities.
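For reference, a single training sample could be organized as follows. This is a purely hypothetical record to illustrate the one-caption-plus-three-QA layout; the field names are assumptions, not the project's published annotation schema.

{
    "image": "objects365_v1_00045989.jpg",
    "caption": "一名男子牵着狗走在街道上。",
    "qa": [
        {"question": "图中有几个人?", "answer": "一个人。"},
        {"question": "这个人在做什么?", "answer": "牵着狗散步。"},
        {"question": "场景发生在哪里?", "answer": "在街道上。"}
    ]
}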
In terms of training strategy, the project jointly trains the last four layers of the SigLIP2 encoder, the Connector, and DoRA adapters on Qwen3-0.6B, as sketched below.
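A minimal sketch of this freezing scheme follows. It is not the project's actual training code: "Qwen3-Dora" is interpreted here as DoRA adapters via PEFT (use_dora=True in LoraConfig), and the rank/alpha values, the target-module regex, and the attribute paths (model.model.vision_model, model.model.connector) are assumptions based on the Idefics3 architecture.

import torch
from transformers import Idefics3ForConditionalGeneration
from peft import LoraConfig, get_peft_model

model = Idefics3ForConditionalGeneration.from_pretrained(
    "TalkUHulk/SmolVLM2-256M-Married-Qwen3-0.6B", torch_dtype=torch.bfloat16
)

# Keep handles to the sub-modules that should stay trainable
# (attribute paths assumed from the Idefics3 architecture).
trainable = [*model.model.vision_model.encoder.layers[-4:], model.model.connector]

# DoRA adapters on the Qwen3 attention projections only; the regex is an
# assumption that keeps the vision tower untouched. r/alpha are illustrative.
dora_cfg = LoraConfig(
    r=16, lora_alpha=32, use_dora=True,
    target_modules=r".*text_model.*\.(q_proj|k_proj|v_proj|o_proj)",
)
model = get_peft_model(model, dora_cfg)  # freezes all non-adapter weights

# Re-enable gradients for the last 4 SigLIP2 layers and the Connector
for module in trainable:
    for p in module.parameters():
        p.requires_grad = True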
It should be noted that this project primarily aims to reproduce and integrate existing techniques within the open-source community and to explore the composition and adaptation of lightweight multimodal models; it does not propose new methods or innovations.
Resources
- Thanks to chenshaohon for sharing; I learned a lot from it
- Online Demo: SmolVLM2-256M-Married-Qwen3-0.6B-Demo
- Zhihu: SmolVLM-Married-Qwen3, a "Frankenstein" ultra-small multimodal Chinese model
How to get started
import re
import json
import sys

import cv2
import json_repair
import numpy as np
import torch
from PIL import Image
from transformers import AutoProcessor, Idefics3ForConditionalGeneration

# Custom processor shipped inside the model repo
sys.path.append("TalkUHulk/SmolVLM2-256M-Married-Qwen3-0.6B")
from processor import SmolVLMQwen3Processor
def parse_box_content(text):
    """Extract and parse the JSON list embedded in a <box>...</box> tag."""
    box_match = re.search(r'<box>(.*?)</box>', text, re.DOTALL)
    if not box_match:
        return None
    box_content = box_match.group(1).strip()
    try:
        # json_repair fixes minor JSON glitches the model may produce
        box_data = json.loads(json_repair.repair_json(box_content))
        print("box_data:", box_data, type(box_data))
        return box_data
    except json.JSONDecodeError as e:
        print(f"JSON parse error: {e}")
        return None
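# For illustration, a detection response is expected to embed a JSON list in a
# <box> tag, roughly like this (hypothetical example, inferred from the parsing
# and drawing logic in this script):
#   图中有一个人...<box>[{"label": "person", "box": [120, 85, 430, 960]}]</box>
# where the coordinates appear to be normalized to 0-1000.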
def resize_with_padding(image, target_size=384):
    """Resize keeping aspect ratio, then pad to a square canvas (top-left aligned)."""
    h, w = image.shape[:2]
    scale = min(target_size / h, target_size / w)
    new_h, new_w = int(h * scale), int(w * scale)
    resized = cv2.resize(image, (new_w, new_h), interpolation=cv2.INTER_AREA)
    canvas = np.zeros((target_size, target_size, 3), dtype=np.uint8)
    canvas[0:new_h, 0:new_w] = resized
    return canvas
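# Example: an 800x600 (w x h) input with target_size=512 is scaled by
# 0.64 to 512x384 and pasted at the top-left of a black 512x512 canvas.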
# Register the custom processor for this repo, then load processor and model
AutoProcessor.register("TalkUHulk/SmolVLM2-256M-Married-Qwen3-0.6B", SmolVLMQwen3Processor)
processor = AutoProcessor.from_pretrained("TalkUHulk/SmolVLM2-256M-Married-Qwen3-0.6B")
model = Idefics3ForConditionalGeneration.from_pretrained(
    "TalkUHulk/SmolVLM2-256M-Married-Qwen3-0.6B",
    torch_dtype=torch.bfloat16,
).to('cuda')
# Load the image, pad it to a 512x512 square, and convert BGR -> RGB for PIL
bgr = cv2.imread("./objects365_v1_00045989.jpg")
h, w, _ = bgr.shape
bgr_x512 = resize_with_padding(bgr, 512)
image_pil = Image.fromarray(cv2.cvtColor(bgr_x512, cv2.COLOR_BGR2RGB))
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "简短回复问题."},  # "Answer the question briefly."
            {"type": "image"},
            {"type": "text", "text": "请描述这张图片的内容,并检测其中的人"}  # "Describe this image and detect the people in it."
        ]
    }
]
text = processor.apply_chat_template(messages, add_generation_prompt=False).strip()
inputs = processor(text=text, images=image_pil, return_tensors="pt")
inputs = inputs.to('cuda')
generation_args = {
    "input_ids": inputs.input_ids,
    "pixel_values": inputs.pixel_values,
    "attention_mask": inputs.attention_mask,
    "num_return_sequences": 1,
    "no_repeat_ngram_size": 2,
    "max_new_tokens": 1024,
    "min_new_tokens": 16,
    # "do_sample": False,
    # "temperature": 0.5,
}
output = model.generate(**generation_args)
generated_text = processor.decode(output[0], skip_special_tokens=True).strip()
# Due to the limitation of model parameters, the precision of the detection boxes is relatively low.
bbox = parse_box_content(generated_text)
if bbox:  # parse_box_content returns None when no <box> tag is found
    for item in bbox:
        if "box" not in item:
            continue
        box = item["box"]
        # Box coordinates are normalized to 0-1000; map them back to the original size
        bgr = cv2.rectangle(
            bgr,
            (int(box[0] / 1000 * w), int(box[1] / 1000 * h)),
            (int(box[2] / 1000 * w), int(box[3] / 1000 * h)),
            (0, 0, 255), 2,
        )
cv2.imwrite("visual.jpg", bgr)
Some Demos