# Sarashina2.2-Vision-3B

Sarashina2.2-Vision-3B is a Japanese large vision-language model trained by SB Intuitions.
It is built on Sarashina2.2-3B-Instruct and the SigLIP image encoder.
## Model Performance

### Japanese Performance
| Model | Params (B) | BusinessSlideVQA*1 | Heron-Bench*1 | JDocQA*1 | JMMMU |
|---|---|---|---|---|---|
| Sarashina2.2-Vision-3B | 3.8 | 3.932 | 3.214 | 3.327 | 0.486 |
| Qwen2.5-VL-3B-Instruct | 3.8 | 3.516 | 2.000 | 3.019 | 0.450 |
| Qwen3-VL-4B-Instruct | 4.4 | 4.105 | 2.330 | 3.596 | 0.493 |
| InternVL3_5-4B | 4.7 | 3.311 | 1.893 | 2.626 | 0.437 |
| Sarashina2-Vision-14B | 14.4 | 3.110 | 2.184 | -*2 | 0.432 |
| Stockmark-2-VL-100B-beta | 96.5 | 3.973 | 2.563 | 3.168 | -*2 |
*1. gpt-oss-120b was used as the judge for LLM-as-a-Judge evaluation.
*2. These scores could not be measured because some inputs exceed the model's `max_position_embeddings`.
### English Performance
| Model | Params (B) | DocVQA | InfoVQA | RealWorldQA |
|---|---|---|---|---|
| Sarashina2.2-Vision-3B | 3.8 | 0.831 | 0.567 | 0.625 |
| Qwen2.5-VL-3B-Instruct | 3.8 | 0.924 | 0.750 | 0.586 |
| Qwen3-VL-4B-Instruct | 4.4 | 0.948 | 0.798 | 0.712 |
| InternVL3_5-4B | 4.7 | 0.823 | 0.541 | 0.553 |
| Sarashina2-Vision-14B | 14.4 | 0.729 | 0.490 | 0.519 |
## How to use
### 1. Install dependencies

```shell
pip install transformers==4.57.1 torch torchvision pillow protobuf sentencepiece accelerate
```
### 2. Inference

The following script loads the model and runs inference on a sample image.

```python
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, set_seed

# Define the model path
model_path = "sbintuitions/sarashina2.2-vision-3b"

# Load the model and processor
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)

set_seed(42)

image_url = "https://huggingface.co/sbintuitions/sarashina2.2-vision-3b/resolve/main/sample.jpg"
message = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": image_url,
            },
            {
                "type": "text",
                "text": "これはどこで撮った写真ですか？",  # "Where was this photo taken?"
            },
        ],
    }
]
text_prompt = processor.apply_chat_template(message, add_generation_prompt=True)
"""text_prompt: <|user|><|prefix|><|file|><|suffix|>これはどこで撮った写真ですか？</s><|assistant|>"""

image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")
inputs = processor(
    text=[text_prompt],
    images=[image],
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Inference: generate the output
output_ids = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.2,
)
# Keep only the newly generated tokens (generate() echoes the prompt)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, output_ids)
]
output_text = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
)
print(output_text[0])
"""
この写真は、**道後温泉本館（どうごおんせんほんかん）** の入り口を夜景で撮影した写真です。

---

場所の詳細：
- **名称**：道後温泉本館（Dogo Onsen Honkan）
- **所在地**：〒790-0842 愛媛県松山市道後湯之町1丁目3番5号
- **アクセス**：JR松山駅から市内電車「道後温泉駅」下車すぐ
- **特徴**：日本最古の温泉の一つとして知られる「道後温泉」の中心的な施設。国の重要文化財にも指定されています。

---

写真の特徴から判断した理由：
- 建物の屋根や装飾が伝統的な和風建築で、「道後温泉」の看板が目立つ。
- 入口の垂れ幕には「道後」と書かれており、白い鳳凰の模様が描かれている → 道後温泉の象徴的デザイン。
- 夜の照明と石灯籠、提灯風の灯りが日本の温泉地らしい雰囲気を醸し出している。
- 看板に「道後温泉」の文字が明確に表示されている。

---

補足情報：
道後温泉本館は、夏目漱石の小説『坊っちゃん』の舞台としても有名で、多くの観光客が訪れる人気スポットです。また、2020年にリニューアルされ、現代的な設備が導入されていますが、外観は伝統を残しています。

---

よって、この写真は **愛媛県松山市にある「道後温泉本館」の夜景** です。
"""
```
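Two mechanical details of the script above are easy to miss: `apply_chat_template` flattens the message list into a single tagged prompt string, and the list comprehension after `generate` slices off the echoed prompt so that only newly generated tokens are decoded. The sketch below reproduces both in plain Python; `render_prompt` is a hypothetical helper mimicking the prompt format shown in the comment above, not the processor's actual implementation.

```python
def render_prompt(message, add_generation_prompt=True):
    """Illustrative stand-in for apply_chat_template: flatten a chat
    message list into the tagged prompt string (NOT the real template)."""
    out = []
    for turn in message:
        if turn["role"] == "user":
            out.append("<|user|>")
            for item in turn["content"]:
                if item["type"] == "image":
                    # Images appear as placeholder tokens in the text stream.
                    out.append("<|prefix|><|file|><|suffix|>")
                elif item["type"] == "text":
                    out.append(item["text"])
            out.append("</s>")
    if add_generation_prompt:
        out.append("<|assistant|>")
    return "".join(out)


message = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "sample.jpg"},
            {"type": "text", "text": "これはどこで撮った写真ですか？"},
        ],
    }
]
print(render_prompt(message))
# <|user|><|prefix|><|file|><|suffix|>これはどこで撮った写真ですか？</s><|assistant|>

# Prompt stripping: generate() returns prompt + completion, so dropping the
# first len(input_ids) tokens keeps only the completion.
input_ids = [101, 7, 8, 9]           # toy prompt token ids
output_ids = [101, 7, 8, 9, 42, 43]  # toy generate() output (prompt echoed)
generated = output_ids[len(input_ids):]
print(generated)  # [42, 43]
```

This is also why `skip_special_tokens=True` is passed to `batch_decode`: the template tokens such as `<|assistant|>` carry no content for the reader.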
## Training

Sarashina2.2-Vision-3B was created through the following five-stage training process:

### Pretraining

- Projector Warmup: to bridge the gap between the text and image embedding spaces within the LLM
- Vision Encoder Pretraining: to enhance image comprehension, especially of Japan-specific images and text
- Full Model Pretraining: to enhance the model's unified understanding of images and language using interleaved data

### Post-training

- Supervised Fine-Tuning (SFT): to improve the model's ability to follow instructions and respond appropriately to user prompts
- Mixed Preference Optimization (MPO): to align the model's outputs with user preferences so that it generates more desirable responses
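As an illustration of the first stage, projector warmup typically freezes both the vision encoder and the LLM and updates only the projector that maps image features into the LLM's embedding space. Below is a minimal PyTorch sketch of that freezing pattern; the module names and sizes are invented for illustration, and this is not the actual training code.

```python
import torch.nn as nn

# Toy stand-ins for the three components (sizes are arbitrary).
vision_encoder = nn.Linear(768, 768)  # stands in for the SigLIP encoder
projector = nn.Linear(768, 2048)      # maps image features -> LLM embedding space
llm = nn.Linear(2048, 2048)           # stands in for Sarashina2.2-3B-Instruct

# Projector warmup: freeze everything except the projector.
for module in (vision_encoder, llm):
    for p in module.parameters():
        p.requires_grad = False

full_model = nn.ModuleDict(
    {"vision_encoder": vision_encoder, "projector": projector, "llm": llm}
)
trainable = [n for n, p in full_model.named_parameters() if p.requires_grad]
print(trainable)  # only the projector's weight and bias remain trainable
```

An optimizer built from `full_model.parameters()` would then only receive gradients for the projector, which is what lets the warmup stage run cheaply before the full-model pretraining stage unfreezes everything.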
## Limitations

This model has undergone only limited safety training. It may therefore generate meaningless sequences, inaccurate statements, or biased/objectionable outputs. Before deploying it, developers should tune the model based on human preferences and safety considerations.
## License