Sarashina2.2-Vision-3B

Sarashina2.2-Vision-3B is a Japanese Large Vision Language Model trained by SB Intuitions.

This model is built on Sarashina2.2-3B-Instruct and the SigLIP image encoder.

Model Performance

Japanese Performance

| Model | Params (B) | BusinessSlide VQA*1 | Heron-Bench*1 | JDocQA*1 | JMMMU |
|---|---|---|---|---|---|
| Sarashina2.2-Vision-3B | 3.8 | 3.932 | 3.214 | 3.327 | 0.486 |
| Qwen2.5-VL-3B-Instruct | 3.8 | 3.516 | 2.000 | 3.019 | 0.450 |
| Qwen3-VL-4B-Instruct | 4.4 | 4.105 | 2.330 | 3.596 | 0.493 |
| InternVL3_5-4B | 4.7 | 3.311 | 1.893 | 2.626 | 0.437 |
| Sarashina2-Vision-14B | 14.4 | 3.110 | 2.184 | -*2 | 0.432 |
| Stockmark-2-VL-100B-beta | 96.5 | 3.973 | 2.563 | 3.168 | -*2 |

*1. gpt-oss-120b was used as the judge for LLM-as-a-Judge scoring.

*2. These scores could not be measured because some inputs exceed the model's max_position_embeddings.

English Performance

| Model | Params (B) | DocVQA | InfoVQA | RealWorldQA |
|---|---|---|---|---|
| Sarashina2.2-Vision-3B | 3.8 | 0.831 | 0.567 | 0.625 |
| Qwen2.5-VL-3B-Instruct | 3.8 | 0.924 | 0.750 | 0.586 |
| Qwen3-VL-4B-Instruct | 4.4 | 0.948 | 0.798 | 0.712 |
| InternVL3_5-4B | 4.7 | 0.823 | 0.541 | 0.553 |
| Sarashina2-Vision-14B | 14.4 | 0.729 | 0.490 | 0.519 |

How to use

1. Install dependencies

pip install transformers==4.57.1 torch torchvision pillow protobuf sentencepiece accelerate
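
To confirm that the pinned dependencies were installed and a GPU is visible before loading the model, a quick sanity check such as the one below can help (this is a convenience snippet, not part of the official instructions):

import torch
import transformers

# The version printed here should match the pinned transformers==4.57.1 from the install step.
print("transformers:", transformers.__version__)
print("torch:", torch.__version__)
# The inference example below places the model on "cuda", so a GPU should be available.
print("CUDA available:", torch.cuda.is_available())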

2. Inference

The following script loads the model and runs inference on a sample image.

import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, set_seed

# Define model path
model_path = "sbintuitions/sarashina2.2-vision-3b"

# Load model and processor
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)
set_seed(42)

image_url = "https://huggingface.co/sbintuitions/sarashina2.2-vision-3b/resolve/main/sample.jpg"
message = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": image_url,
            },
            {
                "type": "text",
                "text": "ใ“ใ‚Œใฏใฉใ“ใงๆ’ฎใฃใŸๅ†™็œŸใงใ™ใ‹๏ผŸ",
            },
        ],
    }
]
text_prompt = processor.apply_chat_template(message, add_generation_prompt=True)
"""text_prompt: <|user|><|prefix|><|file|><|suffix|>ใ“ใ‚Œใฏใฉใ“ใงๆ’ฎใฃใŸๅ†™็œŸใงใ™ใ‹๏ผŸ</s><|assistant|>"""

image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")
inputs = processor(
    text=[text_prompt],
    images=[image],
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Inference: Generation of the output
output_ids = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.2,
)
generated_ids = [
    output_ids[len(input_ids) :] for input_ids, output_ids in zip(inputs.input_ids, output_ids)
]
output_text = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
)
print(output_text[0])
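# Example output (in Japanese): the model identifies the photo as a night view of the Dogo Onsen Honkan in Matsuyama, Ehime.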
"""
ใ“ใฎๅ†™็œŸใฏใ€**้“ๅพŒๆธฉๆณ‰ๆœฌ้คจ๏ผˆใฉใ†ใ”ใŠใ‚“ใ›ใ‚“ใปใ‚“ใ‹ใ‚“๏ผ‰** ใฎๅ…ฅใ‚Šๅฃใ‚’ๅคœๆ™ฏใงๆ’ฎๅฝฑใ—ใŸๅ†™็œŸใงใ™ใ€‚

---
 ๅ ดๆ‰€ใฎ่ฉณ็ดฐ๏ผš
- **ๅ็งฐ**๏ผš้“ๅพŒๆธฉๆณ‰ๆœฌ้คจ๏ผˆDogo Onsen Honkan๏ผ‰
- **ๆ‰€ๅœจๅœฐ**๏ผšใ€’790-0842 ๆ„›ๅช›็œŒๆพๅฑฑๅธ‚้“ๅพŒๆนฏไน‹็”บ1ไธ็›ฎ3็•ช5ๅท
- **ใ‚ขใ‚ฏใ‚ปใ‚น**๏ผšJRๆพๅฑฑ้ง…ใ‹ใ‚‰ๅธ‚ๅ†…้›ป่ปŠใ€Œ้“ๅพŒๆธฉๆณ‰้ง…ใ€ไธ‹่ปŠใ™ใ
- **็‰นๅพด**๏ผšๆ—ฅๆœฌๆœ€ๅคใฎๆธฉๆณ‰ใฎไธ€ใคใจใ—ใฆ็Ÿฅใ‚‰ใ‚Œใ‚‹ใ€Œ้“ๅพŒๆธฉๆณ‰ใ€ใฎไธญๅฟƒ็š„ใชๆ–ฝ่จญใ€‚ๅ›ฝใฎ้‡่ฆๆ–‡ๅŒ–่ฒกใซใ‚‚ๆŒ‡ๅฎšใ•ใ‚Œใฆใ„ใพใ™ใ€‚

---
 ๅ†™็œŸใฎ็‰นๅพดใ‹ใ‚‰ๅˆคๆ–ญใ—ใŸ็†็”ฑ๏ผš
- ๅปบ็‰ฉใฎๅฑ‹ๆ นใ‚„่ฃ…้ฃพใŒไผ็ตฑ็š„ใชๅ’Œ้ขจๅปบ็ฏ‰ใงใ€ใ€Œ้“ๅพŒๆธฉๆณ‰ใ€ใฎ็œ‹ๆฟใŒ็›ฎ็ซ‹ใคใ€‚
- ๅ…ฅๅฃใฎๅž‚ใ‚Œๅน•ใซใฏใ€Œ้“ๅพŒใ€ใ€Œ้“ๅพŒใ€ใจๆ›ธใ‹ใ‚ŒใฆใŠใ‚Šใ€็™ฝใ„้ณณๅ‡ฐใฎๆจกๆง˜ใŒๆใ‹ใ‚Œใฆใ„ใ‚‹ โ†’ ้“ๅพŒๆธฉๆณ‰ใฎ่ฑกๅพด็š„ใƒ‡ใ‚ถใ‚คใƒณใ€‚
- ๅคœใฎ็…งๆ˜Žใจ็Ÿณ็ฏ็ฑ ใ€ๆ็ฏ้ขจใฎ็ฏใ‚ŠใŒๆ—ฅๆœฌใฎๆธฉๆณ‰ๅœฐใ‚‰ใ—ใ„้›ฐๅ›ฒๆฐ—ใ‚’้†ธใ—ๅ‡บใ—ใฆใ„ใ‚‹ใ€‚
- ็œ‹ๆฟใซใ€Œ้“ๅพŒๆธฉๆณ‰ใ€ใฎๆ–‡ๅญ—ใŒๆ˜Ž็ขบใซ่กจ็คบใ•ใ‚Œใฆใ„ใ‚‹ใ€‚

---
 ่ฃœ่ถณๆƒ…ๅ ฑ๏ผš
้“ๅพŒๆธฉๆณ‰ๆœฌ้คจใฏใ€ๅค็›ฎๆผฑ็Ÿณใฎๅฐ่ชฌใ€ŽๅŠใฃใกใ‚ƒใ‚“ใ€ใฎ่ˆžๅฐใจใ—ใฆใ‚‚ๆœ‰ๅใงใ€ๅคšใใฎ่ฆณๅ…‰ๅฎขใŒ่จชใ‚Œใ‚‹ไบบๆฐ—ใ‚นใƒใƒƒใƒˆใงใ™ใ€‚ใพใŸใ€2020ๅนดใซใƒชใƒ‹ใƒฅใƒผใ‚ขใƒซใ•ใ‚Œใ€็พไปฃ็š„ใช่จญๅ‚™ใ‚‚ๅฐŽๅ…ฅใ•ใ‚Œใฆใ„ใพใ™ใŒใ€ๅค–่ฆณใฏไผ็ตฑใ‚’ๆฎ‹ใ—ใฆใ„ใพใ™ใ€‚

---
ใ‚ˆใฃใฆใ€ใ“ใฎๅ†™็œŸใฏ **ๆ„›ๅช›็œŒๆพๅฑฑๅธ‚ใซใ‚ใ‚‹ใ€Œ้“ๅพŒๆธฉๆณ‰ๆœฌ้คจใ€ใฎๅคœๆ™ฏ** ใงใ™ใ€‚
"""

Training

Sarashina2.2-Vision-3B was created through the following five-stage training process:

PreTrain

  1. Projector Warmup: To bridge the gap between the text and image embedding spaces within the LLM (a rough sketch of staged parameter freezing follows this list)
  2. Vision Encoder Pretraining: To enhance image comprehension, especially of Japan-specific images and text
  3. Full Model Pretraining: To enhance the model's unified understanding of images and language using interleaved data
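
The card does not specify which components are trainable at each pretraining stage. Purely as an illustration of how such staged training is often set up, the sketch below freezes and unfreezes parameter groups per stage; the attribute names vision_encoder, projector, and language_model are hypothetical, not the model's actual module names:

def configure_pretraining_stage(model, stage: int):
    # Freeze everything, then unfreeze only the components assumed to be trained in this stage.
    for p in model.parameters():
        p.requires_grad = False
    if stage == 1:
        # Projector warmup: only the image-to-text projector is updated.
        trainable = [model.projector]
    elif stage == 2:
        # Vision encoder pretraining: here the SigLIP encoder and projector are assumed trainable.
        trainable = [model.vision_encoder, model.projector]
    else:
        # Full model pretraining: the whole model is trained end to end on interleaved data.
        trainable = [model]
    for module in trainable:
        for p in module.parameters():
            p.requires_grad = True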

PostTrain

  1. Supervised Fine-Tuning (SFT): To improve the model's ability to follow instructions and respond appropriately to user prompts (an illustrative data-formatting sketch follows this list)
  2. Mixed Preference Optimization (MPO): To align the model's outputs with user preferences, ensuring it generates more desirable responses
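
The card does not describe the post-training data format. Purely as an illustration, and assuming an SFT sample reuses the same chat template as at inference time with the reference answer supplied as the assistant turn, formatting could look like this (the image path and texts are made up):

# `processor` is the AutoProcessor loaded in the inference example above.
sft_sample = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/train_image.jpg"},
            {"type": "text", "text": "What is shown in this photo?"},
        ],
    },
    {
        "role": "assistant",
        "content": [{"type": "text", "text": "A night view of a hot-spring bathhouse."}],
    },
]
# add_generation_prompt=False keeps the assistant answer in the rendered text as the supervision target.
training_text = processor.apply_chat_template(sft_sample, add_generation_prompt=False)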

Limitations

This model has undergone only limited safety training. It may therefore generate meaningless sequences, inaccurate statements, or biased/objectionable outputs. Before using it, developers should tune the model in line with human preferences and safety considerations.

LICENSE

MIT License
