csfufu
/

Revisual-R1-final

Image-Text-to-Text

text-generation-inference

Model card Files Files and versions

🌟 ReVisual-R1 (7B) — Open-Source Multimodal Reasoner

One cold-start, two RL stages, endless reasoning power.

🔑 Highlights

SOTA on 9 tough benchmarks covering visual–math + text reasoning.
Three-Stage SRO Training
1. Text Cold-Start — seed deep reflection
2. Multimodal RL — align vision & logic
3. Text RL — polish fluency & brevity
PAD (Prioritized Advantage Distillation) keeps gradients alive.
Efficient-Length Reward = concise, self-reflective CoT.

📚 Resources

Paper
Code

📌 Citation

@article{chen2025advancing,
  title={Advancing Multimodal Reasoning: From Optimized Cold Start to Staged Reinforcement Learning},
  author={Chen, Shuang and Guo, Yue and Su, Zhaochen and Li, Yafu and Wu, Yulun and Chen, Jiacheng and Chen, Jiayu and Wang, Weijie and Qu, Xiaoye and Cheng, Yu},
  journal={arXiv preprint arXiv:2506.04207},
  year={2025}
}

Take ReVisual-R1 for a spin and let us know what you build! 🎯

Downloads last month: 38

Safetensors

Model size

8B params

Tensor type

BF16

·

Model tree for csfufu/Revisual-R1-final

Base model

Qwen/Qwen2.5-VL-7B-Instruct

Finetuned

(826)

this model

Quantizations

Space using csfufu/Revisual-R1-final 1

Collection including csfufu/Revisual-R1-final

Revisual-R1

🚀ReVisual-R1 is a 7B open-source multimodal language model that follows a three-stage curriculum—cold-start pre-training, multimodal reinforcement. • 6 items • Updated 22 days ago • 3