VINO: A Unified Visual Generator with Interleaved OmniModal Context
Project Page • Paper • Code • Demo Video
VINO is a unified framework for image and video generation and editing, powered by a Vision-Language Model (VLM) and a Multi-Modal Diffusion Transformer (MMDiT).
One model, one set of weights, for all visual generation and editing tasks.
This Hugging Face repository provides the official VINO model weights. They are intended to be used with the VINO codebase:
https://github.com/SOTAMak1r/VINO-code
VINO depends on the following public checkpoints:
| Component | Source |
|---|---|
| VLM | Qwen/Qwen3-VL-4B-Instruct |
| Video VAE | hunyuanvideo-community/HunyuanVideo |
They will be automatically downloaded by the VINO codebase.
```bash
huggingface-cli download SOTAMak1r/VINO-weight \
  --local-dir ./checkpoints/SOTAMak1r/VINO-weight \
  --local-dir-use-symlinks False

python download.py --ak YOUR_HF_TOKEN
```
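The same download can also be scripted from Python. The following is a minimal sketch, assuming the `huggingface_hub` package is installed; it is not part of the VINO codebase, and the helper names `local_dir_for` and `download_weights` are hypothetical. It mirrors the target directory used by the CLI command above.

```python
from pathlib import Path
from typing import Optional

# Repository ID taken from the CLI command in this README.
REPO_ID = "SOTAMak1r/VINO-weight"


def local_dir_for(repo_id: str, root: str = "./checkpoints") -> Path:
    """Return the local target directory, mirroring the CLI layout (root/repo_id)."""
    return Path(root) / repo_id


def download_weights(repo_id: str = REPO_ID,
                     root: str = "./checkpoints",
                     token: Optional[str] = None) -> str:
    """Download a full snapshot of the repository and return its local path.

    Hypothetical helper: the import is deferred so that this module can be
    loaded even on machines where huggingface_hub is not installed.
    """
    from huggingface_hub import snapshot_download

    target = local_dir_for(repo_id, root)
    return snapshot_download(
        repo_id=repo_id,
        local_dir=str(target),
        token=token,  # optional; pass your Hugging Face token if required
    )
```

Calling `download_weights()` should place the files under `checkpoints/SOTAMak1r/VINO-weight`, the same directory the CLI command targets. Whether the VINO codebase expects exactly this location is best confirmed against its own instructions.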
See the full instructions at:
https://github.com/SOTAMak1r/VINO-code
```bibtex
@article{chen2026vino,
  title={VINO: A Unified Visual Generator with Interleaved OmniModal Context},
  author={Chen, Junyi and He, Tong and Fu, Zhoujie and Wan, Pengfei and Gai, Kun and Ye, Weicai},
  journal={arXiv preprint arXiv:2601.02358},
  year={2026}
}
```