# DnD-Transformer: ✨ A Spark of Vision-Language Intelligence

🤗 Model | 🤗 Dataset (Coming Soon) | 📄 Paper | 💻 Github
## Updates

- 2024-10-08: Release models and inference code
- 2024-10-04: Release paper
## What's New?

- A better AR image generation paradigm and transformer model structure based on 2D autoregression. It generates images of higher quality without increasing the computation budget.
- A spark of vision-language intelligence, for the first time enabling unconditional rich-text image generation that outperforms diffusion models such as DDPM and Stable Diffusion on dedicated rich-text image datasets, highlighting the distinct advantage of autoregressive models for multimodal modeling.
## Models

### DnD-Tokenizers (VQ)

#### Text-Image

| Code Size | Link |
|---|---|
| 24x24x1 | 🤗 |
ImageNet
Code Size | Link | rFID |
---|---|---|
16x16x2 | π€ | 0.92 |
#### arXiv-Image

Coming soon~
### DnD-Transformers (GPT)

#### Text-Image

| Code Shape | Model Size | Link |
|---|---|---|
| 24x24x1 | XXL | 🤗 |
#### ImageNet

#### arXiv-Image

Coming soon~
## Setup

```bash
conda create -n DnD python=3.10
conda activate DnD
pip install -r requirements.txt
```
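The released checkpoint links are listed in the Models section above. As a minimal sketch (assuming the checkpoints are hosted as a Hub model repository; the repository id below is a placeholder, not the real one), they can be fetched with `huggingface-cli`:

```bash
# Sketch only: replace <org>/<repo-name> with the actual Hub repository id
# linked from the Models tables above.
pip install -U "huggingface_hub[cli]"
huggingface-cli download <org>/<repo-name> --local-dir ./checkpoints
```

The downloaded paths can then be plugged into the sampling scripts described in the Inference section below.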
## Inference

### Sampling Text-Image Examples

```bash
cd ./src
bash ./scripts/sampling_dnd_transformer_text_image.sh  # edit the paths to the VQ model checkpoint and the DnD-Transformer checkpoint
```

### Sampling ImageNet Examples

```bash
cd ./src
bash ./scripts/sampling_dnd_transformer_imagenet.sh  # edit the paths to the VQ model checkpoint and the DnD-Transformer checkpoint
```

An `.npz` file is saved after generating 50k images; you can then follow https://github.com/openai/guided-diffusion/tree/main/evaluations to compute the FID of the generated samples.
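For reference, here is a minimal sketch of that FID computation with the guided-diffusion evaluation suite (assuming you clone that repository, install its evaluation requirements, and download the matching ImageNet reference batch; the sample filename below is a placeholder for the `.npz` produced by the script above):

```bash
# Sketch only: filenames are placeholders.
git clone https://github.com/openai/guided-diffusion.git
cd guided-diffusion/evaluations
pip install -r requirements.txt  # TensorFlow-based evaluator
# Download the ImageNet reference batch (e.g. VIRTUAL_imagenet256_labeled.npz)
# from the links in that folder's README, then:
python evaluator.py VIRTUAL_imagenet256_labeled.npz /path/to/generated_samples.npz
```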
## Training

Training code and the dataset are coming soon!
## Reference

```bibtex
@misc{chen2024sparkvisionlanguageintelligence2dimensional,
  title={A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive Transformer for Efficient Finegrained Image Generation},
  author={Liang Chen and Sinan Tan and Zefan Cai and Weichu Xie and Haozhe Zhao and Yichi Zhang and Junyang Lin and Jinze Bai and Tianyu Liu and Baobao Chang},
  year={2024},
  eprint={2410.01912},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2410.01912},
}
```