# Eruku - Autoregressive Styled Text Image Generation

Eruku is a state-of-the-art autoregressive model for styled handwritten and typewritten text image generation. Given a style reference image and text to generate, it produces high-quality text images that faithfully replicate the input style.
## Key Features
- Zero-shot style transfer: No training required for new styles
- No transcription required: Works with just a style image (transcription optional but helps)
- Reliable generation: Proper EOG (End of Generation) mechanism prevents artifacts
- Arbitrary length: Generate text of any length
- High fidelity: Excellent style consistency and text readability
- Classifier-Free Guidance: Fine control over generation quality
## Installation

```bash
pip install torch torchvision transformers diffusers einops pillow
```
## Quick Start

```python
from transformers import AutoModel
from PIL import Image
import torch

# Load the model
device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModel.from_pretrained(
    "blowing-up-groundhogs/eruku",
    trust_remote_code=True,
)
model.to(device)
model.eval()

# Load a style image (handwritten/typewritten text sample)
style_image = Image.open("style_sample.png")

# Generate text in that style
result = model.generate_handwriting(
    style_image=style_image,
    gen_text="Hello, World!",
    style_text="",   # Optional: transcription of the style image
    cfg_scale=1.25,  # Classifier-free guidance scale
)

# Save the result
result.save("generated.png")
```
## Detailed Usage

### Input Format

The model takes three inputs:

1. **Style Image** (`style_image`): A PIL Image containing handwritten or typewritten text that serves as the style reference. The model will replicate this style.
2. **Generation Text** (`gen_text`): The text you want to render in the extracted style.
3. **Style Text** (`style_text`, optional): The transcription of the text in the style image. Providing this helps the model better understand the style, but it's not required (see the sketch below).
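As an illustration, here is the Quick Start call again with the style transcription supplied (a minimal sketch reusing `model` and `style_image` from above; it also assumes `generate_handwriting` accepts the `max_new_tokens` parameter listed in the table below):

```python
# Same call as the Quick Start, but with the style transcription supplied,
# which helps the model separate the style from the content.
result = model.generate_handwriting(
    style_image=style_image,
    gen_text="A new sentence rendered in the reference style.",
    style_text="The quick brown fox",  # what the style sample actually says
    cfg_scale=1.25,
    max_new_tokens=512,  # assumption: forwarded to the underlying generate()
)
result.save("generated_with_transcription.png")
```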
### Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `style_image` | `PIL.Image` | Required | Reference style image |
| `gen_text` | `str` | Required | Text to generate |
| `style_text` | `str` | `""` | Optional transcription of the style image |
| `cfg_scale` | `float` | `1.25` | Classifier-free guidance scale |
| `max_new_tokens` | `int` | `512` | Maximum number of generation tokens |
### CFG Scale Guide

- `1.0`: No guidance (faster, but may drift from the prompt)
- `1.25`: Recommended default; a good balance
- `1.5`-`2.0`: Stronger adherence to the prompt
- `> 2.0`: May cause artifacts
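The sweet spot varies by style image, so it can help to sweep the scale on a fixed prompt and compare the outputs side by side (a quick sketch reusing `model` and `style_image` from the Quick Start):

```python
# Render the same text at several guidance strengths for comparison.
for cfg in (1.0, 1.25, 1.5, 2.0):
    img = model.generate_handwriting(
        style_image=style_image,
        gen_text="The quick brown fox",
        cfg_scale=cfg,
    )
    img.save(f"cfg_{cfg}.png")  # e.g. cfg_1.25.png
```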
## Example Results
The model excels at:
- Handwritten text in various styles (cursive, print, mixed)
- Typewritten text with different fonts
- Multi-language text (trained primarily on English)
- Long text sequences
## Model Architecture
Eruku combines:
- T5-Large encoder-decoder for text understanding and autoregressive generation
- VAE (Variational Autoencoder) for image encoding and decoding
- Custom embeddings for style transfer and special tokens (SOS, SOG, EOG)
The model generates images autoregressively, predicting one latent slice at a time until it produces an EOG (End of Generation) token.
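In pseudocode, that loop looks roughly like this (a conceptual sketch only; `predict_next_slice`, `is_eog`, and the `vae.decode` call are hypothetical names standing in for the model's actual internals):

```python
# Conceptual sketch of the autoregressive loop with an EOG stopping check.
# All helper names are hypothetical, for illustration only.
def generate_latents(model, vae, context, max_new_tokens=512):
    slices = []
    for _ in range(max_new_tokens):
        nxt = model.predict_next_slice(context)  # predict the next latent slice
        if model.is_eog(nxt):                    # End of Generation reached?
            break                                # stop cleanly instead of producing artifacts
        slices.append(nxt)
        context.append(nxt)                      # condition on what was generated so far
    return vae.decode(slices)                    # decode latent slices into an image
```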
## Advanced Usage

### Lower-level API

For more control, you can use the lower-level methods:
```python
import torch
from PIL import Image
from torchvision import transforms as T

# Prepare the style image manually: fixed height of 64 px, width scaled
# to preserve the aspect ratio
style_img = Image.open("style.png").convert("RGB")
width, height = style_img.size
new_width = int(64 * width / height)
style_img = style_img.resize((new_width, 64), Image.LANCZOS)
style_tensor = T.ToTensor()(style_img).to(device)

# Get model inputs
inputs = model.get_model_inputs(
    style_img=[style_tensor],
    style_len=style_tensor.shape[-1],
    max_img_len=1024 * 1024,
)

# Generate with full control
with torch.inference_mode():
    output_img, special_sequence = model.generate(
        decoder_inputs_embeds_vae=inputs["decoder_inputs_embeds"],
        style_text=["Style text here"],
        gen_text=["Text to generate"],
        cfg_scale=1.25,
        max_new_tokens=512,
    )
```
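Unlike `generate_handwriting`, the low-level `generate` appears to return a tensor rather than a PIL image; treat the exact format as an assumption and inspect `output_img` first. Assuming a `(C, H, W)` tensor in `[0, 1]`, converting it back for saving could look like:

```python
from torchvision.transforms.functional import to_pil_image

# Assumption: output_img is a (C, H, W) tensor in [0, 1]; squeeze(0) also
# handles a leading batch dimension if the model returns (1, C, H, W).
pil_out = to_pil_image(output_img.squeeze(0).cpu())
pil_out.save("generated_lowlevel.png")
```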
## Citation

If you use Eruku in your research, please cite both papers:
```bibtex
@inproceedings{pippi2025zeroshot,
  author    = {Pippi, Vittorio and Quattrini, Fabio and Cascianelli, Silvia and Tonioni, Alessio and Cucchiara, Rita},
  title     = {Zero-Shot Styled Text Image Generation, but Make It Autoregressive},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2025},
  pages     = {7910-7919}
}

@inproceedings{zaccagnino2026autoregressive,
  author    = {Zaccagnino, Carmine and Quattrini, Fabio and Pippi, Vittorio and Cascianelli, Silvia and Tonioni, Alessio and Cucchiara, Rita},
  title     = {Autoregressive Styled Text Image Generation, but Make it Reliable},
  booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
  month     = {March},
  year      = {2026}
}
```
## Links

- Paper: arXiv:2510.23240
- Project Website: eruku.carminezacc.com
- Demo: Hugging Face Space
- VAE Model: blowing-up-groundhogs/emuru_vae
## License

This model is released under the Apache 2.0 License.
## Acknowledgments

- T5: google-t5/t5-large
- VAE: blowing-up-groundhogs/emuru_vae
- Training datasets: IAM, CVL, RIMES, FontSquare