Eruku - Autoregressive Styled Text Image Generation

CVPR 2025 · WACV 2026 · Apache 2.0 License

Eruku is a state-of-the-art autoregressive model for styled handwritten and typewritten text image generation. Given a style reference image and text to generate, it produces high-quality text images that faithfully replicate the input style.

🌟 Key Features

  • Zero-shot style transfer: No training required for new styles
  • No transcription required: Works with just a style image (transcription optional but helps)
  • Reliable generation: Proper EOG (End of Generation) mechanism prevents artifacts
  • Arbitrary length: Generate text of any length
  • High fidelity: Excellent style consistency and text readability
  • Classifier-Free Guidance: Fine control over generation quality

πŸ“¦ Installation

pip install torch torchvision transformers diffusers einops pillow

πŸš€ Quick Start

from transformers import AutoModel
from PIL import Image
import torch

# Load model
device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModel.from_pretrained(
    "blowing-up-groundhogs/eruku", 
    trust_remote_code=True
)
model.to(device)
model.eval()

# Load a style image (handwritten/typewritten text sample)
style_image = Image.open("style_sample.png")

# Generate text in that style
result = model.generate_handwriting(
    style_image=style_image,
    gen_text="Hello, World!",
    style_text="",  # Optional: transcription of style image
    cfg_scale=1.25,  # Classifier-free guidance scale
)

# Save the result
result.save("generated.png")

πŸ“– Detailed Usage

Input Format

The model takes three inputs:

  1. Style Image (style_image): A PIL Image containing handwritten or typewritten text that serves as the style reference. The model will replicate this style.

  2. Generation Text (gen_text): The text you want to render in the extracted style.

  3. Style Text (style_text, optional): The transcription of the text in the style image. Providing this helps the model better understand the style, but it's not required.
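If you have the transcription, pass it via style_text. A minimal sketch (reusing model and style_image from the Quick Start; the prompt strings are illustrative):

# Supplying the style transcription (optional, but can improve style fidelity)
result = model.generate_handwriting(
    style_image=style_image,
    gen_text="Meet me at noon.",
    style_text="The quick brown fox",  # what the style image actually says
    cfg_scale=1.25,
)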

Parameters

Parameter        Type       Default   Description
style_image      PIL.Image  Required  Reference style image
gen_text         str        Required  Text to generate
style_text       str        ""        Optional transcription of the style image
cfg_scale        float      1.25      Classifier-free guidance scale
max_new_tokens   int        512       Maximum number of tokens to generate
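A call using every parameter from the table (a hedged example; the values are illustrative, not tuned):

result = model.generate_handwriting(
    style_image=style_image,
    gen_text="Any text, rendered in the reference style.",
    style_text="",          # no transcription available
    cfg_scale=1.5,          # stronger prompt adherence
    max_new_tokens=256,     # cap on generated tokens, i.e. output length
)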

CFG Scale Guide

  • 1.0: No guidance (faster but may drift from prompt)
  • 1.25: Recommended default - good balance
  • 1.5-2.0: Stronger adherence to prompt
  • >2.0: May cause artifacts
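To pick a scale empirically, sweep a few values on the same prompt and compare the saved outputs (a minimal sketch reusing model and style_image from the Quick Start):

# Generate the same text at several guidance scales
for scale in [1.0, 1.25, 1.5, 2.0]:
    img = model.generate_handwriting(
        style_image=style_image,
        gen_text="guidance sweep",
        cfg_scale=scale,
    )
    img.save(f"cfg_{scale}.png")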

πŸ–ΌοΈ Example Results

The model excels at:

  • Handwritten text in various styles (cursive, print, mixed)
  • Typewritten text with different fonts
  • Multi-language text (trained primarily on English)
  • Long text sequences

πŸ“Š Model Architecture

Eruku combines:

  • T5-Large encoder-decoder for text understanding and autoregressive generation
  • VAE (Variational Autoencoder) for image encoding and decoding
  • Custom embeddings for style transfer and special tokens (SOS, SOG, EOG)

The model generates images autoregressively, predicting one latent slice at a time until it produces an EOG (End of Generation) token.
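The sketch below illustrates the shape of that loop only; predict_next_slice and is_eog are hypothetical stand-ins for the model's internals, not part of its API:

import torch

def autoregressive_decode(predict_next_slice, is_eog, start_token, max_new_tokens=512):
    # Emit one latent slice at a time, conditioning on everything generated so far,
    # and stop cleanly when EOG appears instead of padding the image with artifacts.
    slices = [start_token]
    for _ in range(max_new_tokens):
        nxt = predict_next_slice(torch.stack(slices))
        if is_eog(nxt):
            break
        slices.append(nxt)
    return torch.stack(slices[1:])  # latent sequence for the VAE decoder to render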

πŸ”§ Advanced Usage

Lower-level API

For more control, you can use the lower-level methods:

import torch
from PIL import Image
from torchvision import transforms as T

# Prepare the style image manually: resize to a height of 64 px,
# preserving the aspect ratio
style_img = Image.open("style.png").convert('RGB')
width, height = style_img.size
new_width = int(64 * width / height)
style_img = style_img.resize((new_width, 64), Image.LANCZOS)
style_tensor = T.ToTensor()(style_img).to(device)  # device as in the Quick Start

# Build decoder inputs from the style image
inputs = model.get_model_inputs(
    style_img=[style_tensor],          # list of style tensors (batch of 1)
    style_len=style_tensor.shape[-1],  # style image width in pixels
    max_img_len=1024*1024
)

# Generate with full control
with torch.inference_mode():
    output_img, special_sequence = model.generate(
        decoder_inputs_embeds_vae=inputs['decoder_inputs_embeds'],
        style_text=["Style text here"],
        gen_text=["Text to generate"],
        cfg_scale=1.25,
        max_new_tokens=512
    )
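The exact return format of generate isn't documented here; assuming output_img is a float image tensor in [0, 1] (either CHW or BCHW), torchvision can save it directly:

from torchvision.utils import save_image

# Assumption: output_img is a (C, H, W) or (B, C, H, W) tensor with values in [0, 1]
save_image(output_img, "generated_lowlevel.png")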

πŸ“š Citation

If you use Eruku in your research, please cite both papers:

@inproceedings{pippi2025zeroshot,
    author    = {Pippi, Vittorio and Quattrini, Fabio and Cascianelli, Silvia and Tonioni, Alessio and Cucchiara, Rita},
    title     = {Zero-Shot Styled Text Image Generation, but Make It Autoregressive},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2025},
    pages     = {7910-7919}
}

@inproceedings{zaccagnino2026autoregressive,
    author    = {Zaccagnino, Carmine and Quattrini, Fabio and Pippi, Vittorio and Cascianelli, Silvia and Tonioni, Alessio and Cucchiara, Rita},
    title     = {Autoregressive Styled Text Image Generation, but Make it Reliable},
    booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
    month     = {March},
    year      = {2026}
}

πŸ”— Links

πŸ“œ License

This model is released under the Apache 2.0 License.

πŸ™ Acknowledgments

  • T5: google-t5/t5-large
  • VAE: blowing-up-groundhogs/emuru_vae
  • Training datasets: IAM, CVL, RIMES, FontSquare