Eruku - Autoregressive Styled Text Image Generation

CVPR 2025 · WACV 2026 · Apache 2.0 License

Eruku is a state-of-the-art autoregressive model for styled handwritten and typewritten text image generation. Given a style reference image and text to generate, it produces high-quality text images that faithfully replicate the input style.

🌟 Key Features

  • Zero-shot style transfer: No training required for new styles
  • No transcription required: Works with just a style image (transcription optional but helps)
  • Reliable generation: Proper EOG (End of Generation) mechanism prevents artifacts
  • Arbitrary length: Generate text of any length
  • High fidelity: Excellent style consistency and text readability
  • Classifier-Free Guidance: Fine control over generation quality

πŸ“¦ Installation

pip install torch torchvision transformers diffusers einops pillow

πŸš€ Quick Start

from transformers import AutoModel
from PIL import Image
import torch

# Load model
device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModel.from_pretrained(
    "blowing-up-groundhogs/eruku", 
    trust_remote_code=True
)
model.to(device)
model.eval()

# Load a style image (handwritten/typewritten text sample)
style_image = Image.open("style_sample.png")

# Generate text in that style
result = model.generate_handwriting(
    style_image=style_image,
    gen_text="Hello, World!",
    style_text="",  # Optional: transcription of style image
    cfg_scale=1.25,  # Classifier-free guidance scale
)

# Save the result
result.save("generated.png")

πŸ“– Detailed Usage

Input Format

The model takes three inputs:

  1. Style Image (style_image): A PIL Image containing handwritten or typewritten text that serves as the style reference. The model will replicate this style.

  2. Generation Text (gen_text): The text you want to render in the extracted style.

  3. Style Text (style_text, optional): The transcription of the text in the style image. Providing this helps the model better understand the style, but it's not required.
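If you have the transcription, pass it via style_text. A minimal sketch (reusing model and style_image from the Quick Start; the prompt strings are illustrative):

# Supplying the style transcription (optional, but can improve style fidelity)
result = model.generate_handwriting(
    style_image=style_image,
    gen_text="Meet me at noon.",
    style_text="The quick brown fox",  # what the style image actually says
    cfg_scale=1.25,
)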

Parameters

Parameter        Type       Default   Description
style_image      PIL.Image  Required  Reference style image
gen_text         str        Required  Text to generate
style_text       str        ""        Optional transcription of the style image
cfg_scale        float      1.25      Classifier-free guidance scale
max_new_tokens   int        512       Maximum number of tokens to generate
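A call using every parameter from the table (a hedged example; the values are illustrative, not tuned):

result = model.generate_handwriting(
    style_image=style_image,
    gen_text="Any text, rendered in the reference style.",
    style_text="",          # no transcription available
    cfg_scale=1.5,          # stronger prompt adherence
    max_new_tokens=256,     # cap on generated tokens, i.e. output length
)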

CFG Scale Guide

  • 1.0: No guidance (faster but may drift from prompt)
  • 1.25: Recommended default - good balance
  • 1.5-2.0: Stronger adherence to prompt
  • >2.0: May cause artifacts
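To pick a scale empirically, sweep a few values on the same prompt and compare the saved outputs (a minimal sketch reusing model and style_image from the Quick Start):

# Generate the same text at several guidance scales
for scale in [1.0, 1.25, 1.5, 2.0]:
    img = model.generate_handwriting(
        style_image=style_image,
        gen_text="guidance sweep",
        cfg_scale=scale,
    )
    img.save(f"cfg_{scale}.png")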

πŸ–ΌοΈ Example Results

The model excels at:

  • Handwritten text in various styles (cursive, print, mixed)
  • Typewritten text with different fonts
  • Multi-language text (trained primarily on English)
  • Long text sequences

πŸ“Š Model Architecture

Eruku combines:

  • T5-Large encoder-decoder for text understanding and autoregressive generation
  • VAE (Variational Autoencoder) for image encoding and decoding
  • Custom embeddings for style transfer and special tokens (SOS, SOG, EOG)

The model generates images autoregressively, predicting one latent slice at a time until it produces an EOG (End of Generation) token.
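The sketch below illustrates the shape of that loop only; predict_next_slice and is_eog are hypothetical stand-ins for the model's internals, not part of its API:

import torch

def autoregressive_decode(predict_next_slice, is_eog, start_token, max_new_tokens=512):
    # Emit one latent slice at a time, conditioning on everything generated so far,
    # and stop cleanly when EOG appears instead of padding the image with artifacts.
    slices = [start_token]
    for _ in range(max_new_tokens):
        nxt = predict_next_slice(torch.stack(slices))
        if is_eog(nxt):
            break
        slices.append(nxt)
    return torch.stack(slices[1:])  # latent sequence for the VAE decoder to render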

πŸ”§ Advanced Usage

Lower-level API

For more control, you can use the lower-level methods:

import torch
from PIL import Image
from torchvision import transforms as T

# Prepare the style image manually: resize to a height of 64 px,
# preserving the aspect ratio
style_img = Image.open("style.png").convert('RGB')
width, height = style_img.size
new_width = int(64 * width / height)
style_img = style_img.resize((new_width, 64), Image.LANCZOS)
style_tensor = T.ToTensor()(style_img).to(device)  # device as in the Quick Start

# Build decoder inputs from the style image
inputs = model.get_model_inputs(
    style_img=[style_tensor],          # list of style tensors (batch of 1)
    style_len=style_tensor.shape[-1],  # style image width in pixels
    max_img_len=1024*1024
)

# Generate with full control
with torch.inference_mode():
    output_img, special_sequence = model.generate(
        decoder_inputs_embeds_vae=inputs['decoder_inputs_embeds'],
        style_text=["Style text here"],
        gen_text=["Text to generate"],
        cfg_scale=1.25,
        max_new_tokens=512
    )
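The exact return format of generate isn't documented here; assuming output_img is a float image tensor in [0, 1] (either CHW or BCHW), torchvision can save it directly:

from torchvision.utils import save_image

# Assumption: output_img is a (C, H, W) or (B, C, H, W) tensor with values in [0, 1]
save_image(output_img, "generated_lowlevel.png")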

πŸ“š Citation

If you use Eruku in your research, please cite both papers:

@inproceedings{pippi2025zeroshot,
    author    = {Pippi, Vittorio and Quattrini, Fabio and Cascianelli, Silvia and Tonioni, Alessio and Cucchiara, Rita},
    title     = {Zero-Shot Styled Text Image Generation, but Make It Autoregressive},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2025},
    pages     = {7910-7919}
}

@inproceedings{zaccagnino2026autoregressive,
    author    = {Zaccagnino, Carmine and Quattrini, Fabio and Pippi, Vittorio and Cascianelli, Silvia and Tonioni, Alessio and Cucchiara, Rita},
    title     = {Autoregressive Styled Text Image Generation, but Make it Reliable},
    booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
    month     = {March},
    year      = {2026}
}

πŸ”— Links

πŸ“œ License

This model is released under the Apache 2.0 License.

πŸ™ Acknowledgments

  • T5: google-t5/t5-large
  • VAE: blowing-up-groundhogs/emuru_vae
  • Training datasets: IAM, CVL, RIMES, FontSquare