Improved Image Tokenizer
This is an improved image tokenizer for NextStep-1. Only the decoder was fine-tuned; the encoder was kept frozen. The decoder refinement improves performance while preserving robust reconstruction quality. We recommend this image tokenizer for best results with NextStep-1 models.
Usage
import torch
import torchvision.transforms as transforms
from PIL import Image
from autoencoder import AutoencoderKLNextStep
device = "cuda"
dtype = torch.bfloat16
model_path = "/path/to/vae_dir"
vae = AutoencoderKLNextStep.from_pretrained(model_path).to(device=device, dtype=dtype)
# Convert a PIL image to a float tensor and rescale it from [0, 1] to [-1, 1]
pil2tensor = transforms.Compose(
    [
        transforms.ToTensor(),
        transforms.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5]),
    ]
)
# Force 3 channels so that grayscale or RGBA inputs match the 3-channel Normalize
image = Image.open("/path/to/image.jpg").convert("RGB")
pixel_values = pil2tensor(image).unsqueeze(0).to(device=device, dtype=dtype)
# Inference only, so skip gradient tracking
with torch.no_grad():
    # encode
    latents = vae.encode(pixel_values).latent_dist.sample()
    # decode
    sampled_images = vae.decode(latents).sample
sampled_images = sampled_images.detach().cpu().to(torch.float32)
def tensor_to_pil(tensor):
    """Convert a (3, H, W) tensor with values in [-1, 1] to an RGB PIL image."""
    image = tensor.detach().cpu().to(torch.float32)
    image = (image / 2 + 0.5).clamp(0, 1)
    image = image.mul(255).round().to(dtype=torch.uint8)
    image = image.permute(1, 2, 0).numpy()
    return Image.fromarray(image, mode="RGB")
rec_image = tensor_to_pil(sampled_images[0])
rec_image.save("/path/to/output.jpg")
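The pre- and post-processing in the snippet above are inverses at the pixel-value level: `Normalize` with mean 0.5 and std 0.5 maps [0, 1] to [-1, 1], and `tensor_to_pil` maps back with `x / 2 + 0.5`. The NumPy sketch below checks that round trip for every 8-bit level. The helper names `normalize` and `denormalize` are illustrative and not part of the NextStep-1 code, and the channel reordering done by `ToTensor` is left out.

```python
import numpy as np


def normalize(pixels_u8: np.ndarray) -> np.ndarray:
    """Value mapping of ToTensor() + Normalize(0.5, 0.5): uint8 [0, 255] -> float [-1, 1]."""
    x = pixels_u8.astype(np.float32) / 255.0
    return (x - 0.5) / 0.5


def denormalize(x: np.ndarray) -> np.ndarray:
    """Value mapping of tensor_to_pil: float [-1, 1] -> uint8 [0, 255], clamping overshoot."""
    y = np.clip(x / 2 + 0.5, 0.0, 1.0)
    return np.round(y * 255).astype(np.uint8)


# Every 8-bit intensity should survive the round trip unchanged.
levels = np.arange(256, dtype=np.uint8)
restored = denormalize(normalize(levels))
print(np.array_equal(levels, restored))

# Decoder outputs that overshoot [-1, 1] are clamped rather than wrapped.
clamped = denormalize(np.array([-1.5, 1.5], dtype=np.float32))
print(clamped)
```

The clamp matters in practice because a decoder output is not guaranteed to stay inside [-1, 1]; without it, the cast to `uint8` could wrap around.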
Evaluation
Reconstruction Performance on ImageNet-1K 256×256
| Tokenizer | Latent Shape | PSNR ↑ | SSIM ↑ | 
|---|---|---|---|
| Discrete Tokenizers | | | |
| SBER-MoVQGAN (270M) | 32×32 | 27.04 | 0.74 | 
| LlamaGen | 32×32 | 24.44 | 0.77 | 
| VAR | 680 | 22.12 | 0.62 | 
| TiTok-S-128 | 128 | 17.52 | 0.44 | 
| Selftok | 1024 | 26.30 | 0.81 |
| Continuous Tokenizers | | | |
| Stable Diffusion 1.5 | 32×32×4 | 25.18 | 0.73 | 
| Stable Diffusion XL | 32×32×4 | 26.22 | 0.77 | 
| Stable Diffusion 3 Medium | 32×32×16 | 30.00 | 0.88 | 
| Flux.1-dev | 32×32×16 | 31.64 | 0.91 | 
| NextStep-1 | 32×32×16 | 30.60 | 0.89 | 
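For readers who want to sanity-check a reconstruction against the PSNR column, the following is a minimal sketch of the standard PSNR definition, 10 · log10(peak² / MSE), for 8-bit images. The function name `psnr` and the peak value of 255 are assumptions made for illustration; the exact evaluation protocol behind the table (resizing, cropping, color space) is not described here, so numbers from this sketch are not guaranteed to match it.

```python
import numpy as np


def psnr(reference: np.ndarray, reconstruction: np.ndarray, peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two images of the same shape."""
    ref = reference.astype(np.float64)
    rec = reconstruction.astype(np.float64)
    mse = np.mean((ref - rec) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(peak ** 2 / mse))


# Toy check: a flat gray image versus the same image shifted by one intensity level.
gray = np.full((256, 256, 3), 128, dtype=np.uint8)
shifted = gray + 1  # every pixel differs by exactly 1, so MSE = 1
print(psnr(gray, shifted))
```

With an MSE of 1 the result is 20 · log10(255), about 48.13 dB. To apply it to the usage example, both the input image and `rec_image` would be converted with `np.asarray(...)` at the same resolution.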
Robustness of NextStep-1-f8ch16-Tokenizer
Impact of Noise Perturbation on Image Tokenizer Performance. The top panel displays quantitative metrics (rFID↓, PSNR↑, and SSIM↑) versus noise intensity. The bottom panel presents qualitative reconstruction examples at noise standard deviations of 0.2 and 0.5.
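A perturbation of this kind can be sketched as follows. The caption does not say where the noise is injected, so this sketch assumes additive Gaussian noise applied to the latents before decoding. The function name `perturb_latents`, the channel-first latent layout `(1, 16, 32, 32)`, and the zero-valued stand-in latents are all illustrative assumptions. NumPy is used so the example runs without model weights; with PyTorch tensors the same step would use `torch.randn_like`.

```python
import numpy as np


def perturb_latents(latents: np.ndarray, sigma: float, rng: np.random.Generator) -> np.ndarray:
    """Add zero-mean Gaussian noise with standard deviation `sigma` to the latents."""
    noise = rng.standard_normal(latents.shape).astype(latents.dtype)
    return latents + sigma * noise


rng = np.random.default_rng(0)
# Stand-in for encoder output: 16 channels on a 32x32 grid (assumed layout).
latents = np.zeros((1, 16, 32, 32), dtype=np.float32)

for sigma in (0.2, 0.5):  # the two noise levels named in the caption
    noisy = perturb_latents(latents, sigma, rng)
    print(sigma, noisy.shape, float(noisy.std()))
```

In a real robustness check, `noisy` would be passed to `vae.decode` and the reconstruction scored against the clean input.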