WAN 2.2 FP16 Text Encoders (GGUF Format)

High-precision FP16 text encoders for the WAN 2.2 (World Animated Network) video generation model in optimized GGUF format. These encoders provide enhanced text understanding and conditioning for high-quality text-to-video and image-to-video generation.

Model Description

This repository contains the UMT5-XXL text encoder component for WAN 2.2, optimized in FP16 precision using the GGUF format. The text encoder is a critical component that processes text prompts and generates embeddings that condition the video generation process.

Key Features:

  • FP16 Precision: Full 16-bit floating point precision for maximum quality
  • GGUF Format: Efficient memory-mapped format for faster loading and lower memory overhead
  • UMT5-XXL Architecture: Extra-large unified multilingual T5 model for superior text understanding
  • WAN 2.2 Compatible: Designed specifically for WAN 2.2 video generation pipeline

Capabilities:

  • Complex prompt understanding with nuanced semantic comprehension
  • Multilingual text encoding support
  • High-quality conditioning for video generation
  • Efficient inference with optimized format

Repository Contents

wan22-fp16-encoders-gguf/
└── text_encoders/
    └── umt5-xxl-encoder-f16.gguf    (10.59 GB)

Total Repository Size: ~10.6 GB

File Details

  • File: umt5-xxl-encoder-f16.gguf
  • Size: 10.59 GB
  • Format: GGUF
  • Precision: FP16
  • Purpose: UMT5-XXL text encoder
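
To fetch the encoder programmatically, the minimal sketch below uses the huggingface_hub client; the repository id "wangkanai/wan22-fp16-encoders-gguf" is the repository this card describes, and local_dir is only an example destination.

from huggingface_hub import hf_hub_download

# Download the single GGUF file from this repository (~10.6 GB)
encoder_path = hf_hub_download(
    repo_id="wangkanai/wan22-fp16-encoders-gguf",
    filename="text_encoders/umt5-xxl-encoder-f16.gguf",
    local_dir="./models",  # example destination; omit to use the default HF cache
)
print(encoder_path)  # resolved local path to the encoder file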

Hardware Requirements

Minimum Requirements

  • VRAM: 12 GB GPU memory (for encoder only)
  • System RAM: 16 GB
  • Disk Space: 11 GB free space
  • GPU: NVIDIA GPU with CUDA support (recommended)

Recommended Requirements

  • VRAM: 16+ GB GPU memory
  • System RAM: 32 GB
  • Disk Space: 20 GB free space (for encoder + model files)
  • GPU: NVIDIA RTX 3090/4090 or A100

Full WAN 2.2 Pipeline Requirements

When using with complete WAN 2.2 model:

  • VRAM: 40+ GB (encoder + transformer + VAE)
  • System RAM: 64 GB
  • Disk Space: 100+ GB for complete pipeline
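
As a quick pre-flight check against the numbers above, the sketch below reports free GPU memory and free disk space using only PyTorch and the Python standard library.

import shutil
import torch

# Compare available resources against the minimum requirements listed above
if torch.cuda.is_available():
    free_vram, total_vram = torch.cuda.mem_get_info()  # bytes, current device
    print(f"Free VRAM: {free_vram / 1e9:.1f} GB of {total_vram / 1e9:.1f} GB")
else:
    print("No CUDA device detected; CPU inference will be much slower.")

free_disk = shutil.disk_usage(".").free
print(f"Free disk space: {free_disk / 1e9:.1f} GB (the encoder alone needs ~11 GB)")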

Usage Examples

Basic Usage with WAN 2.2 Pipeline

from diffusers import WanPipeline
import torch

# Initialize pipeline with custom encoder path
pipe = WanPipeline.from_pretrained(
    "Lightricks/wan-2.2",
    text_encoder_path="E:/huggingface/wan22-fp16-encoders-gguf/text_encoders/umt5-xxl-encoder-f16.gguf",
    torch_dtype=torch.float16,
    variant="fp16"
)
pipe.to("cuda")

# Generate video from text prompt
prompt = "A cat walking on a beach at sunset, cinematic lighting, 4k quality"
video = pipe(
    prompt=prompt,
    num_frames=16,
    num_inference_steps=50,
    guidance_scale=7.5
).frames[0]  # first (and only) video in the batch

# Save video
from diffusers.utils import export_to_video
export_to_video(video, "output.mp4", fps=8)

Using with Custom Configuration

from diffusers import WanPipeline
from diffusers.utils import export_to_video
import torch

# Load with custom settings
pipe = WanPipeline.from_pretrained(
    "Lightricks/wan-2.2",
    text_encoder_path="E:/huggingface/wan22-fp16-encoders-gguf/text_encoders/umt5-xxl-encoder-f16.gguf",
    torch_dtype=torch.float16,
    use_safetensors=True
)

# Enable memory optimizations
pipe.enable_model_cpu_offload()
pipe.enable_vae_slicing()

# Generate with detailed prompt
prompt = """
A professional dancer performing contemporary dance in a modern studio,
smooth camera movement, dramatic lighting, high quality, 24fps
"""

video = pipe(
    prompt=prompt,
    negative_prompt="blurry, low quality, distorted, static",
    num_frames=24,
    height=512,
    width=512,
    num_inference_steps=75,
    guidance_scale=9.0
).frames[0]

export_to_video(video, "dance_video.mp4", fps=24)

Batch Processing Multiple Prompts

prompts = [
    "A serene mountain landscape at dawn",
    "City traffic at night with neon lights",
    "Ocean waves crashing on rocky shore"
]

videos = []
for prompt in prompts:
    video = pipe(
        prompt=prompt,
        num_frames=16,
        num_inference_steps=50
    ).frames[0]
    videos.append(video)

# Save all videos
for idx, video in enumerate(videos):
    export_to_video(video, f"video_{idx:03d}.mp4", fps=8)

Model Specifications

Text Encoder Architecture

  • Model: UMT5-XXL (Unified Multilingual T5 Extra Large)
  • Parameters: ~6 billion (encoder only)
  • Precision: FP16 (16-bit floating point)
  • Format: GGUF (GPT-Generated Unified Format)
  • Context Length: 512 tokens
  • Embedding Dimension: 4096

Format Details

  • Compatibility: works with llama.cpp-style GGUF readers and transformers GGUF loaders
  • Quantization: None (full FP16 precision maintained)
  • Memory Mapping: Enabled for efficient loading
  • Tensor Layout: Optimized for GPU inference
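
The header metadata and tensor table can be inspected directly with the gguf Python package (pip install gguf); the sketch below is illustrative and assumes the file path from the repository layout above.

from gguf import GGUFReader

# Open the file via memory mapping and read only the header and tensor index
reader = GGUFReader("text_encoders/umt5-xxl-encoder-f16.gguf")

# Key/value metadata stored in the header
for name in reader.fields:
    print(name)

# First few entries of the tensor table: name, shape, on-disk type
for tensor in reader.tensors[:5]:
    print(tensor.name, tensor.shape, tensor.tensor_type)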

Integration

  • Primary Framework: Diffusers (Hugging Face)
  • Compatible Libraries: transformers, llama.cpp, GGML
  • Pipeline: WAN 2.2 text-to-video and image-to-video
  • Device Support: CUDA, CPU (with reduced performance)

Performance Tips

Memory Optimization

# Enable CPU offloading for lower VRAM usage
pipe.enable_model_cpu_offload()

# Enable VAE slicing for memory efficiency
pipe.enable_vae_slicing()

# Use attention slicing for large resolutions
pipe.enable_attention_slicing(slice_size=1)

Speed Optimization

# Use xformers for faster attention (if installed)
pipe.enable_xformers_memory_efficient_attention()

# Reduce inference steps for faster generation
video = pipe(prompt, num_inference_steps=25)  # vs 50-75

# Use lower resolution for faster processing
video = pipe(prompt, height=256, width=256)

Quality Optimization

# Higher inference steps for better quality
video = pipe(prompt, num_inference_steps=100)

# Higher guidance scale for stronger prompt adherence
video = pipe(prompt, guidance_scale=12.0)

# Use FP16 precision for quality vs FP8/INT8
# (This encoder is already FP16)

GGUF Format Benefits

Advantages Over Standard Formats

  • Faster Loading: Memory-mapped file format reduces loading time by 2-3x
  • Lower Memory Overhead: Efficient tensor storage reduces RAM usage during loading
  • Better Compatibility: Works with multiple inference frameworks (transformers, llama.cpp)
  • Simplified Distribution: Single-file format easier to manage and distribute
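
The memory-mapping point is easy to see in practice: the sketch below maps the whole ~10.6 GB file without copying it into RAM and reads only the 4-byte GGUF magic from the header.

import mmap

with open("text_encoders/umt5-xxl-encoder-f16.gguf", "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # Only the touched pages are actually read from disk
        print(mm[:4])  # b'GGUF'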

Performance Characteristics

  • Loading Speed: ~5-10 seconds (vs 30-60 seconds for standard safetensors)
  • Memory Footprint: ~11 GB VRAM (vs ~13 GB for unoptimized formats)
  • Inference Speed: Equivalent to standard FP16 with optimized attention

License

This model is released under a custom license. Please review the license terms before use:

License Type: Other (WAN Model License)

Key Terms:

  • Research and commercial use permitted with attribution
  • Modifications and derivatives allowed
  • Distribution of derivatives must maintain original attribution
  • No warranty provided; use at your own risk

For complete license terms, visit: https://huggingface.co/Lightricks/wan-2.2

Citation

If you use these encoders in your research or projects, please cite:

@misc{wan22-encoders-gguf,
  title={WAN 2.2 FP16 Text Encoders (GGUF Format)},
  author={Lightricks Research},
  year={2024},
  howpublished={\url{https://huggingface.co/Lightricks/wan-2.2}},
  note={UMT5-XXL text encoder for WAN video generation}
}

Related Resources

  • Official WAN 2.2 model: https://huggingface.co/Lightricks/wan-2.2
  • Diffusers library: https://github.com/huggingface/diffusers
  • GGUF format specification: https://github.com/ggerganov/ggml/blob/master/docs/gguf.md

Support and Community

Getting Help

  • Issues: Report issues on the WAN 2.2 repository
  • Discussions: Join Hugging Face community discussions
  • Discord: Lightricks AI community server

Requirements

pip install "diffusers>=0.25.0"
pip install "transformers>=4.36.0"
pip install "accelerate>=0.25.0"
pip install "torch>=2.1.0"
pip install xformers  # Optional, for memory-efficient attention
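
After installing, a quick import check confirms which versions were picked up and whether CUDA is visible.

import torch, diffusers, transformers, accelerate

print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("diffusers", diffusers.__version__)
print("transformers", transformers.__version__)
print("accelerate", accelerate.__version__)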

Model Card Contact

For questions, issues, or collaboration inquiries:


Last Updated: October 2024
Model Version: WAN 2.2 Text Encoders FP16
Format Version: GGUF
Repository Maintainer: Lightricks Research Team
