WAN 2.2 FP16 Text Encoders (GGUF Format)

High-precision FP16 text encoders for the WAN 2.2 (World Animated Network) video generation model in optimized GGUF format. These encoders provide enhanced text understanding and conditioning for high-quality text-to-video and image-to-video generation.

Model Description

This repository contains the UMT5-XXL text encoder component for WAN 2.2, optimized in FP16 precision using the GGUF format. The text encoder is a critical component that processes text prompts and generates embeddings that condition the video generation process.

Key Features:

  • FP16 Precision: Full 16-bit floating point precision for maximum quality
  • GGUF Format: Efficient memory-mapped format for faster loading and lower memory overhead
  • UMT5-XXL Architecture: Extra-large unified multilingual T5 model for superior text understanding
  • WAN 2.2 Compatible: Designed specifically for WAN 2.2 video generation pipeline

Capabilities:

  • Complex prompt understanding with nuanced semantic comprehension
  • Multilingual text encoding support
  • High-quality conditioning for video generation
  • Efficient inference with optimized format

Repository Contents

wan22-fp16-encoders-gguf/
└── text_encoders/
    └── umt5-xxl-encoder-f16.gguf    (10.59 GB)

Total Repository Size: ~10.6 GB

File Details

  • File: umt5-xxl-encoder-f16.gguf
  • Size: 10.59 GB
  • Format: GGUF
  • Precision: FP16
  • Purpose: UMT5-XXL text encoder
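
To fetch the encoder programmatically, the minimal sketch below uses the huggingface_hub client; the repository id "wangkanai/wan22-fp16-encoders-gguf" is the repository this card describes, and local_dir is only an example destination.

from huggingface_hub import hf_hub_download

# Download the single GGUF file from this repository (~10.6 GB)
encoder_path = hf_hub_download(
    repo_id="wangkanai/wan22-fp16-encoders-gguf",
    filename="text_encoders/umt5-xxl-encoder-f16.gguf",
    local_dir="./models",  # example destination; omit to use the default HF cache
)
print(encoder_path)  # resolved local path to the encoder file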

Hardware Requirements

Minimum Requirements

  • VRAM: 12 GB GPU memory (for encoder only)
  • System RAM: 16 GB
  • Disk Space: 11 GB free space
  • GPU: NVIDIA GPU with CUDA support (recommended)

Recommended Requirements

  • VRAM: 16+ GB GPU memory
  • System RAM: 32 GB
  • Disk Space: 20 GB free space (for encoder + model files)
  • GPU: NVIDIA RTX 3090/4090 or A100

Full WAN 2.2 Pipeline Requirements

When using with complete WAN 2.2 model:

  • VRAM: 40+ GB (encoder + transformer + VAE)
  • System RAM: 64 GB
  • Disk Space: 100+ GB for complete pipeline
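
As a quick pre-flight check against the numbers above, the sketch below reports free GPU memory and free disk space using only PyTorch and the Python standard library.

import shutil
import torch

# Compare available resources against the minimum requirements listed above
if torch.cuda.is_available():
    free_vram, total_vram = torch.cuda.mem_get_info()  # bytes, current device
    print(f"Free VRAM: {free_vram / 1e9:.1f} GB of {total_vram / 1e9:.1f} GB")
else:
    print("No CUDA device detected; CPU inference will be much slower.")

free_disk = shutil.disk_usage(".").free
print(f"Free disk space: {free_disk / 1e9:.1f} GB (the encoder alone needs ~11 GB)")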

Usage Examples

Basic Usage with WAN 2.2 Pipeline

from diffusers import WanPipeline
import torch

# Initialize pipeline with custom encoder path
pipe = WanPipeline.from_pretrained(
    "Lightricks/wan-2.2",
    text_encoder_path="E:/huggingface/wan22-fp16-encoders-gguf/text_encoders/umt5-xxl-encoder-f16.gguf",
    torch_dtype=torch.float16,
    variant="fp16"
)
pipe.to("cuda")

# Generate video from text prompt
prompt = "A cat walking on a beach at sunset, cinematic lighting, 4k quality"
video = pipe(
    prompt=prompt,
    num_frames=16,
    num_inference_steps=50,
    guidance_scale=7.5
).frames[0]  # first (and only) video in the batch

# Save video
from diffusers.utils import export_to_video
export_to_video(video, "output.mp4", fps=8)

Using with Custom Configuration

from diffusers import WanPipeline
from diffusers.utils import export_to_video
import torch

# Load with custom settings
pipe = WanPipeline.from_pretrained(
    "Lightricks/wan-2.2",
    text_encoder_path="E:/huggingface/wan22-fp16-encoders-gguf/text_encoders/umt5-xxl-encoder-f16.gguf",
    torch_dtype=torch.float16,
    use_safetensors=True
)

# Enable memory optimizations
pipe.enable_model_cpu_offload()
pipe.enable_vae_slicing()

# Generate with detailed prompt
prompt = """
A professional dancer performing contemporary dance in a modern studio,
smooth camera movement, dramatic lighting, high quality, 24fps
"""

video = pipe(
    prompt=prompt,
    negative_prompt="blurry, low quality, distorted, static",
    num_frames=24,
    height=512,
    width=512,
    num_inference_steps=75,
    guidance_scale=9.0
).frames[0]

export_to_video(video, "dance_video.mp4", fps=24)

Batch Processing Multiple Prompts

prompts = [
    "A serene mountain landscape at dawn",
    "City traffic at night with neon lights",
    "Ocean waves crashing on rocky shore"
]

videos = []
for prompt in prompts:
    video = pipe(
        prompt=prompt,
        num_frames=16,
        num_inference_steps=50
    ).frames[0]
    videos.append(video)

# Save all videos
for idx, video in enumerate(videos):
    export_to_video(video, f"video_{idx:03d}.mp4", fps=8)

Model Specifications

Text Encoder Architecture

  • Model: UMT5-XXL (Unified Multilingual T5 Extra Large)
  • Parameters: ~6 billion (encoder only)
  • Precision: FP16 (16-bit floating point)
  • Format: GGUF (GPT-Generated Unified Format)
  • Context Length: 512 tokens
  • Embedding Dimension: 4096

Format Details

  • Compatibility: works with llama.cpp-style GGUF readers and transformers GGUF loaders
  • Quantization: None (full FP16 precision maintained)
  • Memory Mapping: Enabled for efficient loading
  • Tensor Layout: Optimized for GPU inference
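
The header metadata and tensor table can be inspected directly with the gguf Python package (pip install gguf); the sketch below is illustrative and assumes the file path from the repository layout above.

from gguf import GGUFReader

# Open the file via memory mapping and read only the header and tensor index
reader = GGUFReader("text_encoders/umt5-xxl-encoder-f16.gguf")

# Key/value metadata stored in the header
for name in reader.fields:
    print(name)

# First few entries of the tensor table: name, shape, on-disk type
for tensor in reader.tensors[:5]:
    print(tensor.name, tensor.shape, tensor.tensor_type)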

Integration

  • Primary Framework: Diffusers (Hugging Face)
  • Compatible Libraries: transformers, llama.cpp, GGML
  • Pipeline: WAN 2.2 text-to-video and image-to-video
  • Device Support: CUDA, CPU (with reduced performance)

Performance Tips

Memory Optimization

# Enable CPU offloading for lower VRAM usage
pipe.enable_model_cpu_offload()

# Enable VAE slicing for memory efficiency
pipe.enable_vae_slicing()

# Use attention slicing for large resolutions
pipe.enable_attention_slicing(slice_size=1)

Speed Optimization

# Use xformers for faster attention (if installed)
pipe.enable_xformers_memory_efficient_attention()

# Reduce inference steps for faster generation
video = pipe(prompt, num_inference_steps=25)  # vs 50-75

# Use lower resolution for faster processing
video = pipe(prompt, height=256, width=256)

Quality Optimization

# Higher inference steps for better quality
video = pipe(prompt, num_inference_steps=100)

# Higher guidance scale for stronger prompt adherence
video = pipe(prompt, guidance_scale=12.0)

# Use FP16 precision for quality vs FP8/INT8
# (This encoder is already FP16)

GGUF Format Benefits

Advantages Over Standard Formats

  • Faster Loading: Memory-mapped file format reduces loading time by 2-3x
  • Lower Memory Overhead: Efficient tensor storage reduces RAM usage during loading
  • Better Compatibility: Works with multiple inference frameworks (transformers, llama.cpp)
  • Simplified Distribution: Single-file format easier to manage and distribute
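
The memory-mapping point is easy to see in practice: the sketch below maps the whole ~10.6 GB file without copying it into RAM and reads only the 4-byte GGUF magic from the header.

import mmap

with open("text_encoders/umt5-xxl-encoder-f16.gguf", "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # Only the touched pages are actually read from disk
        print(mm[:4])  # b'GGUF'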

Performance Characteristics

  • Loading Speed: ~5-10 seconds (vs 30-60 seconds for standard safetensors)
  • Memory Footprint: ~11 GB VRAM (vs ~13 GB for unoptimized formats)
  • Inference Speed: Equivalent to standard FP16 with optimized attention

License

This model is released under a custom license. Please review the license terms before use:

License Type: Other (WAN Model License)

Key Terms:

  • Research and commercial use permitted with attribution
  • Modifications and derivatives allowed
  • Distribution of derivatives must maintain original attribution
  • No warranty provided; use at your own risk

For complete license terms, visit: https://huggingface.co/Lightricks/wan-2.2

Citation

If you use these encoders in your research or projects, please cite:

@misc{wan22-encoders-gguf,
  title={WAN 2.2 FP16 Text Encoders (GGUF Format)},
  author={Lightricks Research},
  year={2024},
  howpublished={\url{https://huggingface.co/Lightricks/wan-2.2}},
  note={UMT5-XXL text encoder for WAN video generation}
}

Related Resources

  • Official WAN 2.2 model: https://huggingface.co/Lightricks/wan-2.2
  • Diffusers library: https://github.com/huggingface/diffusers
  • GGUF format specification: https://github.com/ggerganov/ggml/blob/master/docs/gguf.md

Support and Community

Getting Help

  • Issues: Report issues on the WAN 2.2 repository
  • Discussions: Join Hugging Face community discussions
  • Discord: Lightricks AI community server

Requirements

pip install "diffusers>=0.25.0"
pip install "transformers>=4.36.0"
pip install "accelerate>=0.25.0"
pip install "torch>=2.1.0"
pip install xformers  # Optional, for memory-efficient attention
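
After installing, a quick import check confirms which versions were picked up and whether CUDA is visible.

import torch, diffusers, transformers, accelerate

print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("diffusers", diffusers.__version__)
print("transformers", transformers.__version__)
print("accelerate", accelerate.__version__)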

Model Card Contact

For questions, issues, or collaboration inquiries:


Last Updated: October 2024
Model Version: WAN 2.2 Text Encoders FP16
Format Version: GGUF
Repository Maintainer: Lightricks Research Team
