WAN 2.2 FP16 Text Encoders (GGUF Format)
High-precision FP16 text encoders for the WAN 2.2 (World Animated Network) video generation model, packaged in the memory-mapped GGUF format. These encoders provide the prompt understanding and conditioning signal for high-quality text-to-video and image-to-video generation.
Model Description
This repository contains the UMT5-XXL text encoder component for WAN 2.2, optimized in FP16 precision using the GGUF format. The text encoder is a critical component that processes text prompts and generates embeddings that condition the video generation process.
Key Features:
- FP16 Precision: Full 16-bit floating point precision for maximum quality
- GGUF Format: Efficient memory-mapped format for faster loading and lower memory overhead
- UMT5-XXL Architecture: Extra-large unified multilingual T5 model for superior text understanding
- WAN 2.2 Compatible: Designed specifically for WAN 2.2 video generation pipeline
Capabilities:
- Complex prompt understanding with nuanced semantic comprehension
- Multilingual text encoding support
- High-quality conditioning for video generation
- Efficient inference with optimized format
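To make the conditioning role concrete, the sketch below encodes a prompt into the embedding sequence that conditions video generation. It is a minimal illustration using the transformers UMT5EncoderModel class with the public google/umt5-xxl checkpoint as a stand-in; it does not load the GGUF file from this repository.

import torch
from transformers import AutoTokenizer, UMT5EncoderModel

# Stand-in checkpoint for illustration; this repo ships the same
# architecture as a single GGUF file instead of safetensors shards
tokenizer = AutoTokenizer.from_pretrained("google/umt5-xxl")
encoder = UMT5EncoderModel.from_pretrained(
    "google/umt5-xxl", torch_dtype=torch.float16
).to("cuda").eval()

prompt = "A cat walking on a beach at sunset"
inputs = tokenizer(
    prompt, return_tensors="pt", padding="max_length",
    max_length=512, truncation=True,
).to("cuda")

with torch.no_grad():
    # (batch, 512 tokens, 4096 dims): the embedding sequence that
    # conditions the video generation process
    embeddings = encoder(**inputs).last_hidden_state
print(embeddings.shape)  # torch.Size([1, 512, 4096])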
Repository Contents
wan22-fp16-encoders-gguf/
└── text_encoders/
    └── umt5-xxl-encoder-f16.gguf (10.59 GB)
Total Repository Size: ~10.6 GB
File Details
| File | Size | Format | Precision | Purpose |
|---|---|---|---|---|
| umt5-xxl-encoder-f16.gguf | 10.59 GB | GGUF | FP16 | UMT5-XXL text encoder |
Hardware Requirements
Minimum Requirements
- VRAM: 12 GB GPU memory (for encoder only)
- System RAM: 16 GB
- Disk Space: 11 GB free space
- GPU: NVIDIA GPU with CUDA support (recommended)
Recommended Requirements
- VRAM: 16+ GB GPU memory
- System RAM: 32 GB
- Disk Space: 20 GB free space (for encoder + model files)
- GPU: NVIDIA RTX 3090/4090 or A100
Full WAN 2.2 Pipeline Requirements
When using with complete WAN 2.2 model:
- VRAM: 40+ GB (encoder + transformer + VAE)
- System RAM: 64 GB
- Disk Space: 100+ GB for complete pipeline
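These budgets follow from simple arithmetic: weights occupy roughly parameter count times bytes per parameter, plus activation memory and framework overhead. A back-of-the-envelope helper (the component sizes below are illustrative assumptions, not measured values):

def weights_gb(params_billions, bytes_per_param=2):
    # FP16 = 2 bytes per parameter; ignores activations and overhead
    return params_billions * bytes_per_param

# Assumed component sizes for a full text-to-video stack
components = {"text encoder": 5.3, "video transformer": 14.0, "vae": 0.2}
total = sum(weights_gb(p) for p in components.values())
print(f"weights alone: ~{total:.1f} GB; budget extra for activations")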
Usage Examples
Basic Usage with WAN 2.2 Pipeline
from diffusers import WanPipeline
from diffusers.utils import export_to_video
import torch

# Initialize the pipeline with the GGUF encoder from this repository
pipe = WanPipeline.from_pretrained(
    "Lightricks/wan-2.2",
    text_encoder_path="E:/huggingface/wan22-fp16-encoders-gguf/text_encoders/umt5-xxl-encoder-f16.gguf",
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe.to("cuda")

# Generate a video from a text prompt
prompt = "A cat walking on a beach at sunset, cinematic lighting, 4k quality"
video = pipe(
    prompt=prompt,
    num_frames=16,
    num_inference_steps=50,
    guidance_scale=7.5,
).frames[0]  # .frames is batched per prompt; take the first video

# Save the video
export_to_video(video, "output.mp4", fps=8)
Using with Custom Configuration
from diffusers import WanPipeline
from diffusers.utils import export_to_video
import torch

# Load with custom settings
pipe = WanPipeline.from_pretrained(
    "Lightricks/wan-2.2",
    text_encoder_path="E:/huggingface/wan22-fp16-encoders-gguf/text_encoders/umt5-xxl-encoder-f16.gguf",
    torch_dtype=torch.float16,
    use_safetensors=True,
)

# Enable memory optimizations
pipe.enable_model_cpu_offload()
pipe.enable_vae_slicing()

# Generate with a detailed prompt
prompt = """
A professional dancer performing contemporary dance in a modern studio,
smooth camera movement, dramatic lighting, high quality, 24fps
"""
video = pipe(
    prompt=prompt,
    negative_prompt="blurry, low quality, distorted, static",
    num_frames=24,
    height=512,
    width=512,
    num_inference_steps=75,
    guidance_scale=9.0,
).frames[0]

export_to_video(video, "dance_video.mp4", fps=24)
Batch Processing Multiple Prompts
prompts = [
    "A serene mountain landscape at dawn",
    "City traffic at night with neon lights",
    "Ocean waves crashing on rocky shore",
]

# Generate sequentially so peak VRAM stays bounded by a single video
videos = []
for prompt in prompts:
    video = pipe(
        prompt=prompt,
        num_frames=16,
        num_inference_steps=50,
    ).frames[0]
    videos.append(video)

# Save all videos
for idx, video in enumerate(videos):
    export_to_video(video, f"video_{idx:03d}.mp4", fps=8)
Model Specifications
Text Encoder Architecture
- Model: UMT5-XXL (Unified Multilingual T5 Extra Large)
- Parameters: ~5.3 billion (encoder and embeddings only; at 2 bytes per FP16 parameter this matches the 10.59 GB file)
- Precision: FP16 (16-bit floating point)
- Format: GGUF (GPT-Generated Unified Format)
- Context Length: 512 tokens
- Embedding Dimension: 4096
Format Details
- Compatibility: loads with llama.cpp and transformers GGUF readers
- Quantization: None (full FP16 precision maintained)
- Memory Mapping: Enabled for efficient loading
- Tensor Layout: Optimized for GPU inference
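These details can be checked directly against the file: every GGUF file starts with a fixed little-endian header (magic bytes GGUF, a format version, a tensor count, and a metadata key-value count). A minimal sketch using only the Python standard library, following the published GGUF spec (local path assumed):

import struct

def read_gguf_header(path):
    # GGUF header (little-endian): 4-byte magic "GGUF",
    # uint32 version, uint64 tensor_count, uint64 metadata_kv_count
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file: magic={magic!r}")
        version, tensor_count, kv_count = struct.unpack("<IQQ", f.read(20))
    return version, tensor_count, kv_count

version, tensors, kvs = read_gguf_header(
    "text_encoders/umt5-xxl-encoder-f16.gguf"
)
print(f"GGUF v{version}: {tensors} tensors, {kvs} metadata entries")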
Integration
- Primary Framework: Diffusers (Hugging Face)
- Compatible Libraries: transformers, llama.cpp, GGML
- Pipeline: WAN 2.2 text-to-video and image-to-video
- Device Support: CUDA, CPU (with reduced performance)
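For the CPU fallback noted above, a common pattern (an assumed convention, not taken from the WAN docs) is to pair the device choice with an appropriate dtype, since FP16 arithmetic is slow or unsupported on most CPUs:

import torch

# Prefer CUDA with FP16; fall back to FP32 on CPU, where FP16
# matmuls are typically slow or unsupported
if torch.cuda.is_available():
    device, dtype = "cuda", torch.float16
else:
    device, dtype = "cpu", torch.float32
print(f"running on {device} with {dtype}")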
Performance Tips
Memory Optimization
# Enable CPU offloading for lower VRAM usage
pipe.enable_model_cpu_offload()
# Enable VAE slicing for memory efficiency
pipe.enable_vae_slicing()
# Use attention slicing for large resolutions
pipe.enable_attention_slicing(slice_size=1)
Speed Optimization
# Use xformers for faster attention (if installed)
pipe.enable_xformers_memory_efficient_attention()

# Reduce inference steps for faster generation
video = pipe(prompt, num_inference_steps=25).frames[0]  # vs 50-75

# Use lower resolution for faster processing
video = pipe(prompt, height=256, width=256).frames[0]
Quality Optimization
# Higher inference steps for better quality
video = pipe(prompt, num_inference_steps=100).frames[0]

# Higher guidance scale for stronger prompt adherence
video = pipe(prompt, guidance_scale=12.0).frames[0]

# Prefer FP16 over FP8/INT8 quantization for quality
# (this encoder is already full FP16)
GGUF Format Benefits
Advantages Over Standard Formats
- Faster Loading: Memory-mapped file format reduces loading time by 2-3x
- Lower Memory Overhead: Efficient tensor storage reduces RAM usage during loading
- Better Compatibility: Works with multiple inference frameworks (transformers, llama.cpp)
- Simplified Distribution: Single-file format easier to manage and distribute
Performance Characteristics
- Loading Speed: ~5-10 seconds (vs 30-60 seconds for standard safetensors)
- Memory Footprint: ~11 GB VRAM (vs ~13 GB for unoptimized formats)
- Inference Speed: Equivalent to standard FP16 with optimized attention
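The loading-speed numbers above follow from memory mapping: the operating system maps the file into the process address space almost instantly and only reads pages from disk as tensors are first accessed. A standard-library sketch of that behavior (local path assumed):

import mmap
import time

path = "text_encoders/umt5-xxl-encoder-f16.gguf"  # assumed local path

start = time.perf_counter()
with open(path, "rb") as f:
    # Mapping is near-instant regardless of file size: no tensor data
    # is read from disk until its pages are first touched
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
magic = mm[:4]  # touching the first page faults in only a few KB
elapsed = time.perf_counter() - start
print(f"mapped {mm.size() / 1e9:.2f} GB in {elapsed * 1000:.1f} ms, magic={magic!r}")
mm.close()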
License
This model is released under a custom license. Please review the license terms before use:
License Type: Other (WAN Model License)
Key Terms:
- Research and commercial use permitted with attribution
- Modifications and derivatives allowed
- Distribution of derivatives must maintain original attribution
- No warranty provided; use at your own risk
For complete license terms, visit: https://huggingface.co/Lightricks/wan-2.2
Citation
If you use these encoders in your research or projects, please cite:
@misc{wan22-encoders-gguf,
  title={WAN 2.2 FP16 Text Encoders (GGUF Format)},
  author={Lightricks Research},
  year={2024},
  howpublished={\url{https://huggingface.co/Lightricks/wan-2.2}},
  note={UMT5-XXL text encoder for WAN video generation}
}
Related Resources
Official WAN 2.2 Resources
- Main Model: Lightricks/wan-2.2
- Documentation: WAN 2.2 Model Card
- Research Paper: WAN: World Animated Network
Diffusers Library
- Documentation: Hugging Face Diffusers
- Installation: pip install diffusers transformers accelerate
- WAN Pipeline Guide: Diffusers WAN Pipeline
GGUF Format
- GGML/GGUF Specification: ggerganov/ggml
- llama.cpp: ggerganov/llama.cpp
- Format Documentation: GGUF Format Spec
Support and Community
Getting Help
- Issues: Report issues on the WAN 2.2 repository
- Discussions: Join Hugging Face community discussions
- Discord: Lightricks AI community server
Requirements
pip install "diffusers>=0.25.0"
pip install "transformers>=4.36.0"
pip install "accelerate>=0.25.0"
pip install "torch>=2.1.0"
pip install xformers  # Optional, for memory-efficient attention

The quotes keep the shell from interpreting >= as output redirection.
Model Card Contact
For questions, issues, or collaboration inquiries:
- Email: ai-research@lightricks.com
- Website: https://www.lightricks.com/research
- Hugging Face: https://huggingface.co/Lightricks
Last Updated: October 2024
Model Version: WAN 2.2 Text Encoders FP16
Format Version: GGUF
Repository Maintainer: Lightricks Research Team