VibeVoice-Large-Q8 - Selective 8-bit Quantization

The first 8-bit VibeVoice model that actually works


🤗 Model • 💻 ComfyUI • 📖 Docs


🎯 Why This Model is Different

If you've tried other 8-bit quantized VibeVoice models, you probably got nothing but static noise. This one actually works.

The secret? Selective quantization: I only quantized the language model (the most robust part), while keeping audio-critical components (diffusion head, VAE, connectors) at full precision.

Results

  • ✅ Perfect audio, identical to the original model
  • ✅ 11.6 GB instead of 18.7 GB (-38%)
  • ✅ Uses ~12 GB VRAM instead of 20 GB
  • ✅ Works on 12 GB GPUs (RTX 3060, 4070 Ti, etc.)

🚨 The Problem with Other 8-bit Models

Most 8-bit models you'll find online quantize everything aggressively. Result: audio components get quantized → numerical errors propagate → audio = pure noise.
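To make the failure mode concrete, here is a minimal sketch of absmax 8-bit quantization in plain Python. It is a generic illustration, not the scheme used by any particular VibeVoice quantization; the function names and sample weights are hypothetical. Each value picks up a rounding error of up to half a quantization step, and a single large outlier stretches that step for every other value in the tensor.

```python
def quantize_absmax(values):
    """Map floats to int8 codes in [-127, 127] using one absmax scale."""
    # Fall back to 1.0 so an all-zero tensor does not divide by zero.
    scale = (max(abs(v) for v in values) / 127.0) or 1.0
    codes = [round(v / scale) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from int8 codes."""
    return [c * scale for c in codes]


# Hypothetical weights: mostly small values plus one outlier.
weights = [0.012, -0.034, 0.051, -0.007, 2.5]
codes, scale = quantize_absmax(weights)
restored = dequantize(codes, scale)
errors = [abs(w - r) for w, r in zip(weights, restored)]

print("step size:", scale)
print("max abs error:", max(errors))
```

In this toy example the smallest weights are about the size of the step itself, so they come back badly distorted. That kind of per-value error is tolerable in some layers and audible in others.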


✅ The Solution: Selective Quantization

I only quantized what can be safely quantized without losing quality.

Result: 52% of parameters quantized, 48% at full precision = perfect audio quality.
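One way to express that split in code is a name-based filter over the model's modules. The sketch below is an assumption about how such a filter could look, not the script used to build this model: the keyword list and the example module names are hypothetical, chosen to mirror the components named earlier (diffusion head, VAE, connectors). When loading through transformers with bitsandbytes, the modules kept at full precision would typically be passed via the `llm_int8_skip_modules` argument of `BitsAndBytesConfig`.

```python
# Hypothetical name fragments for the audio-critical components.
# Real module names in the checkpoint may differ.
KEEP_FULL_PRECISION = ("diffusion_head", "vae", "connector")


def should_quantize(module_name: str) -> bool:
    """Return True if the module may be quantized under this policy."""
    name = module_name.lower()
    return not any(key in name for key in KEEP_FULL_PRECISION)


def split_modules(module_names):
    """Partition module names into (quantized, full_precision) lists."""
    quantized = [n for n in module_names if should_quantize(n)]
    full_precision = [n for n in module_names if not should_quantize(n)]
    return quantized, full_precision


# Example module names, invented for illustration only.
names = [
    "language_model.layers.0.self_attn.q_proj",
    "language_model.layers.0.mlp.up_proj",
    "prediction.diffusion_head.layers.0",
    "acoustic_vae.decoder.conv1",
    "acoustic_connector.fc1",
]
quantized, full_precision = split_modules(names)
print("quantize:", quantized)
print("keep full precision:", full_precision)
```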


📊 Quick Comparison

| Model | Size | Audio Quality | Status |
|---|---|---|---|
| Original VibeVoice | 18.7 GB | ⭐⭐⭐⭐⭐ | Full precision |
| Other 8-bit models | 10.6 GB | 💥 NOISE | ❌ Don't work |
| This model | 11.6 GB | ⭐⭐⭐⭐⭐ | ✅ Perfect |

+1.0 GB vs other 8-bit models = perfect audio instead of noise. Worth it.
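The size figures can be checked with a few lines of arithmetic. The snippet below only restates the numbers from the comparison; the variable and function names are illustrative.

```python
# Sizes in GB, taken from the comparison above.
ORIGINAL_GB = 18.7
OTHER_8BIT_GB = 10.6
THIS_MODEL_GB = 11.6


def percent_change(new, old):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


reduction = percent_change(THIS_MODEL_GB, ORIGINAL_GB)
overhead_gb = THIS_MODEL_GB - OTHER_8BIT_GB

print(f"vs original: {reduction:.0f}%")          # -38%
print(f"vs other 8-bit: +{overhead_gb:.1f} GB")  # +1.0 GB
```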


💻 How to Use It

With Transformers

from transformers import AutoModelForCausalLM, AutoProcessor
import torch
import scipy.io.wavfile as wavfile

# Load model
model = AutoModelForCausalLM.from_pretrained(
    "FabioSarracino/VibeVoice-Large-Q8",
    device_map="auto",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)

processor = AutoProcessor.from_pretrained(
    "FabioSarracino/VibeVoice-Large-Q8",
    trust_remote_code=True
)

# Generate audio
text = "Hello, this is VibeVoice speaking."
inputs = processor(text, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=None)

# Save (cast to float32 first: NumPy has no bfloat16 dtype)
audio = output.speech_outputs[0].float().cpu().numpy()
wavfile.write("output.wav", 24000, audio)
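Some players handle 32-bit float WAV files poorly. If that is a concern, the samples can be converted to 16-bit PCM before saving. The sketch below uses only the standard library and assumes the audio is a mono sequence of floats in [-1.0, 1.0] at the 24000 Hz rate used above; `write_pcm16` is a hypothetical helper, not part of VibeVoice.

```python
import struct
import wave


def write_pcm16(path, samples, sample_rate=24000):
    """Write mono float samples in [-1.0, 1.0] as a 16-bit PCM WAV file."""
    frames = bytearray()
    for s in samples:
        clipped = max(-1.0, min(1.0, float(s)))  # guard against overshoot
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(frames))
```

With the NumPy array from the snippet above, a call could look like `write_pcm16("output_pcm16.wav", audio.flatten().tolist())`.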

With ComfyUI (recommended)

  1. Install the custom node:

    cd ComfyUI/custom_nodes
    git clone https://github.com/Enemyx-net/VibeVoice-ComfyUI
    
  2. Download this model to ComfyUI/models/vibevoice/

  3. Restart ComfyUI and use it normally!
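Step 2 can also be scripted. The sketch below uses `snapshot_download` from the `huggingface_hub` package to fetch the files; the per-model subfolder name under `models/vibevoice/` is an assumption, so check the custom node's documentation for the layout it expects. `model_dir` and `download_model` are hypothetical helper names.

```python
from pathlib import Path

REPO_ID = "FabioSarracino/VibeVoice-Large-Q8"


def model_dir(comfyui_root):
    """Target folder under ComfyUI/models/vibevoice/ (subfolder name assumed)."""
    return Path(comfyui_root) / "models" / "vibevoice" / REPO_ID.split("/")[-1]


def download_model(comfyui_root="ComfyUI"):
    """Download the model files into the ComfyUI models folder."""
    # Imported lazily so model_dir() works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    target = model_dir(comfyui_root)
    target.mkdir(parents=True, exist_ok=True)
    snapshot_download(repo_id=REPO_ID, local_dir=str(target))
    return target


# Show where the files would go; call download_model() to fetch them.
print(model_dir("ComfyUI"))
```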


💾 System Requirements

Minimum

  • VRAM: 12 GB
  • RAM: 16 GB
  • GPU: NVIDIA with CUDA (required)
  • Storage: 12 GB free (the model files total 11.6 GB)

Recommended

  • VRAM: 16+ GB
  • RAM: 32 GB
  • GPU: RTX 3090/4090, A5000 or better

⚠️ Not supported: CPU, Apple Silicon (MPS), AMD GPUs


⚠️ Limitations

  1. Requires NVIDIA GPU with CUDA - won't work on CPU or Apple Silicon
  2. Inference only - don't use for fine-tuning
  3. Requires:
    • transformers>=4.51.3
    • bitsandbytes>=0.43.0
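The version requirements can be verified from Python before loading the model. This sketch uses only the standard library; the parser is deliberately simplified (it compares leading numeric components and ignores pre-release suffixes), and the helper names are illustrative.

```python
import re
from importlib import metadata

MINIMUM_VERSIONS = {"transformers": "4.51.3", "bitsandbytes": "0.43.0"}


def version_tuple(version):
    """Parse leading numeric components, e.g. '4.52.0.dev0' -> (4, 52, 0)."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if not match:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """True if the installed version is at least the minimum."""
    return version_tuple(installed) >= version_tuple(minimum)


def check_installed(requirements=MINIMUM_VERSIONS):
    """Return {package: status string} for each required package."""
    report = {}
    for name, minimum in requirements.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = f"missing (need >= {minimum})"
            continue
        status = "ok" if meets_minimum(installed, minimum) else f"need >= {minimum}"
        report[name] = f"{installed} ({status})"
    return report


print(check_installed())
```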

🆚 When to Use This Model

✅ Use this 8-bit if:

  • You have 12-16 GB VRAM
  • You want maximum quality with reduced size
  • You need a production-ready model
  • You want the best size/quality balance

Use full precision (18.7 GB) if:

  • You have plenty of VRAM (24+ GB)
  • You're doing research requiring absolute precision

Use 4-bit NF4 (~6.6 GB) if:

  • You only have 8-10 GB VRAM
  • You can accept a small quality trade-off
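The guidance above can be condensed into a small lookup. The VRAM ranges listed leave gaps (10-12 GB and 16-24 GB), so this sketch assumes the largest variant whose lower bound fits is the right pick; `pick_variant` is a hypothetical helper and the returned labels are just descriptive strings.

```python
def pick_variant(vram_gb):
    """Suggest a VibeVoice variant for the given amount of VRAM in GB."""
    if vram_gb >= 24:
        return "full precision (18.7 GB)"
    if vram_gb >= 12:
        return "8-bit (this model, 11.6 GB)"
    if vram_gb >= 8:
        return "4-bit NF4 (~6.6 GB)"
    return None  # below the smallest range listed above


for vram in (8, 12, 16, 24):
    print(vram, "GB ->", pick_variant(vram))
```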

🔧 Troubleshooting

"OutOfMemoryError" during loading

  • Close other GPU applications
  • Use device_map="auto"
  • Reduce batch size to 1

"BitsAndBytes not found"

pip install "bitsandbytes>=0.43.0"

Audio sounds distorted

This shouldn't happen! If it does:

  1. Verify you downloaded the correct model
  2. Update transformers: pip install --upgrade transformers
  3. Check CUDA: torch.cuda.is_available() should return True

📚 Citation

@misc{vibevoice-q8-2025,
  title={VibeVoice-Large-Q8: Selective 8-bit Quantization for Audio Quality},
  author={Fabio Sarracino},
  year={2025},
  url={https://huggingface.co/FabioSarracino/VibeVoice-Large-Q8}
}

Original Model

@misc{vibevoice2024,
  title={VibeVoice: High-Quality Text-to-Speech with Large Language Models},
  author={Microsoft Research},
  year={2024},
  url={https://github.com/microsoft/VibeVoice}
}

📜 License

MIT License.


🀝 Support

If this model helped you, leave a ⭐ on GitHub!


Created by Fabio Sarracino

🤗 HuggingFace • 💻 GitHub