WavCochV8192 — 8,192-code speech tokenizer (cochlear tokens)

WavCochV8192 is a biologically inspired, learned audio quantizer that maps a raw waveform to discrete "cochlear tokens". It is the tokenizer for the AuriStream autoregressive speech/language model (e.g., TuKoResearch/AuriStream1B_librilight_ckpt500k). The model is trained on LibriSpeech960. It encodes audio into a time–frequency representation (cochleagram; Feather et al., 2023, Nat Neuro) and reads out 8,192-way discrete codes through a low-bit latent bottleneck (LFQ). These tokens can be fed to a transformer LM for representation learning and next-token prediction (speech continuation).
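
To make the "low-bit latent bottleneck" concrete: assuming LFQ here refers to lookup-free quantization, each latent dimension is binarized by its sign and the resulting bits are read as an integer, so 13 binary dimensions would give 2^13 = 8,192 codes. The sketch below illustrates that idea only. The function name, the bit ordering, and the 13-dimension figure are assumptions for illustration, not the model's actual implementation.

```python
from typing import Sequence

NUM_BITS = 13  # assumption: 2**13 == 8192 codes, matching the vocabulary size


def lfq_index(latent: Sequence[float]) -> int:
    """Map a real-valued latent vector to a code index by sign-binarizing it.

    Bit i is 1 when latent[i] > 0, else 0. The bit ordering (least significant
    bit first) is an arbitrary choice for this sketch.
    """
    if len(latent) != NUM_BITS:
        raise ValueError(f"expected {NUM_BITS} latent dimensions, got {len(latent)}")
    index = 0
    for i, value in enumerate(latent):
        if value > 0:
            index |= 1 << i
    return index


# All-negative latents map to code 0; all-positive latents map to the last code.
lowest = lfq_index([-1.0] * NUM_BITS)
highest = lfq_index([1.0] * NUM_BITS)
print(lowest, highest)  # 0 8191
```

Because the code index is read directly from the signs, no codebook lookup or nearest-neighbor search is needed in this scheme.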

API at a glance

  • Input: mono waveform at 16 kHz as a float32 PyTorch tensor of shape (B, 1, T)
  • Output: token IDs of shape (B, L), returned in a dictionary under the key "input_ids"
  • Implemented as a transformers custom model; load with trust_remote_code=True.

Installation

pip install -U torch torchaudio transformers

Quickstart — Quantize a waveform into cochlear tokens

import torch, torchaudio
from transformers import AutoModel

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the quantizer
quantizer = AutoModel.from_pretrained(
    "TuKoResearch/WavCochV8192", trust_remote_code=True
).to(device).eval()

# Load & prep audio (mono, 16 kHz)
wav, sr = torchaudio.load("sample.wav")
if wav.size(0) > 1:  # stereo -> mono
    wav = wav.mean(dim=0, keepdim=True)
if sr != 16_000:
    wav = torchaudio.transforms.Resample(sr, 16_000)(wav)
    sr = 16_000

# Forward pass — returns a dict with "input_ids" = (B, L)
with torch.no_grad():
    out = quantizer(wav.unsqueeze(0).to(device))   # (1, 1, T) -> dict
    token_ids = out["input_ids"]                   # LongTensor (1, L)

print("Token IDs shape:", token_ids.shape)
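
The card does not state how many tokens are produced per second of audio. One way to find out is to compare the token count with the clip duration. The helper below is a hypothetical convenience function, not part of the model's API, and the numbers in the example call are made up for illustration.

```python
def tokens_per_second(num_tokens: int, num_samples: int, sample_rate: int = 16_000) -> float:
    """Return the average number of tokens emitted per second of audio."""
    if num_samples <= 0 or sample_rate <= 0:
        raise ValueError("num_samples and sample_rate must be positive")
    duration_s = num_samples / sample_rate
    return num_tokens / duration_s


# Illustrative numbers only: 300 tokens for a 3 s clip (48,000 samples at 16 kHz).
rate = tokens_per_second(num_tokens=300, num_samples=48_000)
print(rate)  # 100.0
```

With the Quickstart variables, this could be called as tokens_per_second(token_ids.shape[1], wav.shape[-1], sr).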

Intended uses & limitations

  • Uses: tokenization for speech LM training; compact storage or streaming of speech as discrete IDs. The tokens are loosely inspired by human biology.
  • Limitations: trained only on English speech, so it may perform worse on other languages and on non-speech sounds.
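
For the compact-storage use case, note that every ID is below 8,192 and therefore fits in an unsigned 16-bit integer. The sketch below serializes a list of token IDs to bytes with Python's standard struct module. The function names are hypothetical, and a real pipeline might prefer a tighter 13-bit packing or a tensor file format.

```python
import struct
from typing import List, Sequence

VOCAB_SIZE = 8192  # from the model name: 8,192 codes


def pack_tokens(token_ids: Sequence[int]) -> bytes:
    """Serialize token IDs as little-endian unsigned 16-bit integers."""
    if any(not 0 <= t < VOCAB_SIZE for t in token_ids):
        raise ValueError("token ID outside [0, 8192)")
    return struct.pack(f"<{len(token_ids)}H", *token_ids)


def unpack_tokens(data: bytes) -> List[int]:
    """Inverse of pack_tokens: read back the list of token IDs."""
    if len(data) % 2 != 0:
        raise ValueError("byte length must be a multiple of 2")
    return list(struct.unpack(f"<{len(data) // 2}H", data))


ids = [0, 17, 8191]
blob = pack_tokens(ids)
print(len(blob), unpack_tokens(blob) == ids)  # 6 True
```

Starting from the Quickstart output, token_ids[0].tolist() yields a plain list of integers suitable for pack_tokens.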

Citation

If you use this tokenizer, please cite:

@inproceedings{tuckute2025cochleartokens,
  title     = {Representing Speech Through Autoregressive Prediction of Cochlear Tokens},
  author    = {Greta Tuckute and Klemen Kotar and Evelina Fedorenko and Daniel Yamins},
  booktitle = {Interspeech 2025},
  year      = {2025},
  pages     = {2180--2184},
  doi       = {10.21437/Interspeech.2025-2044},
  issn      = {2958-1796}
}

Model size: 11.1M params (Safetensors; tensor types I64, F32)