Speaker diarization community-1
This pipeline ingests mono audio sampled at 16kHz and outputs speaker diarization.
- stereo or multi-channel audio files are automatically downmixed to mono by averaging the channels.
- audio files sampled at a different rate are resampled to 16kHz automatically upon loading.
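For illustration, this preprocessing roughly amounts to the following torchaudio sketch (not the pipeline's actual internal code):
import torchaudio
import torchaudio.functional as F

# load audio (any number of channels, any sampling rate)
waveform, sample_rate = torchaudio.load("audio.wav")

# downmix to mono by averaging channels
waveform = waveform.mean(dim=0, keepdim=True)

# resample to 16kHz if needed
if sample_rate != 16000:
    waveform = F.resample(waveform, orig_freq=sample_rate, new_freq=16000)
    sample_rate = 16000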
The main improvements brought by Community-1 are:
- improved speaker assignment and counting
- simpler reconciliation with transcription timestamps, thanks to the new exclusive speaker diarization
- easy offline use (i.e. without internet connection)
- (optionally) hosted on pyannoteAI cloud
Setup
pip install pyannote.audio
- Accept user conditions
- Create an access token at hf.co/settings/tokens.
Quick start
# download the pipeline from Hugging Face
from pyannote.audio import Pipeline
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-community-1",
    token="{huggingface-token}")

# run the pipeline locally on your computer
output = pipeline("audio.wav")

# print the predicted speaker diarization
for turn, speaker in output.speaker_diarization:
    print(f"{speaker} speaks between t={turn.start:.3f}s and t={turn.end:.3f}s")
Benchmark
Out of the box, Community-1 is significantly more accurate than speaker-diarization-3.1.
We report diarization error rates (in %) on a large collection of academic benchmarks (fully automatic processing: no forgiveness collar, overlapping speech included).
Benchmark (last updated 2025-09) | legacy (3.1) | community-1 | precision-2 |
---|---|---|---|
AISHELL-4 | 12.2 | 11.7 | 11.4 |
AliMeeting (channel 1) | 24.5 | 20.3 | 15.2 |
AMI (IHM) | 18.8 | 17.0 | 12.9 |
AMI (SDM) | 22.7 | 19.9 | 15.6 |
AVA-AVD | 49.7 | 44.6 | 37.1 |
CALLHOME (part 2) | 28.5 | 26.7 | 16.6 |
DIHARD 3 (full) | 21.4 | 20.2 | 14.7 |
Ego4D (dev.) | 51.2 | 46.8 | 39.0 |
MSDWild | 25.4 | 22.8 | 17.3 |
RAMC | 22.2 | 20.8 | 10.5 |
REPERE (phase2) | 7.9 | 8.9 | 7.4 |
VoxConverse (v0.3) | 11.2 | 11.2 | 8.5 |
The Precision-2 model is even more accurate and can be tested like this:
- Create an API key on the pyannoteAI dashboard (free credits included)
- Change one line of code
from pyannote.audio import Pipeline
pipeline = Pipeline.from_pretrained(
-    'pyannote/speaker-diarization-community-1', token="{huggingface-token}")
+    'pyannote/speaker-diarization-precision-2', token="{pyannoteAI-api-key}")
diarization = pipeline("audio.wav")  # runs on pyannoteAI servers
Processing on GPU
pyannote.audio pipelines run on CPU by default.
You can send them to GPU with the following lines:
import torch
pipeline.to(torch.device("cuda"))
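If a GPU may or may not be available, a defensive variant (a generic PyTorch sketch, nothing specific to pyannote.audio) falls back to CPU:
import torch

# use GPU when available, otherwise stay on CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
pipeline.to(device)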
Processing from memory
Pre-loading audio files in memory may result in faster processing:
import torchaudio

# pre-load the audio file in memory
waveform, sample_rate = torchaudio.load("audio.wav")
output = pipeline({"waveform": waveform, "sample_rate": sample_rate})
Monitoring progress
Hooks are available to monitor the progress of the pipeline:
from pyannote.audio.pipelines.utils.hook import ProgressHook

with ProgressHook() as hook:
    output = pipeline("audio.wav", hook=hook)
Controlling the number of speakers
In case the number of speakers is known in advance, one can use the num_speakers option:
output = pipeline("audio.wav", num_speakers=2)
One can also provide lower and/or upper bounds on the number of speakers using the min_speakers and max_speakers options:
output = pipeline("audio.wav", min_speakers=2, max_speakers=5)
Exclusive speaker diarization
The Community-1 pretrained pipeline returns a new exclusive speaker diarization, on top of the regular speaker diarization, available as output.exclusive_speaker_diarization.
This feature, backported from our latest commercial model, simplifies the reconciliation between fine-grained speaker diarization timestamps and (sometimes less precise) transcription timestamps.
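For illustration, a minimal sketch of how it can be consumed (assuming the exclusive output iterates like the regular one shown in the quick start):
# exclusive diarization assigns at most one speaker to any point in time,
# which makes mapping word-level transcription timestamps to speakers easier
for turn, speaker in output.exclusive_speaker_diarization:
    print(f"{speaker} speaks between t={turn.start:.3f}s and t={turn.end:.3f}s")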
Offline use
- In the terminal, download the pipeline to disk:
# make sure git-lfs is installed (https://git-lfs.com)
git lfs install
# create a directory on disk
mkdir /path/to/directory
# when prompted for a password, use an access token with write permissions.
# generate one from your settings: https://huggingface.co/settings/tokens
git clone https://hf.co/pyannote/speaker-diarization-community-1 /path/to/directory/pyannote-speaker-diarization-community-1
- In Python, use the pipeline without an internet connection:
# load pipeline from disk (works without internet connection)
from pyannote.audio import Pipeline
pipeline = Pipeline.from_pretrained('/path/to/directory/pyannote-speaker-diarization-community-1')
# run the pipeline locally on your computer
output = pipeline("audio.wav")
Citations
- Speaker segmentation model
@inproceedings{Plaquet23,
author={Alexis Plaquet and Hervé Bredin},
title={{Powerset multi-class cross entropy loss for neural speaker diarization}},
year={2023},
booktitle={Proc. INTERSPEECH 2023},
}
- Speaker embedding model
@inproceedings{Wang2023,
title={Wespeaker: A research and production oriented speaker embedding learning toolkit},
author={Wang, Hongji and Liang, Chengdong and Wang, Shuai and Chen, Zhengyang and Zhang, Binbin and Xiang, Xu and Deng, Yanlei and Qian, Yanmin},
booktitle={ICASSP 2023, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
pages={1--5},
year={2023},
organization={IEEE}
}
- Speaker clustering
@article{Landini2022,
author={Landini, Federico and Profant, J{\'a}n and Diez, Mireia and Burget, Luk{\'a}{\v{s}}},
title={{Bayesian HMM clustering of x-vector sequences (VBx) in speaker diarization: theory, implementation and analysis on standard tasks}},
year={2022},
journal={Computer Speech \& Language},
}
Acknowledgment
Training and tuning made possible thanks to GENCI on the Jean Zay supercomputer.