Fine-tuned whisper-medium model for ASR in French

This model is a fine-tuned version of openai/whisper-medium, trained on the French (fr) subset of the mozilla-foundation/common_voice_11_0 dataset. When using the model, make sure that your speech input is sampled at 16 kHz. The model also predicts casing and punctuation.
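
A quick way to check a file before transcription is to read the sample rate from its header. The sketch below uses Python's standard `wave` module, so it only covers uncompressed PCM WAV files. The helper names `wav_sample_rate` and `needs_resampling` are illustrative and are not part of this model or of transformers.

```python
import wave

EXPECTED_SAMPLE_RATE = 16_000  # Hz, the input rate stated in this card


def wav_sample_rate(path: str) -> int:
    """Return the sample rate stored in the header of a PCM WAV file."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path: str, expected: int = EXPECTED_SAMPLE_RATE) -> bool:
    """Return True when the file's sample rate differs from the expected one."""
    return wav_sample_rate(path) != expected
```

If `needs_resampling` returns `True`, resample the audio to 16 kHz before passing it to the model, for instance with the torchaudio resampler used in the low-level API example in the Usage section.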

Performance

Below are the WERs of the pre-trained models on Common Voice 9.0, Multilingual LibriSpeech (MLS), VoxPopuli, and Fleurs. These results are reported in the original paper.

Model	Common Voice 9.0	MLS	VoxPopuli	Fleurs
openai/whisper-small	22.7	16.2	15.7	15.0
openai/whisper-medium	16.0	8.9	12.2	8.7
openai/whisper-large	14.7	8.9	11.0	7.7
openai/whisper-large-v2	13.9	7.3	11.4	8.3

Below are the WERs of the fine-tuned models on Common Voice 11.0, Multilingual LibriSpeech (MLS), VoxPopuli, and Fleurs. Note that these evaluation datasets have been filtered and preprocessed to contain only French alphabet characters, with all punctuation removed except apostrophes. The results in the table are reported as WER (greedy search) / WER (beam search with beam width 5).
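
The exact preprocessing script is not given in this card, so the following is only a minimal sketch of what such a normalization and WER computation could look like. The character set in `FRENCH_CHARS`, the lowercasing step, and the helper names `normalize_fr` and `wer` are assumptions made for illustration, not the author's evaluation code.

```python
import re

# Assumed French alphabet: ASCII letters plus common accented letters and ligatures.
FRENCH_CHARS = "a-zàâäçéèêëîïôöùûüÿæœ"
# Matches anything that is not a French letter, an apostrophe, or whitespace.
_DISALLOWED = re.compile(rf"[^{FRENCH_CHARS}'\s]")


def normalize_fr(text: str) -> str:
    """Lowercase (an assumption), unify apostrophes, drop other punctuation."""
    text = text.lower().replace("’", "'")
    text = _DISALLOWED.sub(" ", text)
    # Collapse the whitespace left behind by removed characters.
    return re.sub(r"\s+", " ", text).strip()


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)
```

Because the model predicts casing and punctuation, applying the same normalization to both the reference and the predicted sentence before calling `wer` keeps the comparison consistent.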

Model	Common Voice 11.0	MLS	VoxPopuli	Fleurs
bofenghuang/whisper-small-cv11-french	11.76 / 10.99	9.65 / 8.91	14.45 / 13.66	10.76 / 9.83
bofenghuang/whisper-medium-cv11-french	9.03 / 8.54	6.34 / 5.86	11.64 / 11.35	7.13 / 6.85
bofenghuang/whisper-medium-french	9.03 / 8.73	4.60 / 4.44	9.53 / 9.46	6.33 / 5.94
bofenghuang/whisper-large-v2-cv11-french	8.05 / 7.67	5.56 / 5.28	11.50 / 10.69	5.42 / 5.05
bofenghuang/whisper-large-v2-french	8.15 / 7.83	4.20 / 4.03	9.10 / 8.66	5.22 / 4.98
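
To read the "greedy / beam" cells more easily, the relative gain from beam search can be computed per dataset. The snippet below is a small sketch that does this for the whisper-medium-cv11-french row of the table above. The helper names `parse_wer_cell` and `relative_reduction` are hypothetical and introduced only for this illustration.

```python
def parse_wer_cell(cell: str) -> tuple[float, float]:
    """Split a 'greedy / beam' table cell into two floats."""
    greedy, beam = (float(part) for part in cell.split("/"))
    return greedy, beam


def relative_reduction(greedy: float, beam: float) -> float:
    """Relative WER reduction, in percent, when moving from greedy to beam search."""
    return 100.0 * (greedy - beam) / greedy


# Row for bofenghuang/whisper-medium-cv11-french, copied from the table above.
row = {
    "Common Voice 11.0": "9.03 / 8.54",
    "MLS": "6.34 / 5.86",
    "VoxPopuli": "11.64 / 11.35",
    "Fleurs": "7.13 / 6.85",
}

for dataset, cell in row.items():
    greedy, beam = parse_wer_cell(cell)
    print(f"{dataset}: {relative_reduction(greedy, beam):.1f}% relative WER reduction")
```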

Usage

Inference with 🤗 Pipeline

import torch

from datasets import load_dataset
from transformers import pipeline

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Load pipeline
pipe = pipeline("automatic-speech-recognition", model="bofenghuang/whisper-medium-cv11-french", device=device)

# NB: set forced_decoder_ids so that generation is forced to transcribe in French
pipe.model.config.forced_decoder_ids = pipe.tokenizer.get_decoder_prompt_ids(language="fr", task="transcribe")

# Load data
ds_mcv_test = load_dataset("mozilla-foundation/common_voice_11_0", "fr", split="test", streaming=True)
test_segment = next(iter(ds_mcv_test))
waveform = test_segment["audio"]

# Run
generated_sentences = pipe(waveform, max_new_tokens=225)["text"]  # greedy
# generated_sentences = pipe(waveform, max_new_tokens=225, generate_kwargs={"num_beams": 5})["text"]  # beam search

# Normalise predicted sentences if necessary

Inference with 🤗 low-level APIs

import torch
import torchaudio

from datasets import load_dataset
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Load model
model = AutoModelForSpeechSeq2Seq.from_pretrained("bofenghuang/whisper-medium-cv11-french").to(device)
processor = AutoProcessor.from_pretrained("bofenghuang/whisper-medium-cv11-french", language="french", task="transcribe")

# NB: set forced_decoder_ids so that generation is forced to transcribe in French
model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(language="fr", task="transcribe")

# Sampling rate expected by the model (16_000 Hz)
model_sample_rate = processor.feature_extractor.sampling_rate

# Load data
ds_mcv_test = load_dataset("mozilla-foundation/common_voice_11_0", "fr", split="test", streaming=True)
test_segment = next(iter(ds_mcv_test))
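# Cast to float32 in case the decoded array is float64 (the default resampling kernel is float32)
waveform = torch.from_numpy(test_segment["audio"]["array"]).float()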
sample_rate = test_segment["audio"]["sampling_rate"]

# Resample
if sample_rate != model_sample_rate:
    resampler = torchaudio.transforms.Resample(sample_rate, model_sample_rate)
    waveform = resampler(waveform)

# Extract input features
inputs = processor(waveform, sampling_rate=model_sample_rate, return_tensors="pt")
input_features = inputs.input_features
input_features = input_features.to(device)

# Generate
generated_ids = model.generate(inputs=input_features, max_new_tokens=225)  # greedy
# generated_ids = model.generate(inputs=input_features, max_new_tokens=225, num_beams=5)  # beam search

# Detokenize
generated_sentences = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

# Normalise predicted sentences if necessary

Evaluation results

Self-reported WERs on the test sets:

Dataset	WER (Greedy)	WER (Beam 5)
Common Voice 11.0	9.03	8.54
Multilingual LibriSpeech (MLS)	6.34	5.86
VoxPopuli	11.64	11.35
Fleurs	7.13	6.85
African Accented French	8.88	7.02
