Fine-tuned wav2vec2-FR-7K-large model for ASR in French

This model is a fine-tuned version of LeBenchmark/wav2vec2-FR-7K-large, trained on a composite dataset of over 2200 hours of French speech audio, using the train and validation splits of Common Voice 11.0, Multilingual LibriSpeech, VoxPopuli, Multilingual TEDx, MediaSpeech, and African Accented French. When using the model, make sure that your speech input is also sampled at 16 kHz.

Usage

To use on a local audio file with the language model

import torch
import torchaudio

from transformers import AutoModelForCTC, Wav2Vec2ProcessorWithLM

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model = AutoModelForCTC.from_pretrained("bhuang/asr-wav2vec2-french").to(device)
processor_with_lm = Wav2Vec2ProcessorWithLM.from_pretrained("bhuang/asr-wav2vec2-french")
model_sample_rate = processor_with_lm.feature_extractor.sampling_rate

wav_path = "example.wav"  # path to your audio file
waveform, sample_rate = torchaudio.load(wav_path)
waveform = waveform.mean(dim=0)  # downmix to mono (also handles multi-channel files)

# resample
if sample_rate != model_sample_rate:
    resampler = torchaudio.transforms.Resample(sample_rate, model_sample_rate)
    waveform = resampler(waveform)

# normalize
input_dict = processor_with_lm(waveform, sampling_rate=model_sample_rate, return_tensors="pt")

with torch.inference_mode():
    logits = model(input_dict.input_values.to(device)).logits

# decode with the language model (beam search over the logits)
predicted_sentence = processor_with_lm.batch_decode(logits.cpu().numpy()).text[0]

To use on a local audio file without the language model

import torch
import torchaudio

from transformers import AutoModelForCTC, Wav2Vec2Processor

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model = AutoModelForCTC.from_pretrained("bhuang/asr-wav2vec2-french").to(device)
processor = Wav2Vec2Processor.from_pretrained("bhuang/asr-wav2vec2-french")
model_sample_rate = processor.feature_extractor.sampling_rate

wav_path = "example.wav"  # path to your audio file
waveform, sample_rate = torchaudio.load(wav_path)
waveform = waveform.mean(dim=0)  # downmix to mono (also handles multi-channel files)

# resample
if sample_rate != model_sample_rate:
    resampler = torchaudio.transforms.Resample(sample_rate, model_sample_rate)
    waveform = resampler(waveform)

# normalize
input_dict = processor(waveform, sampling_rate=model_sample_rate, return_tensors="pt")

with torch.inference_mode():
    logits = model(input_dict.input_values.to(device)).logits

# decode
predicted_ids = torch.argmax(logits, dim=-1)
predicted_sentence = processor.batch_decode(predicted_ids)[0]
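
The decode step above takes the most likely token per audio frame and lets the processor turn those ids into text. As a rough illustration of what greedy CTC decoding does with such ids, the sketch below merges consecutive repeats and drops the blank token. The function name, the toy vocabulary, and the blank id of 0 are assumptions made for this example; the real vocabulary and blank id come from the model's tokenizer.

```python
from itertools import groupby


def ctc_greedy_collapse(frame_ids, blank_id=0):
    """Collapse per-frame argmax ids into a label sequence (greedy CTC)."""
    # groupby merges runs of identical consecutive ids; blanks are then dropped.
    return [token for token, _ in groupby(frame_ids) if token != blank_id]


# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {0: "<pad>", 1: "o", 2: "u", 3: "i"}

frame_ids = [0, 1, 1, 0, 2, 2, 2, 0, 0, 3, 3]  # one id per audio frame
collapsed = ctc_greedy_collapse(frame_ids)
text = "".join(toy_vocab[i] for i in collapsed)
```

A blank between two identical ids keeps them as separate labels, which is how CTC can emit doubled letters.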

Evaluation

To evaluate on mozilla-foundation/common_voice_11_0

python eval.py \
  --model_id "bhuang/asr-wav2vec2-french" \
  --dataset "mozilla-foundation/common_voice_11_0" \
  --config "fr" \
  --split "test" \
  --log_outputs \
  --outdir "outputs/results_mozilla-foundatio_common_voice_11_0_with_lm"

To evaluate on speech-recognition-community-v2/dev_data

python eval.py \
  --model_id "bhuang/asr-wav2vec2-french" \
  --dataset "speech-recognition-community-v2/dev_data" \
  --config "fr" \
  --split "validation" \
  --chunk_length_s 30.0 \
  --stride_length_s 5.0 \
  --log_outputs \
  --outdir "outputs/results_speech-recognition-community-v2_dev_data_with_lm"
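
The --chunk_length_s and --stride_length_s options suggest that long recordings are transcribed in overlapping windows. The sketch below shows one simplified way such windows could be laid out, assuming each 30 s chunk shares 5 s of context with its neighbour on each side. The function name and the exact boundary handling are assumptions for illustration, not the logic used by eval.py or the transformers pipeline.

```python
def chunk_spans(num_samples, sample_rate=16_000, chunk_length_s=30.0, stride_length_s=5.0):
    """Return (start, end) sample indices of overlapping chunks (simplified sketch)."""
    chunk_len = int(round(chunk_length_s * sample_rate))
    stride = int(round(stride_length_s * sample_rate))
    # Each new chunk advances by the chunk length minus the left and right overlap.
    step = chunk_len - 2 * stride
    if step <= 0:
        raise ValueError("chunk_length_s must be more than twice stride_length_s")

    spans = []
    start = 0
    while True:
        end = min(start + chunk_len, num_samples)
        spans.append((start, end))
        if end == num_samples:
            break
        start += step
    return spans


# A 70-second recording at 16 kHz
spans = chunk_spans(70 * 16_000)
```

With these settings, consecutive chunks start 20 s apart, so every boundary region is seen with context on both sides.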

Model size: 315M params (Safetensors, tensor type F32)

Evaluation results

Test WER, self-reported:

Dataset                          Without LM   With LM
Common Voice 11.0                11.44        9.66
Multilingual LibriSpeech (MLS)   5.93         5.13
VoxPopuli                        9.33         8.51
African Accented French          16.22        15.39
Robust Speech Event - Dev Data   16.56        12.96
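
For readers unfamiliar with the metric, word error rate (WER) is the word-level edit distance between the reference and the hypothesis, divided by the number of reference words. The sketch below is a minimal pure-Python version. The function name is hypothetical, and it applies no text normalization, so it is not necessarily how the scores above were computed.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of six reference words
wer = word_error_rate("le chat dort sur le canapé", "le chat dort sur canapé")
```

The function returns a fraction; assuming the figures reported above are percentages, multiply by 100 to compare.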
