language: sv
metrics:
- wer
tags:
- audio
- automatic-speech-recognition
- speech
- hf-asr-leaderboard
- sv
license: cc0-1.0
datasets:
- common_voice
- NST_Swedish_ASR_Database
- P4
- The_Swedish_Culturomics_Gigaword_Corpus
model-index:
- name: Wav2vec 2.0 large VoxRex Swedish (C) with 4-gram
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice 6.1
      type: common_voice
      args: sv-SE
    metrics:
    - name: Test WER
      type: wer
      value: 6.4723
KBLab's wav2vec 2.0 large VoxRex Swedish (C) with 4-gram model
Training of the acoustic model is the work of KBLab. See VoxRex-C for more details. This repo extends the acoustic model with a social media 4-gram language model for boosted performance.
Model description
VoxRex-C is extended with a 4-gram language model estimated from a subset of The Swedish Culturomics Gigaword Corpus from Språkbanken. The subset contains 40M words of social media text from 2010 to 2015.
How to use
Simple usage example with pipeline
import torch
from transformers import pipeline
# Load the model. Using GPU if available
model_name = 'viktor-enzell/wav2vec2-large-voxrex-swedish-4gram'
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
pipe = pipeline(model=model_name, device=device)
# Run inference on an audio file
output = pipe('path/to/audio.mp3')['text']
More verbose usage example with audio pre-processing
Example of transcribing 1% of the Common Voice test split. The model expects 16 kHz audio, so audio with any other sampling rate is resampled to 16 kHz first.
from transformers import Wav2Vec2ForCTC, Wav2Vec2ProcessorWithLM
from datasets import load_dataset
import torch
import torchaudio.functional as F
# Import model and processor. Using GPU if available
model_name = 'viktor-enzell/wav2vec2-large-voxrex-swedish-4gram'
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = Wav2Vec2ForCTC.from_pretrained(model_name).to(device)
processor = Wav2Vec2ProcessorWithLM.from_pretrained(model_name)
# Import and process speech data
common_voice = load_dataset('common_voice', 'sv-SE', split='test[:1%]')
def speech_file_to_array(sample):
    # Convert speech file to array and resample to 16 kHz
    sampling_rate = sample['audio']['sampling_rate']
    sample['speech'] = F.resample(torch.tensor(sample['audio']['array']), sampling_rate, 16_000)
    return sample
common_voice = common_voice.map(speech_file_to_array)
# Run inference
inputs = processor(common_voice['speech'], sampling_rate=16_000, return_tensors='pt', padding=True).to(device)
with torch.no_grad():
    logits = model(**inputs).logits
transcripts = processor.batch_decode(logits.cpu().numpy()).text
Training procedure
Text data for the n-gram model is pre-processed by removing characters not part of the wav2vec 2.0 vocabulary and uppercasing all characters. After pre-processing and storing each text sample on a new line in a text file, a KenLM model is estimated. See this tutorial for more details.
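The pre-processing described above can be sketched roughly as follows. This is a minimal illustration, not the exact script used for this model: the character set in `ALLOWED_CHARS` and the helper names `normalize_text` and `write_corpus` are assumptions, and the real set of allowed characters should be taken from the vocabulary of the acoustic model's tokenizer.

```python
import re

# Assumption: uppercase Swedish letters plus space. The actual wav2vec 2.0
# vocabulary should be read from the acoustic model's tokenizer instead.
ALLOWED_CHARS = set("ABCDEFGHIJKLMNOPQRSTUVWXYZÅÄÖ ")


def normalize_text(text):
    """Uppercase the text and drop characters outside the allowed set."""
    # Turn any whitespace (tabs, newlines) into single spaces before filtering
    upper = re.sub(r"\s+", " ", text.upper())
    kept = "".join(ch for ch in upper if ch in ALLOWED_CHARS)
    # Collapse the runs of spaces left behind by removed characters
    return re.sub(r" +", " ", kept).strip()


def write_corpus(samples, path):
    """Write one normalized, non-empty text sample per line."""
    with open(path, "w", encoding="utf-8") as f:
        for sample in samples:
            line = normalize_text(sample)
            if line:
                f.write(line + "\n")
```

The resulting text file is the kind of input from which an n-gram model can be estimated, for example with KenLM's `lmplz` tool using `-o 4` for a 4-gram model.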
Evaluation results
The model was evaluated on the full test set of Common Voice 6.1. VoxRex-C achieved a WER of 9.03% without the language model and 6.47% with the language model, a relative reduction of about 28%.
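For reference, word error rate is the word-level edit distance between reference and hypothesis transcripts, divided by the number of reference words. The sketch below is a plain-Python illustration of that definition; the function name `word_error_rate` is hypothetical, and this is not necessarily the implementation or text normalization used to produce the numbers above.

```python
def word_error_rate(references, hypotheses):
    """Corpus-level WER: total word edits divided by total reference words."""
    total_edits = 0
    total_words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_words, hyp_words = ref.split(), hyp.split()
        # Levenshtein distance over words, keeping only the previous row
        prev = list(range(len(hyp_words) + 1))
        for i, ref_word in enumerate(ref_words, start=1):
            curr = [i]
            for j, hyp_word in enumerate(hyp_words, start=1):
                cost = 0 if ref_word == hyp_word else 1
                curr.append(min(
                    prev[j] + 1,         # deletion
                    curr[j - 1] + 1,     # insertion
                    prev[j - 1] + cost,  # substitution or match
                ))
            prev = curr
        total_edits += prev[-1]
        total_words += len(ref_words)
    if total_words == 0:
        raise ValueError("references contain no words")
    return total_edits / total_words
```

Called with a list of reference sentences and the corresponding list of model transcripts, it returns a fraction; multiply by 100 to get a percentage comparable to the figures reported here.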