---
license: apache-2.0
base_model: facebook/wav2vec2-large-xlsr-53
tags:
- generated_from_trainer
datasets:
- common_voice_13_0
metrics:
- wer
model-index:
- name: wav2vec2-large-xlsr-mvc-swahili
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: common_voice_13_0
      type: common_voice_13_0
      config: sw
      split: test
      args: sw
    metrics:
    - name: Wer
      type: wer
      value: 0.2
language:
- sw
---
# wav2vec2-large-xlsr-mvc-swahili
This model is a fine-tuned version of [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) on the Common Voice 13.0 Swahili (`sw`) dataset. It achieves a word error rate (WER) of 0.2 on the test split.
## How to use the model
Note: there was an issue with the vocabulary; it seems special characters were included that were not accounted for during training, so transcriptions may contain unexpected tokens.
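If you want to check this yourself, you can dump the checkpoint's CTC vocabulary (a minimal sketch using only standard `transformers` APIs; it only lists the tokens, it does not work around the issue):

```python
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("eddiegulay/wav2vec2-large-xlsr-mvc-swahili")

# Print every token in the CTC vocabulary, sorted by id,
# to spot any unexpected special characters
vocab = processor.tokenizer.get_vocab()
for token, token_id in sorted(vocab.items(), key=lambda item: item[1]):
    print(token_id, repr(token))
```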
To transcribe audio, you could try:
```python
import torch
import torchaudio
from transformers import AutoProcessor, AutoModelForCTC

repo_name = "eddiegulay/wav2vec2-large-xlsr-mvc-swahili"
processor = AutoProcessor.from_pretrained(repo_name)
model = AutoModelForCTC.from_pretrained(repo_name)

# Move the model to GPU if one is available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)


def transcribe(audio_path):
    # Load the audio file
    audio_input, sample_rate = torchaudio.load(audio_path)

    # Resample to the 16 kHz rate the model was trained on
    target_sample_rate = 16000
    resampler = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=target_sample_rate)
    audio_input = resampler(audio_input)

    # Preprocess the audio data
    input_dict = processor(audio_input[0], return_tensors="pt", padding=True, sampling_rate=target_sample_rate)

    # Perform inference and greedily decode the CTC output
    with torch.no_grad():
        logits = model(input_dict.input_values.to(device)).logits
    pred_ids = torch.argmax(logits, dim=-1)[0]
    transcription = processor.decode(pred_ids)

    return transcription


transcript = transcribe("your_audio.mp3")
print(transcript)
```
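Alternatively, the high-level `pipeline` API handles loading, resampling, and decoding in one call (a minimal sketch; reading mp3 files this way assumes `ffmpeg` is installed):

```python
from transformers import pipeline

# One-call ASR: the pipeline resamples the audio and decodes the CTC output
asr = pipeline("automatic-speech-recognition", model="eddiegulay/wav2vec2-large-xlsr-mvc-swahili")
result = asr("your_audio.mp3")
print(result["text"])
```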