whisper-large-v3-tiny-caesar

Model Description
Intended Uses and Limitations
How to Get Started with the Model
Training Details
Citation
Additional Information

Summary

"whisper-large-v3-tiny-caesar" is an acoustic model based on "openai/whisper-large-v3", suitable for Automatic Speech Recognition in code-switching conditions between Spanish and Catalan.

Model Description

"whisper-large-v3-tiny-caesar" is an acoustic model suitable for Automatic Speech Recognition in code-switching conditions between Spanish and Catalan. It is the result of fine-tuning the model "openai/whisper-large-v3" with 2 hours of synthetic Spanish/Catalan code-switching data generated by Projecte AINA in Barcelona, Spain.

CAESAR is an acronym with the following meaning:

(CA)talan (ES)panish (A)utomatic (R)ecognition

The "tiny" in the name indicates that this model was fine-tuned with a very small amount of synthetic data (only 2 hours).

Intended Uses and Limitations

This model can be used for Automatic Speech Recognition (ASR) in code-switching conditions between Spanish and Catalan. It is intended to transcribe audio files to plain text.
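As a rough illustration of that use case, the sketch below transcribes a single audio file with the transformers pipeline API. The function name transcribe_file and the file name example.wav are hypothetical, and the sketch assumes that torch, transformers and ffmpeg (needed to decode files from a path) are installed. It is meant for short clips; longer recordings may need extra pipeline options.

```python
MODEL_NAME = "projecte-aina/whisper-large-v3-tiny-caesar"


def transcribe_file(audio_path):
    """Transcribe one audio file and return the recognized text.

    Hypothetical helper for illustration; not part of the model repository.
    """
    # Local imports so the heavy libraries are loaded only when the
    # function is actually called.
    import torch
    from transformers import pipeline

    # Fall back to CPU when no GPU is available (slow for a model this size).
    device = "cuda" if torch.cuda.is_available() else "cpu"

    # The ASR pipeline bundles feature extraction, generation and decoding.
    asr = pipeline(
        "automatic-speech-recognition",
        model=MODEL_NAME,
        device=device,
    )

    # Given a file path, the pipeline decodes and resamples the audio itself.
    result = asr(audio_path)
    return result["text"]
```

Calling transcribe_file("example.wav") would download the model on first use and return the transcription as a plain string.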

How to Get Started with the Model

For an up-to-date, working version of this code, please see our Notebook.

Installation

In order to use this model, you may install datasets and transformers:

Create a virtual environment:

python -m venv /path/to/venv

Activate the environment:

source /path/to/venv/bin/activate

Install the modules:

pip install datasets transformers

For Inference

To transcribe audio with this model and compute the word error rate (WER) on the test split of the 3CatParla dataset, you can follow this example:

#Install prerequisites (run these commands in a shell)
pip install torch
pip install datasets
pip install 'transformers[torch]'
pip install evaluate
pip install jiwer

#This code requires a CUDA-capable GPU.

#Note (November 2024): load_metric is no longer part of datasets;
#use load from the evaluate library instead.

import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

#Load the processor and model.
MODEL_NAME="projecte-aina/whisper-large-v3-tiny-caesar"
processor = WhisperProcessor.from_pretrained(MODEL_NAME)
model = WhisperForConditionalGeneration.from_pretrained(MODEL_NAME).to("cuda")

#Load the dataset
from datasets import load_dataset, Audio
ds=load_dataset("projecte-aina/3catparla_asr",split='test')

#Downsample to 16kHz
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

#Process the dataset
def map_to_pred(batch):
    audio = batch["audio"]
    input_features = processor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt").input_features
    batch["reference"] = processor.tokenizer._normalize(batch['normalized_text'])

    with torch.no_grad():
        predicted_ids = model.generate(input_features.to("cuda"))[0]
    
    transcription = processor.decode(predicted_ids)
    batch["prediction"] = processor.tokenizer._normalize(transcription)
    
    return batch
    
#Do the evaluation
result = ds.map(map_to_pred)

#Compute the overall WER now.
from evaluate import load

wer = load("wer")
WER=100 * wer.compute(references=result["reference"], predictions=result["prediction"])
print(WER)
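To make the final number easier to interpret, here is a minimal pure-Python sketch of a corpus-level word error rate: the total number of word substitutions, deletions and insertions divided by the total number of reference words. It is an illustration of the metric, not the implementation used by the evaluate library, and the function names and example sentences are made up for this sketch.

```python
def edit_distance(ref_words, hyp_words):
    """Minimum number of word substitutions, deletions and insertions
    needed to turn hyp_words into ref_words (Levenshtein distance)."""
    # previous[j] holds the distance between the reference prefix processed
    # so far and the first j hypothesis words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def word_error_rate(references, predictions):
    """Corpus-level WER: total word edits over total reference words."""
    total_edits = 0
    total_words = 0
    for ref, hyp in zip(references, predictions):
        ref_words = ref.split()
        hyp_words = hyp.split()
        total_edits += edit_distance(ref_words, hyp_words)
        total_words += len(ref_words)
    return total_edits / total_words


# Made-up example: one missing word in the first pair,
# two extra words in the second pair.
references = ["bon dia a tothom", "muchas gracias"]
predictions = ["bon dia tothom", "muchas gracias por todo"]
print(100 * word_error_rate(references, predictions))  # prints 50.0
```

Because the denominator is the number of reference words, the WER can exceed 100 when the predictions contain many insertions.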

Training Details

Training data

The dataset used to create the model is a corpus called CAESAR-tiny, which has not yet been released.

Training procedure

This model is the result of fine-tuning the model "openai/whisper-large-v3" by following this tutorial provided by Hugging Face.

Training Hyperparameters

language: Spanish
hours of training audio: 2
learning rate: 1e-5
sample rate: 16000
train batch size: 32 (x4 GPUs)
gradient accumulation steps: 1
eval batch size: 32
save total limit: 3
max steps: 80
warmup steps: 8
eval steps: 8
save steps: 8
shuffle buffer size: 480
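The short sketch below works out a few quantities implied by these hyperparameters and shows how they might map onto keyword arguments of Seq2SeqTrainingArguments from transformers. It assumes that "32 (x4 GPUs)" means a per-device batch size of 32 on 4 GPUs; the variable names and the mapping are illustrative and are not taken from the authors' training script.

```python
# Values copied from the hyperparameter list above.
PER_DEVICE_TRAIN_BATCH_SIZE = 32
NUM_GPUS = 4  # assumption: "32 (x4 GPUs)" means 32 samples per GPU
GRADIENT_ACCUMULATION_STEPS = 1
MAX_STEPS = 80
WARMUP_STEPS = 8
EVAL_STEPS = 8

# Samples consumed per optimizer step across all GPUs.
effective_batch_size = (
    PER_DEVICE_TRAIN_BATCH_SIZE * NUM_GPUS * GRADIENT_ACCUMULATION_STEPS
)

# Total training samples drawn over the whole run (with repetition).
samples_seen = effective_batch_size * MAX_STEPS

# Share of the run spent warming up the learning rate.
warmup_fraction = WARMUP_STEPS / MAX_STEPS

# Number of evaluation passes during training.
num_evaluations = MAX_STEPS // EVAL_STEPS

print(effective_batch_size)  # prints 128
print(samples_seen)          # prints 10240
print(num_evaluations)       # prints 10

# Illustrative mapping onto Seq2SeqTrainingArguments keyword arguments;
# it could be passed as Seq2SeqTrainingArguments(output_dir=..., **training_kwargs).
training_kwargs = {
    "learning_rate": 1e-5,
    "per_device_train_batch_size": PER_DEVICE_TRAIN_BATCH_SIZE,
    "per_device_eval_batch_size": 32,
    "gradient_accumulation_steps": GRADIENT_ACCUMULATION_STEPS,
    "max_steps": MAX_STEPS,
    "warmup_steps": WARMUP_STEPS,
    "eval_steps": EVAL_STEPS,
    "save_steps": 8,
    "save_total_limit": 3,
}
```

Under that assumption, each optimizer step processes 128 utterances, and warmup covers the first tenth of the 80 steps.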

Citation

If this model contributes to your research, please cite the work:

@misc{mena2024whisperlarge3catparla,
      title={Acoustic Model in Catalan: whisper-large-v3-tiny-caesar.}, 
      author={Hernandez Mena, Carlos Daniel and Giraldo, Jose and Armentano-Oller, Carme and Solito, Sarah and Messaoudi, Abir and Costa, Federico and Zeballos, Rodolfo},
      organization={Barcelona Supercomputing Center},
      url={https://huggingface.co/projecte-aina/whisper-large-v3-tiny-caesar},
      year={2024}
}

Additional Information

Author

The fine-tuning process was performed in November 2024 at the Language Technologies Unit of the Barcelona Supercomputing Center by Carlos Daniel Hernández Mena.

Contact

For further information, please send an email to langtech@bsc.es.

Copyright

License

Apache-2.0

Funding

This work has been promoted and financed by the Generalitat de Catalunya through the Aina project.

The training of the model was possible thanks to the compute time provided by Barcelona Supercomputing Center through MareNostrum 5.
