---
license: apache-2.0
language:
- es
- ca
base_model:
- openai/whisper-large-v3
pipeline_tag: automatic-speech-recognition
library_name: transformers
tags:
- bsc
- projecte-aina
- barcelona-supercomputing-center
- automatic-speech-recognition
- whisper-large-v3
- code-switching
- spanish-catalan
- spanish
- catalan
---

# whisper-large-v3-tiny-caesar

## Table of Contents

- [Model Description](#model-description)
- [Intended Uses and Limitations](#intended-uses-and-limitations)
- [How to Get Started with the Model](#how-to-get-started-with-the-model)
- [Training Details](#training-details)
- [Citation](#citation)
- [Additional Information](#additional-information)

## Summary

The "whisper-large-v3-tiny-caesar" is an acoustic model based on ["openai/whisper-large-v3"](https://huggingface.co/openai/whisper-large-v3) suitable for Automatic Speech Recognition in code-switching conditions between Spanish and Catalan.

## Model Description

The "whisper-large-v3-tiny-caesar" model is the result of fine-tuning ["openai/whisper-large-v3"](https://huggingface.co/openai/whisper-large-v3) with 2 hours of synthetic Spanish/Catalan code-switching data generated by [Projecte AINA](https://projecteaina.cat/) in Barcelona, Spain.

CAESAR is an acronym with the following meaning: (CA)talan (ES)panish (A)utomatic (R)ecognition. The "tiny" suffix indicates that this model was fine-tuned with a very small amount of synthetic data (only 2 hours).

## Intended Uses and Limitations

This model can be used for Automatic Speech Recognition (ASR) in code-switching conditions between Spanish and Catalan. The model is intended to transcribe audio files to plain text.

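As a rough illustration of the "audio file to plain text" use case, the sketch below wraps the generic `transformers` speech recognition pipeline. It is not taken from the authors' notebook: the helper names `build_asr` and `transcribe_file` and the path `audio.wav` are hypothetical, and it assumes that `transformers` and `torch` are installed and that `ffmpeg` is available for decoding audio files.

```python
MODEL_NAME = "projecte-aina/whisper-large-v3-tiny-caesar"


def build_asr(device="cpu"):
    """Create a speech recognition pipeline for this model (hypothetical helper).

    transformers is imported lazily so that defining the helper has no cost.
    Pass device="cuda:0" to run on the first GPU.
    """
    from transformers import pipeline

    return pipeline(
        "automatic-speech-recognition",
        model=MODEL_NAME,
        device=device,
    )


def transcribe_file(path, asr):
    """Return the plain-text transcription of one audio file.

    `asr` is any callable that maps a file path to a dict with a "text" key,
    which is what the pipeline returned by build_asr() produces.
    """
    result = asr(path)
    return result["text"].strip()


# Example usage (downloads the model on first call; "audio.wav" is a placeholder):
# asr = build_asr(device="cuda:0")
# print(transcribe_file("audio.wav", asr))
```

Keeping the pipeline as an argument means it is built once and reused across files, which matters because loading a large Whisper checkpoint is far slower than transcribing a short clip.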
## How to Get Started with the Model

To see an updated and functional version of this code, please see our [Notebook](https://colab.research.google.com/drive/1MHiPrffNTwiyWeUyMQvSdSbfkef_8aJC?usp=sharing).

### Installation

To use this model, install [datasets](https://huggingface.co/docs/datasets/installation) and [transformers](https://huggingface.co/docs/transformers/installation).

Create a virtual environment:

```bash
python -m venv /path/to/venv
```

Activate the environment:

```bash
source /path/to/venv/bin/activate
```

Install the modules:

```bash
pip install datasets transformers
```

### For Inference

To transcribe audio with this model and measure its word error rate (WER) on the test split of the Catalan corpus [3catparla_asr](https://huggingface.co/datasets/projecte-aina/3catparla_asr), you can follow this example:

```bash
# Install prerequisites
pip install torch
pip install datasets
pip install 'transformers[torch]'
pip install evaluate
pip install jiwer
```

```python
# This code requires a CUDA-capable GPU.
# Note (November 2024): load_metric is no longer part of datasets,
# so the WER metric is loaded with evaluate.load instead.

import torch
from datasets import Audio, load_dataset
from evaluate import load
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# Load the processor and model.
MODEL_NAME = "projecte-aina/whisper-large-v3-tiny-caesar"
processor = WhisperProcessor.from_pretrained(MODEL_NAME)
model = WhisperForConditionalGeneration.from_pretrained(MODEL_NAME).to("cuda")

# Load the dataset.
ds = load_dataset("projecte-aina/3catparla_asr", split="test")

# Resample the audio to 16 kHz.
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))


# Transcribe one example and normalize both the reference and the prediction.
def map_to_pred(batch):
    audio = batch["audio"]
    input_features = processor(
        audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt"
    ).input_features
    batch["reference"] = processor.tokenizer._normalize(batch["normalized_text"])

    with torch.no_grad():
        predicted_ids = model.generate(input_features.to("cuda"))[0]

    transcription = processor.decode(predicted_ids)
    batch["prediction"] = processor.tokenizer._normalize(transcription)
    return batch


# Run the evaluation.
result = ds.map(map_to_pred)

# Compute the overall WER (in percent).
wer = load("wer")
WER = 100 * wer.compute(references=result["reference"], predictions=result["prediction"])
print(WER)
```

## Training Details

### Training data

The model was trained on a corpus called CAESAR-tiny, which has not yet been released.

### Training procedure

This model is the result of fine-tuning ["openai/whisper-large-v3"](https://huggingface.co/openai/whisper-large-v3) by following this [tutorial](https://huggingface.co/blog/fine-tune-whisper) provided by Hugging Face.

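To make the scale of this fine-tuning run concrete, the sketch below works through the arithmetic implied by the values listed under Training Hyperparameters. Two points are assumptions rather than facts from this card: that "32 (x4 GPUs)" means a per-device batch of 32 on each of 4 GPUs, and that the learning rate follows a linear warmup and linear decay schedule (the card does not name the scheduler). The function names are illustrative only.

```python
# Values copied from the Training Hyperparameters list in this card.
LEARNING_RATE = 1e-5
PER_DEVICE_BATCH = 32
NUM_GPUS = 4          # assumption: "32 (x4 GPUs)" is a per-device batch on 4 GPUs
GRAD_ACCUM_STEPS = 1
MAX_STEPS = 80
WARMUP_STEPS = 8


def effective_batch_size():
    """Number of samples consumed per optimizer step."""
    return PER_DEVICE_BATCH * NUM_GPUS * GRAD_ACCUM_STEPS


def samples_seen():
    """Total number of training samples processed over the whole run."""
    return effective_batch_size() * MAX_STEPS


def linear_lr(step):
    """Learning rate at a given step under an ASSUMED linear schedule:
    linear warmup from 0 to the peak, then linear decay back to 0."""
    if step < WARMUP_STEPS:
        return LEARNING_RATE * step / WARMUP_STEPS
    remaining = max(0, MAX_STEPS - step)
    return LEARNING_RATE * remaining / (MAX_STEPS - WARMUP_STEPS)


if __name__ == "__main__":
    print("effective batch size:", effective_batch_size())
    print("samples seen:", samples_seen())
    for step in (0, 4, 8, 44, 80):
        print(step, linear_lr(step))
```

Under these assumptions the run processes 128 samples per step and 10,240 samples in total, with the warmup covering the first 10% of the steps.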
### Training Hyperparameters

* language: Spanish
* hours of training audio: 2
* learning rate: 1e-5
* sample rate: 16000
* train batch size: 32 (x4 GPUs)
* gradient accumulation steps: 1
* eval batch size: 32
* save total limit: 3
* max steps: 80
* warmup steps: 8
* eval steps: 8
* save steps: 8
* shuffle buffer size: 480

## Citation

If this model contributes to your research, please cite the work:

```bibtex
@misc{mena2024whisperlarge3catparla,
      title={Acoustic Model in Catalan: whisper-large-v3-tiny-caesar.},
      author={Hernandez Mena, Carlos Daniel and Giraldo, Jose and Armentano-Oller, Carme and Solito, Sarah and Messaoudi, Abir and Acosta, Federico and Zeballos, Rodolfo},
      organization={Barcelona Supercomputing Center},
      url={https://huggingface.co/projecte-aina/whisper-large-v3-tiny-caesar},
      year={2024}
}
```

## Additional Information

### Author

The fine-tuning process was performed in November 2024 in the [Language Technologies Unit](https://huggingface.co/BSC-LT) of the [Barcelona Supercomputing Center](https://www.bsc.es/) by [Carlos Daniel Hernández Mena](https://huggingface.co/carlosdanielhernandezmena).

### Contact

For further information, please send an email to .

### Copyright

Copyright (c) 2024 by Language Technologies Unit, Barcelona Supercomputing Center.

### License

[Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0)

### Funding

This work has been promoted and financed by the Generalitat de Catalunya through the [Aina project](https://projecteaina.cat/).

The training of the model was made possible by the compute time provided by the [Barcelona Supercomputing Center](https://www.bsc.es/) through MareNostrum 5.