Model Card for whisper-small-es-cl
Finetuned Whisper Model for Automatic Speech Recognition in Spanish from Chile
Model Details
Whisper is a model from OpenAI, based on an encoder-decoder Transformer architecture, designed to generate transcription text from audio sequences. This model can be used with the WhisperX pipeline developed by @m-bain, which integrates the transcription text with the pyannote.audio library to perform diarization. Diarization separates and classifies the speakers in the audio, producing a transcription segmented by speaker. The process returns a transcription object whose segments are associated with specific speakers.
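The pipeline described above can be sketched as follows (a hedged illustration based on the WhisperX API as documented in @m-bain's repository; the model path, audio file name, and Hugging Face token are placeholders, and the checkpoint must already be in faster-whisper format, as noted under Recommendations):

```python
import whisperx

device = "cuda"  # or "cpu"

# Load a faster-whisper-format model (placeholder path; see the
# Recommendations section on converting this checkpoint to that format).
model = whisperx.load_model("path/to/whisper-small-es-cl-ct2", device, language="es")

# Transcribe the audio file (placeholder path).
audio = whisperx.load_audio("conversation.wav")
result = model.transcribe(audio)

# Align word timestamps, then diarize with pyannote (requires a HF token).
align_model, metadata = whisperx.load_align_model(language_code="es", device=device)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)

diarize_model = whisperx.DiarizationPipeline(use_auth_token="hf_...", device=device)
diarize_segments = diarize_model(audio)

# Each segment now carries a "speaker" label alongside its text.
result = whisperx.assign_word_speakers(diarize_segments, result)
```

This sketch requires a GPU (or `device="cpu"`), the `whisperx` package, and pyannote access, so it is not runnable as-is without those resources.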
Model Description
- Developed by: OpenAI
- Model type: Automatic Speech Recognition
- Language(s) (NLP): Spanish from Chile
- License: MIT
- Finetuned from model: openai/whisper-small
Model Details
- Name: whisper-small-es-cl
- Model Type: Sequence-to-sequence automatic speech recognition
- Parameters: 244M
- Hidden Width: 768
- Attention Heads: 12
- Layers: 12
- Input Activation Function: GELU
- Output Activation Function: Softmax
Model Sources
- Repository: openai/whisper-small
- Paper: Robust Speech Recognition via Large-Scale Weak Supervision https://arxiv.org/abs/2212.04356
Uses
Designed for automatic speech recognition, the model generates transcriptions of audio conversations in Spanish.
Direct Use
Returns the transcription text of a voice recording, optimized for Chilean Spanish.
Out-of-Scope Use
Commercial use. The model is intended for research purposes only.
Bias, Risks, and Limitations
This model is an experimental example; many specific words, e.g. names or brands, are not recognized by the model. The effectiveness of the log-mel spectrogram features depends on the amount and variety of data on which the model has been trained; in this particular case, more data will be needed, but with the same dataset structure. Access to this model is restricted, and the specific details about the dataset are private.
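For reference, Whisper's input features are 80-channel log-mel spectrograms computed from 16 kHz audio with a 400-sample FFT window and a 160-sample hop. A minimal NumPy sketch of that front end (a simplification of the real preprocessing, which also pads and normalizes the signal):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels=80, n_fft=400, sr=16000):
    """Triangular mel filterbank mapping FFT bins to mel channels."""
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for j in range(left, center):        # rising slope
            fb[i, j] = (j - left) / max(center - left, 1)
        for j in range(center, right):       # falling slope
            fb[i, j] = (right - j) / max(right - center, 1)
    return fb

def log_mel_spectrogram(audio, sr=16000, n_fft=400, hop=160, n_mels=80):
    """Framed, windowed power spectrum projected onto mel filters, in log10."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(audio) - n_fft) // hop
    frames = np.stack([audio[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    mel = power @ mel_filterbank(n_mels, n_fft, sr).T
    return np.log10(np.maximum(mel, 1e-10))
```

One second of 16 kHz audio yields 98 frames of 80 mel channels each with these settings.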
Recommendations
Use with a diarization pipeline such as WhisperX to generate the diarized transcription; for this particular case, the model must first be converted to the faster-whisper (CTranslate2) format.
How to Get Started with the Model
Use the code in the following repository to download, transform, and test the model: https://github.com/al-mldev/whisperX-RAG-Analytics
Training Details
Training Data
The model was fine-tuned with approximately 1,800 audio samples, including segmented tagged audios and audios tagged with specific words.
Training Procedure
This model was finetuned on a dataset built from segmented tagged audio and audios tagged with specific words. The split was 90/5/5: 90% of the data for training, 5% for testing, and 5% for validation.
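The 90/5/5 split can be sketched as follows (a minimal illustration; the function name and seed are placeholders, not the actual training code):

```python
import random

def split_dataset(samples, train_frac=0.90, test_frac=0.05, seed=42):
    """Shuffle and split into train/test/validation; the remainder
    after the train and test fractions goes to validation."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_test = int(len(shuffled) * test_frac)
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]
    return train, test, validation
```

With the ~1,800 samples mentioned above, this yields 1,620 training, 90 test, and 90 validation samples.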
Training Hyperparameters
- Loss Function: Cross Entropy
- Learning Rate: 2.25 × 10^-5
- Optimization Function: AdamW
- Regularization: Dropout (0.1-0.3)
- Epochs: 22
- Batch Size: 16
- Gradient Accumulation Steps: 1
- Warmup Steps: 25
- Max Steps: 200
- Eval Step: 50
- Save Step: 50
- Eval Batch Size: 8
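The hyperparameters above can be collected into a plain configuration dictionary (an illustrative mapping onto transformers `Seq2SeqTrainingArguments`-style field names; the exact argument names used during training are an assumption):

```python
# Hyperparameters from the table above; field names follow the
# transformers Seq2SeqTrainingArguments convention (assumed).
training_config = {
    "learning_rate": 2.25e-5,
    "optim": "adamw_torch",
    "dropout": 0.1,                        # reported range was 0.1-0.3
    "num_train_epochs": 22,
    "per_device_train_batch_size": 16,
    "gradient_accumulation_steps": 1,
    "warmup_steps": 25,
    "max_steps": 200,
    "eval_steps": 50,
    "save_steps": 50,
    "per_device_eval_batch_size": 8,
}
```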
Testing Data, Factors & Summary
Testing Data
Factors
- Audio Quality: The model was tested with various audio qualities, including noisy and clean audio.
Summary
WER: 25.49%
Training Loss: 2 × 10^-3
Validation Loss: 5.85 × 10^-1
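The reported WER is the word-level edit distance (substitutions, deletions, and insertions) between the reference and hypothesis transcripts, divided by the number of reference words. A minimal sketch of the metric:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length.
    Assumes a non-empty reference."""
    ref, hyp = reference.split(), hypothesis.split()
    # Single-row dynamic-programming edit distance over words.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev_diag, d[0] = d[0], i
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            prev_diag, d[j] = d[j], min(d[j] + 1,        # deletion
                                        d[j - 1] + 1,    # insertion
                                        prev_diag + cost)  # substitution/match
    return d[len(hyp)] / len(ref)
```

For example, one substituted word in a three-word reference gives a WER of 1/3 ≈ 33.3%.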
Environmental Impact
- Hardware Type: T4
- Hours used: 0.5
- Cloud Provider: Google Cloud Platform
- Compute Region: southamerica-east1
- Carbon Emitted: 0.01 kg of CO2
Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).