---
language:
- it
license: apache-2.0
base_model: openai/whisper-small
tags:
- hf-asr-leaderboard
- generated_from_trainer
metrics:
- wer
datasets:
- mozilla-foundation/common_voice_11_0
model-index:
- name: Whisper Small IT
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice 11.0
      type: mozilla-foundation/common_voice_11_0
      args: default
    metrics:
    - name: Wer
      type: wer
      value: 200.40
---
|
|
|
# Whisper Small - Italian

This model is a fine-tuned version of [openai/whisper-small](https://huggingface.co/openai/whisper-small)
on the [Common Voice 11.0 dataset](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0).
It achieves the following results on the evaluation set:
- Loss: 0.4549
- Wer: 200.40
|
|
|
## Model description

Whisper is a pre-trained model for automatic speech recognition (ASR)
published in [September 2022](https://openai.com/blog/whisper/) by
Alec Radford et al. from OpenAI. Unlike many of its predecessors, such as
[Wav2Vec 2.0](https://arxiv.org/abs/2006.11477), which are pre-trained
on un-labelled audio data, Whisper is pre-trained on a vast quantity of
**labelled** audio-transcription data, 680,000 hours to be precise.
This is an order of magnitude more data than the un-labelled audio data used
to train Wav2Vec 2.0 (60,000 hours). What is more, 117,000 hours of this
pre-training data is multilingual ASR data. This results in checkpoints
that can be applied to over 96 languages, many of which are considered
_low-resource_.

When scaled to 680,000 hours of labelled pre-training data, Whisper models
demonstrate a strong ability to generalise to many datasets and domains.
The pre-trained checkpoints achieve results competitive with state-of-the-art
ASR systems, with near 3% word error rate (WER) on the test-clean subset of
LibriSpeech ASR and a new state of the art on TED-LIUM with 4.7% WER (_c.f._
Table 8 of the [Whisper paper](https://cdn.openai.com/papers/whisper.pdf)).
The extensive multilingual ASR knowledge acquired by Whisper during pre-training
can be leveraged for other low-resource languages; through fine-tuning, the
pre-trained checkpoints can be adapted for specific datasets and languages
to further improve upon these results.
|
|
|
## Intended uses & limitations

The goals of this fine-tuned model are to experiment and to allow the authors to
gain skills and knowledge of how this process is carried out. The model
serves as the basis for a small [Gradio-hosted](here) application
that transcribes recordings and audio files in Italian. The application also
lets users paste a link to an Italian YouTube video and obtain a transcription,
along the lines of the sketch below.
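
A minimal transcription sketch using the `transformers` ASR pipeline. The repo id below is a placeholder, since the card does not state where this checkpoint is hosted:

```python
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="your-username/whisper-small-it",  # hypothetical repo id
    chunk_length_s=30,  # Whisper processes audio in 30-second windows
)

# Transcribe a local recording; pinning language/task skips language
# detection and keeps the model in transcription (not translation) mode.
result = asr(
    "recording.wav",
    generate_kwargs={"language": "italian", "task": "transcribe"},
)
print(result["text"])
```

For the YouTube use case, the application would first extract the audio track with a downloader tool and then feed the resulting file to the same pipeline.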
|
|
|
The limitations of this project mainly concern the limited resources available
for fine-tuning, namely the free version of Google Colab and a Google Drive
used as feature storage, which had limited space. The time dedicated to this
project was also limited, as it had to fit within academic deadlines.
|
|
|
## Training and evaluation data

Training was carried out on the Google Colab platform, and the evaluation data
(like the whole dataset) was taken from the [Common Voice 11.0 dataset](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0),
reduced to 10% of its original size to keep training time manageable, as sketched below.
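
A sketch of the 10% subsampling with the `datasets` library. Split slicing is one straightforward way to do it; the original notebook may have subsampled differently:

```python
from datasets import load_dataset

# Keep only the first 10% of each split via split slicing.
# Common Voice 11.0 is gated: accepting its terms on the Hub and
# authenticating (e.g. huggingface-cli login) is required first.
train = load_dataset(
    "mozilla-foundation/common_voice_11_0", "it", split="train[:10%]"
)
test = load_dataset(
    "mozilla-foundation/common_voice_11_0", "it", split="test[:10%]"
)
```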
|
|
|
## Training procedure

Training was conducted on Google Colab, using a Jupyter notebook to write code and document the process. Google Drive was used as the feature store.
Due to the limited resources of the free version of Google Colab, checkpointing was used to save partial results and resume training in a
following run (see the sketch below). The notebook was run 15 times, at approximately 40 minutes per 100 training steps, for a total of 26.5 hours of training.
Keep in mind that Google Colab was available to us for no more than 4 hours a day, so around 7 days were needed for training alone.
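
The resume pattern is the standard `Trainer` one; a sketch, assuming `trainer` is a `Seq2SeqTrainer` whose `output_dir` points at the mounted Drive:

```python
# With save_steps=100, each Colab session picks up from the most
# recent checkpoint written to the Drive-backed output_dir.
trainer.train(resume_from_checkpoint=True)
```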
|
|
|
### Training hyperparameters

The following hyperparameters were used during training (see the sketch after this list for how they map onto `Seq2SeqTrainingArguments`):
- learning_rate: 1e-05
- train_batch_size: 16
- eval_batch_size: 8
- training_steps: 4000
- gradient_accumulation_steps: 2
- save_steps: 100
- eval_steps: 100
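
A sketch of how these values map onto `Seq2SeqTrainingArguments`. The `output_dir` path is a placeholder, and `evaluation_strategy`/`predict_with_generate` are assumptions needed for step-based WER evaluation, not values stated in the card:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="/content/drive/MyDrive/whisper-small-it",  # placeholder Drive path
    learning_rate=1e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=8,
    max_steps=4000,
    gradient_accumulation_steps=2,
    save_steps=100,
    eval_steps=100,
    evaluation_strategy="steps",  # assumption: evaluate every eval_steps
    predict_with_generate=True,   # assumption: generate text to score WER
)
```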
|
|
|
### Training results

| Run Number | Step | Training Loss | Validation Loss | WER |
|:----------:|:----:|:-------------:|:---------------:|:------:|
| 1  | 100  | 1.2396 | 1.2330 | 176.40 |
| 2  | 200  | 0.7389 | 0.8331 | 80.49  |
| 2  | 300  | 0.2951 | 0.4261 | 70.20  |
| 2  | 400  | 0.2703 | 0.4051 | 101.60 |
| 3  | 500  | 0.2491 | 0.3923 | 112.20 |
| 3  | 600  | 0.1700 | 0.3860 | 107.10 |
| 3  | 700  | 0.1603 | 0.3836 | 90.36  |
| 4  | 800  | 0.1607 | 0.3786 | 135.00 |
| 4  | 900  | 0.1540 | 0.3783 | 99.05  |
| 4  | 1000 | 0.1562 | 0.3667 | 98.32  |
| 4  | 1100 | 0.0723 | 0.3757 | 158.90 |
| 5  | 1200 | 0.0769 | 0.3789 | 215.20 |
| 5  | 1300 | 0.0814 | 0.3779 | 170.50 |
| 5  | 1400 | 0.0786 | 0.3770 | 140.60 |
| 5  | 1500 | 0.0673 | 0.3777 | 137.10 |
| 6  | 1600 | 0.0339 | 0.3892 | 166.50 |
| 7  | 1700 | 0.0324 | 0.3963 | 170.90 |
| 7  | 1800 | 0.0348 | 0.4004 | 163.40 |
| 8  | 1900 | 0.0345 | 0.4016 | 158.60 |
| 8  | 2000 | 0.0346 | 0.4020 | 176.10 |
| 8  | 2100 | 0.0317 | 0.4001 | 134.70 |
| 9  | 2200 | 0.0173 | 0.4141 | 189.30 |
| 9  | 2300 | 0.0174 | 0.4106 | 175.00 |
| 9  | 2400 | 0.0165 | 0.4204 | 179.60 |
| 10 | 2500 | 0.0172 | 0.4185 | 186.10 |
| 10 | 2600 | 0.0142 | 0.4175 | 181.10 |
| 11 | 2700 | 0.0090 | 0.4325 | 161.70 |
| 11 | 2800 | 0.0069 | 0.4362 | 161.20 |
| 11 | 2900 | 0.0093 | 0.4342 | 157.50 |
| 12 | 3000 | 0.0076 | 0.4352 | 154.50 |
| 12 | 3100 | 0.0089 | 0.4394 | 184.30 |
| 13 | 3200 | 0.0063 | 0.4454 | 166.00 |
| 13 | 3300 | 0.0059 | 0.4476 | 179.20 |
| 13 | 3400 | 0.0058 | 0.4490 | 189.60 |
| 14 | 3500 | 0.0051 | 0.4502 | 194.20 |
| 14 | 3600 | 0.0064 | 0.4512 | 187.40 |
| 14 | 3700 | 0.0053 | 0.4520 | 190.20 |
| 14 | 3800 | 0.0049 | 0.4545 | 194.90 |
| 15 | 3900 | 0.0052 | 0.4546 | 199.60 |
| 15 | 4000 | 0.0054 | 0.4549 | 200.40 |
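
The WER column is conventionally computed with the `evaluate` library and scaled to a percentage; a sketch with illustrative placeholder strings (values above 100 can occur when hypotheses contain far more words than the references):

```python
import evaluate

wer_metric = evaluate.load("wer")

# predictions/references here are made-up placeholders.
wer = 100 * wer_metric.compute(
    predictions=["una trascrizione di esempio"],
    references=["una frase di riferimento"],
)
print(f"WER: {wer:.2f}")
```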
|
|
|
### Framework versions

- Transformers 4.36.0.dev0
- Pytorch 2.1.0+cu118
- Datasets 2.15.0
- Tokenizers 0.15.0