---
language:
- it
license: apache-2.0
base_model: openai/whisper-small
tags:
- hf-asr-leaderboard
- generated_from_trainer
metrics:
- wer
model-index:
- name: Whisper Small IT
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice 11.0
      type: mozilla-foundation/common_voice_11_0
      args: it
    metrics:
    - name: Wer
      type: wer
      value: 200.40
datasets:
- mozilla-foundation/common_voice_11_0
---

# Whisper Small - Italian

This model is a fine-tuned version of [openai/whisper-small](https://huggingface.co/openai/whisper-small) 
on the [Common-voice-11.0 dataset](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0).
It achieves the following results on the evaluation set:
- Loss: 0.4549
- Wer: 200.40
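
A WER above 100% is possible because WER counts insertions as errors but normalizes by the reference length only, so a hypothesis much longer than the reference can push the metric past 100. A minimal word-level WER sketch (a hypothetical helper for illustration, not the metric implementation used during training):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent: (substitutions + deletions + insertions) / reference words."""
    r, h = reference.split(), hypothesis.split()
    # Levenshtein distance on words via dynamic programming.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution or match
    return 100 * d[len(r)][len(h)] / len(r)
```

For example, `wer("ciao mondo", "ciao ciao mondo mondo bello")` is 150.0: three inserted words against a two-word reference.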

## Model description

Whisper is a pre-trained model for automatic speech recognition (ASR)
published in [September 2022](https://openai.com/blog/whisper/) by the authors
Alec Radford et al. from OpenAI. Unlike many of its predecessors, such as
[Wav2Vec 2.0](https://arxiv.org/abs/2006.11477), which are pre-trained
on un-labelled audio data, Whisper is pre-trained on a vast quantity of
**labelled** audio-transcription data, 680,000 hours to be precise.
This is an order of magnitude more data than the un-labelled audio data used
to train Wav2Vec 2.0 (60,000 hours). What is more, 117,000 hours of this
pre-training data is multilingual ASR data. This results in checkpoints
that can be applied to over 96 languages, many of which are considered
_low-resource_.

When scaled to 680,000 hours of labelled pre-training data, Whisper models
demonstrate a strong ability to generalise to many datasets and domains.
The pre-trained checkpoints achieve results competitive with state-of-the-art
ASR systems, with near 3% word error rate (WER) on the test-clean subset of
LibriSpeech ASR and a new state-of-the-art on TED-LIUM with 4.7% WER (_c.f._
Table 8 of the [Whisper paper](https://cdn.openai.com/papers/whisper.pdf)).
The extensive multilingual ASR knowledge acquired by Whisper during pre-training
can be leveraged for other low-resource languages; through fine-tuning, the
pre-trained checkpoints can be adapted for specific datasets and languages
to further improve upon these results.

## Intended uses & limitations

The goal of this fine-tuned model is experimentation: it allowed the authors to
gain hands-on experience with the fine-tuning process. The model serves as the
basis for a small [Gradio-hosted](here) application that transcribes recordings
and audio files in Italian. The application also lets users submit a YouTube
link to an Italian video and obtain a transcription.

The main limitations of this project stem from the limited resources available
for fine-tuning, namely the free version of Google Colab and a Google Drive
account (used as feature storage) with limited space. The time dedicated to the
project was also limited, as it had to fit within academic deadlines.

## Training and evaluation data

Training was carried out on the Google Colab platform. The evaluation data, like the rest of the dataset,
was taken from the [Common-voice-11.0 dataset](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0),
reduced to 10% of its original size to keep training time manageable.

## Training procedure

Training was conducted on Google Colab, using a Jupyter notebook to write code and document the process, with Google Drive serving as the feature store.
Because the free version of Google Colab offers only limited session time, checkpointing was used to save partial results and resume training in a
subsequent run. The notebook was run 15 times, at approximately 40 minutes per 100 training steps, for a total of about 26.5 hours of training.
Keep in mind that Google Colab was available to us for no more than 4 hours per day, so training alone took around 7 days.
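
The save-and-resume pattern described above can be sketched in pure Python, independent of the actual Trainer API (function and file names here are hypothetical, and the per-session step budget stands in for Colab's session limit):

```python
import json
import os
import tempfile

def train(workdir: str, total_steps: int = 400, save_steps: int = 100,
          budget: int = 250) -> int:
    """Run at most `budget` steps this session, resuming from the last checkpoint."""
    ckpt = os.path.join(workdir, "checkpoint.json")
    step = 0
    if os.path.exists(ckpt):
        with open(ckpt) as f:
            step = json.load(f)["step"]  # resume from the last saved step
    done = 0
    while step < total_steps and done < budget:
        step += 1   # one (simulated) optimizer step
        done += 1
        if step % save_steps == 0:  # periodic save, like save_steps=100
            with open(ckpt, "w") as f:
                json.dump({"step": step}, f)
    return step

workdir = tempfile.mkdtemp()
s1 = train(workdir)  # first session stops when its budget runs out
s2 = train(workdir)  # second session resumes from the last checkpoint
```

Note that the first session's steps past the last save point are repeated in the next session, which mirrors how real checkpointed training loses any progress made since the most recent save.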

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 16
- eval_batch_size: 8
- training_steps: 4000
- gradient_accumulation_steps: 2
- save_steps: 100
- eval_steps: 100
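
One consequence of these settings worth noting: with gradient accumulation, the effective batch size is the per-device batch size times the accumulation steps. A small sketch (hypothetical dictionary names, values taken from the list above):

```python
# Hyperparameters as listed above.
hparams = {
    "learning_rate": 1e-05,
    "train_batch_size": 16,
    "eval_batch_size": 8,
    "training_steps": 4000,
    "gradient_accumulation_steps": 2,
    "save_steps": 100,
    "eval_steps": 100,
}

# Gradients are accumulated over 2 forward passes before each optimizer step,
# so each training step effectively sees 16 * 2 = 32 examples.
effective_batch = hparams["train_batch_size"] * hparams["gradient_accumulation_steps"]
```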
  
### Training results

| Run Number    | Step         | Training Loss     | Validation Loss                | Wer                        |
|:-------------:|:------------:|:-----------------:|:------------------------------:|:--------------------------:|
| 1             | 100          | 1.2396            | 1.2330                         | 176.40                     |
| 2             | 200          | 0.7389            | 0.8331                         |  80.49                     |
| 2             | 300          | 0.2951            | 0.4261                         |  70.20                     |
| 2             | 400          | 0.2703            | 0.4051                         | 101.60                     |
| 3             | 500          | 0.2491            | 0.3923                         | 112.20                     |
| 3             | 600          | 0.1700            | 0.3860                         | 107.10                     |
| 3             | 700          | 0.1603            | 0.3836                         |  90.36                     |
| 4             | 800          | 0.1607            | 0.3786                         | 135.00                     |
| 4             | 900          | 0.1540            | 0.3783                         |  99.05                     |
| 4             | 1000         | 0.1562            | 0.3667                         |  98.32                     |
| 4             | 1100         | 0.0723            | 0.3757                         | 158.90                     |
| 5             | 1200         | 0.0769            | 0.3789                         | 215.20                     |
| 5             | 1300         | 0.0814            | 0.3779                         | 170.50                     |
| 5             | 1400         | 0.0786            | 0.3770                         | 140.60                     |
| 5             | 1500         | 0.0673            | 0.3777                         | 137.10                     |
| 6             | 1600         | 0.0339            | 0.3892                         | 166.50                     |
| 7             | 1700         | 0.0324            | 0.3963                         | 170.90                     |
| 7             | 1800         | 0.0348            | 0.4004                         | 163.40                     |
| 8             | 1900         | 0.0345            | 0.4016                         | 158.60                     |
| 8             | 2000         | 0.0346            | 0.4020                         | 176.10                     |
| 8             | 2100         | 0.0317            | 0.4001                         | 134.70                     |
| 9             | 2200         | 0.0173            | 0.4141                         | 189.30                     |
| 9             | 2300         | 0.0174            | 0.4106                         | 175.00                     |
| 9             | 2400         | 0.0165            | 0.4204                         | 179.60                     |
| 10            | 2500         | 0.0172            | 0.4185                         | 186.10                     |
| 10            | 2600         | 0.0142            | 0.4175                         | 181.10                     |
| 11            | 2700         | 0.0090            | 0.4325                         | 161.70                     |
| 11            | 2800         | 0.0069            | 0.4362                         | 161.20                     |
| 11            | 2900         | 0.0093            | 0.4342                         | 157.50                     |
| 12            | 3000         | 0.0076            | 0.4352                         | 154.50                     |
| 12            | 3100         | 0.0089            | 0.4394                         | 184.30                     |
| 13            | 3200         | 0.0063            | 0.4454                         | 166.00                     |
| 13            | 3300         | 0.0059            | 0.4476                         | 179.20                     |
| 13            | 3400         | 0.0058            | 0.4490                         | 189.60                     |
| 14            | 3500         | 0.0051            | 0.4502                         | 194.20                     |
| 14            | 3600         | 0.0064            | 0.4512                         | 187.40                     |
| 14            | 3700         | 0.0053            | 0.4520                         | 190.20                     |
| 14            | 3800         | 0.0049            | 0.4545                         | 194.90                     |
| 15            | 3900         | 0.0052            | 0.4546                         | 199.60                     |
| 15            | 4000         | 0.0054            | 0.4549                         | 200.40                     |

### Framework versions
- Transformers 4.36.0.dev0
- Pytorch 2.1.0+cu118
- Datasets 2.15.0
- Tokenizers 0.15.0