Silemo committed a49cdd1 (1 parent: 3c397da)

Update README adding training results

Files changed (1): README.md (+149, −0)

---
language:
- it
license: apache-2.0
base_model: openai/whisper-small
tags:
- hf-asr-leaderboard
- generated_from_trainer
metrics:
- wer
model-index:
- name: Whisper Small IT
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: mozilla-foundation/common_voice_11_0
      type: mozilla-foundation/common_voice_11_0
    metrics:
    - name: Wer
      type: wer
      value: 200.40
datasets:
- mozilla-foundation/common_voice_11_0
---

# Whisper Small - Italian

This model is a fine-tuned version of [openai/whisper-small](https://huggingface.co/openai/whisper-small)
on the [Common Voice 11.0 dataset](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0).
It achieves the following results on the evaluation set:
- Loss: 0.4549
- WER: 200.40
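
As a quick usage reference, the model can be loaded through the 🤗 Transformers `pipeline`. A minimal sketch, assuming a placeholder repository id (this card does not state the actual one):

```python
# Minimal usage sketch. Assumptions: transformers is installed, and
# "Silemo/whisper-small-it" is a placeholder for this model's repo id.
from transformers import pipeline

transcriber = pipeline(
    "automatic-speech-recognition",
    model="Silemo/whisper-small-it",  # placeholder repo id
    generate_kwargs={"language": "italian", "task": "transcribe"},
)

# Transcribe a local Italian audio file.
print(transcriber("sample_it.wav")["text"])
```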

## Model description

Whisper is a pre-trained model for automatic speech recognition (ASR)
published in [September 2022](https://openai.com/blog/whisper/) by
Alec Radford et al. from OpenAI. Unlike many of its predecessors, such as
[Wav2Vec 2.0](https://arxiv.org/abs/2006.11477), which are pre-trained
on unlabelled audio data, Whisper is pre-trained on a vast quantity of
**labelled** audio-transcription data, 680,000 hours to be precise.
This is an order of magnitude more data than the unlabelled audio used
to train Wav2Vec 2.0 (60,000 hours). What is more, 117,000 hours of this
pre-training data is multilingual ASR data. This results in checkpoints
that can be applied to over 96 languages, many of which are considered
_low-resource_.

When scaled to 680,000 hours of labelled pre-training data, Whisper models
demonstrate a strong ability to generalise to many datasets and domains.
The pre-trained checkpoints achieve results competitive with state-of-the-art
ASR systems, with near 3% word error rate (WER) on the test-clean subset of
LibriSpeech ASR and a new state of the art on TED-LIUM with 4.7% WER (_c.f._
Table 8 of the [Whisper paper](https://cdn.openai.com/papers/whisper.pdf)).
The extensive multilingual ASR knowledge acquired by Whisper during pre-training
can be leveraged for other low-resource languages; through fine-tuning, the
pre-trained checkpoints can be adapted for specific datasets and languages
to further improve upon these results.

## Intended uses & limitations

The goals of this fine-tuned model are experimental: the project allowed the authors to
gain skills and knowledge of how the fine-tuning process is carried out. The model
serves as the basis for a small Gradio-hosted application that transcribes recordings
and audio files in Italian; the application also accepts a link to an Italian YouTube
video and produces a transcription. A minimal sketch of such an app is shown below.
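
This sketch assumes the `gradio` and `transformers` libraries and reuses the same placeholder model id as above; the YouTube-download step (e.g. via a tool like yt-dlp) is omitted. It is illustrative, not the authors' published application.

```python
# Illustrative Gradio app sketch; names and model id are placeholders.
import gradio as gr
from transformers import pipeline

transcriber = pipeline(
    "automatic-speech-recognition",
    model="Silemo/whisper-small-it",  # placeholder repo id
)

def transcribe(audio_path: str) -> str:
    # The pipeline accepts a path to an audio file and returns the transcript.
    return transcriber(audio_path)["text"]

demo = gr.Interface(
    fn=transcribe,
    inputs=gr.Audio(sources=["microphone", "upload"], type="filepath"),
    outputs="text",
    title="Whisper Small - Italian",
)

if __name__ == "__main__":
    demo.launch()
```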

The limitations of this project mainly concern the limited resources available
for fine-tuning, namely the free tier of Google Colab and a Google Drive account
with limited space used as feature storage. The time dedicated to the project
was also limited, as it had to fit within academic deadlines.

## Training and evaluation data

Training was carried out on the Google Colab platform. The evaluation data,
like the rest of the data used in this project, comes from the [Common Voice 11.0 dataset](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0);
only 10% of the original dataset was used, to keep training time manageable (see the loading sketch below).
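
One way such a 10% slice can be loaded with 🤗 Datasets is sketched below. The exact split names and percentages the authors used are assumptions, and the dataset is gated on the Hub, so authentication may be required:

```python
# Sketch: load a 10% slice of the Italian subset of Common Voice 11.0.
# The exact splits/percentages used by the authors are assumptions here.
from datasets import load_dataset

common_voice_train = load_dataset(
    "mozilla-foundation/common_voice_11_0", "it", split="train[:10%]"
)
common_voice_test = load_dataset(
    "mozilla-foundation/common_voice_11_0", "it", split="test[:10%]"
)
print(common_voice_train)
```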

## Training procedure

Training was conducted on Google Colab, using a Jupyter notebook to write the code and document the process, with Google Drive serving as the feature store.
Due to the limited resources of the free version of Google Colab, checkpointing was used to save partial results and resume training in a
subsequent run (a minimal sketch of this resume pattern follows below). The notebook was run 15 times, at approximately 40 minutes per 100 training steps, for a total of 26.5 hours of training.
Keep in mind that Google Colab was available to us for no more than 4 hours a day, so training alone took around 7 days.
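
A minimal sketch of the Drive-based checkpoint/resume pattern, assuming the code runs inside Colab; the path and directory names are illustrative, not taken from the authors' notebook:

```python
# Sketch of the Colab checkpoint/resume pattern described above.
from google.colab import drive

drive.mount("/content/drive")
output_dir = "/content/drive/MyDrive/whisper-small-it"  # illustrative path

# After building a Seq2SeqTrainer whose args.output_dir points at the
# directory above (see the hyperparameter sketch in the next subsection),
# training picks up from the latest saved checkpoint with:
# trainer.train(resume_from_checkpoint=True)
```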

### Training hyperparameters

The following hyperparameters were used during training (a sketch mapping them onto `Seq2SeqTrainingArguments` follows the list):
- learning_rate: 1e-05
- train_batch_size: 16
- eval_batch_size: 8
- training_steps: 4000
- gradient_accumulation_steps: 2
- save_steps: 100
- eval_steps: 100
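
The sketch below shows how these values might map onto `Seq2SeqTrainingArguments`; the output directory and any flag not listed in this card are assumptions:

```python
# Sketch only: batch sizes are assumed to be per-device, and
# "training_steps" from the list above is assumed to mean max_steps.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-it",   # assumption: not stated in the card
    learning_rate=1e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=8,
    max_steps=4000,
    gradient_accumulation_steps=2,
    save_steps=100,
    eval_steps=100,
    evaluation_strategy="steps",       # assumption: implied by eval_steps
    predict_with_generate=True,        # assumption: typical for Whisper fine-tuning
)
```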

### Training results

| Run Number | Step | Training Loss | Validation Loss | WER    |
|:----------:|:----:|:-------------:|:---------------:|:------:|
| 1          | 100  | 1.2396        | 1.2330          | 176.40 |
| 2          | 200  | 0.7389        | 0.8331          | 80.49  |
| 2          | 300  | 0.2951        | 0.4261          | 70.20  |
| 2          | 400  | 0.2703        | 0.4051          | 101.60 |
| 3          | 500  | 0.2491        | 0.3923          | 112.20 |
| 3          | 600  | 0.1700        | 0.3860          | 107.10 |
| 3          | 700  | 0.1603        | 0.3836          | 90.36  |
| 4          | 800  | 0.1607        | 0.3786          | 135.00 |
| 4          | 900  | 0.1540        | 0.3783          | 99.05  |
| 4          | 1000 | 0.1562        | 0.3667          | 98.32  |
| 4          | 1100 | 0.0723        | 0.3757          | 158.90 |
| 5          | 1200 | 0.0769        | 0.3789          | 215.20 |
| 5          | 1300 | 0.0814        | 0.3779          | 170.50 |
| 5          | 1400 | 0.0786        | 0.3770          | 140.60 |
| 5          | 1500 | 0.0673        | 0.3777          | 137.10 |
| 6          | 1600 | 0.0339        | 0.3892          | 166.50 |
| 7          | 1700 | 0.0324        | 0.3963          | 170.90 |
| 7          | 1800 | 0.0348        | 0.4004          | 163.40 |
| 8          | 1900 | 0.0345        | 0.4016          | 158.60 |
| 8          | 2000 | 0.0346        | 0.4020          | 176.10 |
| 8          | 2100 | 0.0317        | 0.4001          | 134.70 |
| 9          | 2200 | 0.0173        | 0.4141          | 189.30 |
| 9          | 2300 | 0.0174        | 0.4106          | 175.00 |
| 9          | 2400 | 0.0165        | 0.4204          | 179.60 |
| 10         | 2500 | 0.0172        | 0.4185          | 186.10 |
| 10         | 2600 | 0.0142        | 0.4175          | 181.10 |
| 11         | 2700 | 0.0090        | 0.4325          | 161.70 |
| 11         | 2800 | 0.0069        | 0.4362          | 161.20 |
| 11         | 2900 | 0.0093        | 0.4342          | 157.50 |
| 12         | 3000 | 0.0076        | 0.4352          | 154.50 |
| 12         | 3100 | 0.0089        | 0.4394          | 184.30 |
| 13         | 3200 | 0.0063        | 0.4454          | 166.00 |
| 13         | 3300 | 0.0059        | 0.4476          | 179.20 |
| 13         | 3400 | 0.0058        | 0.4490          | 189.60 |
| 14         | 3500 | 0.0051        | 0.4502          | 194.20 |
| 14         | 3600 | 0.0064        | 0.4512          | 187.40 |
| 14         | 3700 | 0.0053        | 0.4520          | 190.20 |
| 14         | 3800 | 0.0049        | 0.4545          | 194.90 |
| 15         | 3900 | 0.0052        | 0.4546          | 199.60 |
| 15         | 4000 | 0.0054        | 0.4549          | 200.40 |

### Framework versions

- Transformers 4.36.0.dev0
- PyTorch 2.1.0+cu118
- Datasets 2.15.0
- Tokenizers 0.15.0