---
language:
- it
license: apache-2.0
base_model: openai/whisper-small
tags:
- hf-asr-leaderboard
- generated_from_trainer
metrics:
- wer
model-index:
- name: Whisper Small IT
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: mozilla-foundation/common_voice_11_0
      type: mozilla-foundation/common_voice_11_0
      args: default
    metrics:
    - name: Wer
      type: wer
      value: 200.40
datasets:
- mozilla-foundation/common_voice_11_0
---

# Whisper Small - Italian

This model is a fine-tuned version of [openai/whisper-small](https://huggingface.co/openai/whisper-small)
on the [Common Voice 11.0 dataset](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0).
It achieves the following results on the evaluation set:
- Loss: 0.4549
- Wer: 200.40

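A WER above 100 is possible because insertions count as errors: WER = (S + D + I) / N, where N is the number of reference words. The following is a minimal, self-contained sketch of the metric, not the evaluation code used for this model (which presumably relied on a standard implementation such as `jiwer` or the `evaluate` library):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent: (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must be non-empty")
    # Levenshtein distance over words, computed with a single rolling row.
    d = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, d[0] = d[0], i
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            prev, d[j] = d[j], min(d[j] + 1,      # deletion
                                   d[j - 1] + 1,  # insertion
                                   prev + cost)   # substitution or match
    return 100.0 * d[len(hyp)] / len(ref)

print(wer("ciao", "ciao ciao ciao"))  # 200.0 — two insertions against one reference word
```

Decoding loops that repeat words (a known failure mode of fine-tuned Whisper checkpoints) inflate the insertion count, which is one way the evaluation WER can end up near 200.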
## Model description

Whisper is a pre-trained model for automatic speech recognition (ASR)
published in [September 2022](https://openai.com/blog/whisper/) by the authors
Alec Radford et al. from OpenAI. Unlike many of its predecessors, such as
[Wav2Vec 2.0](https://arxiv.org/abs/2006.11477), which are pre-trained
on unlabelled audio data, Whisper is pre-trained on a vast quantity of
**labelled** audio-transcription data, 680,000 hours to be precise.
This is an order of magnitude more data than the unlabelled audio data used
to train Wav2Vec 2.0 (60,000 hours). What is more, 117,000 hours of this
pre-training data is multilingual ASR data. This results in checkpoints
that can be applied to over 96 languages, many of which are considered
_low-resource_.

When scaled to 680,000 hours of labelled pre-training data, Whisper models
demonstrate a strong ability to generalise to many datasets and domains.
The pre-trained checkpoints achieve results competitive with state-of-the-art
ASR systems, with near 3% word error rate (WER) on the test-clean subset of
LibriSpeech ASR and a new state-of-the-art on TED-LIUM with 4.7% WER (_c.f._
Table 8 of the [Whisper paper](https://cdn.openai.com/papers/whisper.pdf)).
The extensive multilingual ASR knowledge acquired by Whisper during pre-training
can be leveraged for other low-resource languages; through fine-tuning, the
pre-trained checkpoints can be adapted for specific datasets and languages
to further improve upon these results.

## Intended uses & limitations

The goals of this fine-tuned model are to experiment and to allow the authors
to gain skills and knowledge of how this process is carried out. The model
serves as the basis for a small [gradio-hosted](here) application
that transcribes recordings and audio files in Italian. The application also
allows users to insert a YouTube link to an Italian video and obtain a transcription.

The limitations of this project mainly concern the limited resources available
to fine-tune the model, namely the free version of Google Colab and a Google Drive
used as feature storage, which had limited space. The time dedicated to the
project was also limited, as it had to fit within academic deadlines.

## Training and evaluation data

The training was carried out on the Google Colab platform. The evaluation data,
like the rest of the dataset, was taken from the [Common Voice 11.0 dataset](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0);
the dataset was reduced to 10% of its original size to keep training time manageable.

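The 10% reduction can be done in several ways; with the Hugging Face `datasets` library, a split slice such as `split="train[:10%]"` is one option. A library-free, seeded sketch of the same idea (illustrative only — the exact subsetting procedure used here is not recorded):

```python
import random

def subsample(items, fraction=0.10, seed=42):
    """Return a reproducible random subset containing `fraction` of `items`."""
    items = list(items)
    k = max(1, int(len(items) * fraction))
    return random.Random(seed).sample(items, k)

subset = subsample(range(1000))  # 100 items, identical on every run with the same seed
```

Seeding the sampler keeps the train/eval subsets stable across the many Colab sessions described below.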
## Training procedure

The training was conducted on Google Colab, using a Jupyter notebook to write the code and document the training. Google Drive was used as a feature store.
Due to the limited resources of the free version of Google Colab, checkpointing was used to save partial results and resume training in a
subsequent run. The notebook was run 15 times, at approximately 40 minutes per 100 training steps, for a total of 26.5 hours of training.
Keep in mind that Google Colab was available to us for no more than 4 hours a day, so around 7 days were necessary for the training alone.

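With Hugging Face Transformers, this resume pattern typically amounts to calling `trainer.train(resume_from_checkpoint=True)` with `output_dir` pointing at the mounted Drive. The general save-and-resume loop can be sketched library-free (file name and step counts are illustrative, not the actual training code):

```python
import json
import os

def run_steps(total_steps, steps_per_session, ckpt_path="checkpoint.json"):
    """Run at most `steps_per_session` training steps, resuming from a saved step count.

    On Colab, `ckpt_path` would live on the mounted Google Drive so it
    survives the runtime being reclaimed at the end of a session.
    """
    state = {"step": 0}
    if os.path.exists(ckpt_path):            # resume from a previous session
        with open(ckpt_path) as f:
            state = json.load(f)
    end = min(total_steps, state["step"] + steps_per_session)
    for step in range(state["step"], end):
        pass                                 # one training step would go here
    state["step"] = end
    with open(ckpt_path, "w") as f:          # checkpoint for the next session
        json.dump(state, f)
    return state["step"]
```

Calling `run_steps(4000, 300)` repeatedly advances the saved step count by up to 300 per session until the 4,000-step budget is exhausted, mirroring the 15-session schedule above.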
### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 16
- eval_batch_size: 8
- training_steps: 4000
- gradient_accumulation_steps: 2
- save_steps: 100
- eval_steps: 100

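Note that with gradient accumulation, the effective optimizer batch size is `train_batch_size * gradient_accumulation_steps`. Assuming each training step is one optimizer update, the totals work out as follows:

```python
train_batch_size = 16
gradient_accumulation_steps = 2
training_steps = 4000

# Each optimizer update accumulates gradients over two forward/backward passes.
effective_batch_size = train_batch_size * gradient_accumulation_steps  # 32

# Total training examples consumed over the full run (with repetition possible).
examples_seen = effective_batch_size * training_steps  # 128000
```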
### Training results

| Run Number | Step | Training Loss | Validation Loss | Wer    |
|:----------:|:----:|:-------------:|:---------------:|:------:|
| 1          | 100  | 1.2396        | 1.2330          | 176.40 |
| 2          | 200  | 0.7389        | 0.8331          | 80.49  |
| 2          | 300  | 0.2951        | 0.4261          | 70.20  |
| 2          | 400  | 0.2703        | 0.4051          | 101.60 |
| 3          | 500  | 0.2491        | 0.3923          | 112.20 |
| 3          | 600  | 0.1700        | 0.3860          | 107.10 |
| 3          | 700  | 0.1603        | 0.3836          | 90.36  |
| 4          | 800  | 0.1607        | 0.3786          | 135.00 |
| 4          | 900  | 0.1540        | 0.3783          | 99.05  |
| 4          | 1000 | 0.1562        | 0.3667          | 98.32  |
| 4          | 1100 | 0.0723        | 0.3757          | 158.90 |
| 5          | 1200 | 0.0769        | 0.3789          | 215.20 |
| 5          | 1300 | 0.0814        | 0.3779          | 170.50 |
| 5          | 1400 | 0.0786        | 0.3770          | 140.60 |
| 5          | 1500 | 0.0673        | 0.3777          | 137.10 |
| 6          | 1600 | 0.0339        | 0.3892          | 166.50 |
| 7          | 1700 | 0.0324        | 0.3963          | 170.90 |
| 7          | 1800 | 0.0348        | 0.4004          | 163.40 |
| 8          | 1900 | 0.0345        | 0.4016          | 158.60 |
| 8          | 2000 | 0.0346        | 0.4020          | 176.10 |
| 8          | 2100 | 0.0317        | 0.4001          | 134.70 |
| 9          | 2200 | 0.0173        | 0.4141          | 189.30 |
| 9          | 2300 | 0.0174        | 0.4106          | 175.00 |
| 9          | 2400 | 0.0165        | 0.4204          | 179.60 |
| 10         | 2500 | 0.0172        | 0.4185          | 186.10 |
| 10         | 2600 | 0.0142        | 0.4175          | 181.10 |
| 11         | 2700 | 0.0090        | 0.4325          | 161.70 |
| 11         | 2800 | 0.0069        | 0.4362          | 161.20 |
| 11         | 2900 | 0.0093        | 0.4342          | 157.50 |
| 12         | 3000 | 0.0076        | 0.4352          | 154.50 |
| 12         | 3100 | 0.0089        | 0.4394          | 184.30 |
| 13         | 3200 | 0.0063        | 0.4454          | 166.00 |
| 13         | 3300 | 0.0059        | 0.4476          | 179.20 |
| 13         | 3400 | 0.0058        | 0.4490          | 189.60 |
| 14         | 3500 | 0.0051        | 0.4502          | 194.20 |
| 14         | 3600 | 0.0064        | 0.4512          | 187.40 |
| 14         | 3700 | 0.0053        | 0.4520          | 190.20 |
| 14         | 3800 | 0.0049        | 0.4545          | 194.90 |
| 15         | 3900 | 0.0052        | 0.4546          | 199.60 |
| 15         | 4000 | 0.0054        | 0.4549          | 200.40 |

### Framework versions

- Transformers 4.36.0.dev0
- Pytorch 2.1.0+cu118
- Datasets 2.15.0
- Tokenizers 0.15.0