---
language:
- it
license: apache-2.0
base_model: openai/whisper-small
tags:
- hf-asr-leaderboard
- generated_from_trainer
metrics:
- wer
model-index:
- name: Whisper Small IT
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: mozilla-foundation/common_voice_11_0
      type: mozilla-foundation/common_voice_11_0
      args: default
    metrics:
    - name: Wer
      type: wer
      value: 200.40
datasets:
- mozilla-foundation/common_voice_11_0
---

# Whisper Small - Italian

This model is a fine-tuned version of [openai/whisper-small](https://huggingface.co/openai/whisper-small)
on the [Common Voice 11.0 dataset](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0).
It achieves the following results on the evaluation set:
- Loss: 0.4549
- Wer: 200.40

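A WER above 100 is possible because insertions count as errors: WER = (S + D + I) / N, where N is the number of reference words. The following is a minimal, self-contained sketch of the metric, not the evaluation code used for this model (which presumably relied on a standard implementation such as `jiwer` or the `evaluate` library):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent: (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must be non-empty")
    # Levenshtein distance over words, computed with a single rolling row.
    d = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, d[0] = d[0], i
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            prev, d[j] = d[j], min(d[j] + 1,      # deletion
                                   d[j - 1] + 1,  # insertion
                                   prev + cost)   # substitution or match
    return 100.0 * d[len(hyp)] / len(ref)

print(wer("ciao", "ciao ciao ciao"))  # 200.0 — two insertions against one reference word
```

Decoding loops that repeat words (a known failure mode of fine-tuned Whisper checkpoints) inflate the insertion count, which is one way the evaluation WER can end up near 200.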
## Model description

Whisper is a pre-trained model for automatic speech recognition (ASR)
published in [September 2022](https://openai.com/blog/whisper/) by the authors
Alec Radford et al. from OpenAI. Unlike many of its predecessors, such as
[Wav2Vec 2.0](https://arxiv.org/abs/2006.11477), which are pre-trained
on unlabelled audio data, Whisper is pre-trained on a vast quantity of
**labelled** audio-transcription data, 680,000 hours to be precise.
This is an order of magnitude more data than the unlabelled audio data used
to train Wav2Vec 2.0 (60,000 hours). What is more, 117,000 hours of this
pre-training data is multilingual ASR data. This results in checkpoints
that can be applied to over 96 languages, many of which are considered
_low-resource_.

When scaled to 680,000 hours of labelled pre-training data, Whisper models
demonstrate a strong ability to generalise to many datasets and domains.
The pre-trained checkpoints achieve results competitive with state-of-the-art
ASR systems, with near 3% word error rate (WER) on the test-clean subset of
LibriSpeech ASR and a new state-of-the-art on TED-LIUM with 4.7% WER (_c.f._
Table 8 of the [Whisper paper](https://cdn.openai.com/papers/whisper.pdf)).
The extensive multilingual ASR knowledge acquired by Whisper during pre-training
can be leveraged for other low-resource languages; through fine-tuning, the
pre-trained checkpoints can be adapted for specific datasets and languages
to further improve upon these results.

## Intended uses & limitations

The goals of this fine-tuned model are to experiment and to allow the authors
to gain skills and knowledge of how this process is carried out. The model
serves as the basis for a small [gradio-hosted](here) application
that transcribes recordings and audio files in Italian. The application also
allows users to insert a YouTube link to an Italian video and obtain a transcription.

The limitations of this project mainly concern the limited resources available
to fine-tune the model, namely the free version of Google Colab and a Google Drive
used as feature storage, which had limited space. The time dedicated to the
project was also limited, as it had to fit within academic deadlines.

## Training and evaluation data

The training was carried out on the Google Colab platform. The evaluation data,
like the rest of the dataset, was taken from the [Common Voice 11.0 dataset](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0);
the dataset was reduced to 10% of its original size to keep training time manageable.

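The 10% reduction can be done in several ways; with the Hugging Face `datasets` library, a split slice such as `split="train[:10%]"` is one option. A library-free, seeded sketch of the same idea (illustrative only — the exact subsetting procedure used here is not recorded):

```python
import random

def subsample(items, fraction=0.10, seed=42):
    """Return a reproducible random subset containing `fraction` of `items`."""
    items = list(items)
    k = max(1, int(len(items) * fraction))
    return random.Random(seed).sample(items, k)

subset = subsample(range(1000))  # 100 items, identical on every run with the same seed
```

Seeding the sampler keeps the train/eval subsets stable across the many Colab sessions described below.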
## Training procedure

The training was conducted on Google Colab, using a Jupyter notebook to write the code and document the training. Google Drive was used as a feature store.
Due to the limited resources of the free version of Google Colab, checkpointing was used to save partial results and resume training in a
subsequent run. The notebook was run 15 times, at approximately 40 minutes per 100 training steps, for a total of 26.5 hours of training.
Keep in mind that Google Colab was available to us for no more than 4 hours a day, so around 7 days were necessary for the training alone.

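With Hugging Face Transformers, this resume pattern typically amounts to calling `trainer.train(resume_from_checkpoint=True)` with `output_dir` pointing at the mounted Drive. The general save-and-resume loop can be sketched library-free (file name and step counts are illustrative, not the actual training code):

```python
import json
import os

def run_steps(total_steps, steps_per_session, ckpt_path="checkpoint.json"):
    """Run at most `steps_per_session` training steps, resuming from a saved step count.

    On Colab, `ckpt_path` would live on the mounted Google Drive so it
    survives the runtime being reclaimed at the end of a session.
    """
    state = {"step": 0}
    if os.path.exists(ckpt_path):            # resume from a previous session
        with open(ckpt_path) as f:
            state = json.load(f)
    end = min(total_steps, state["step"] + steps_per_session)
    for step in range(state["step"], end):
        pass                                 # one training step would go here
    state["step"] = end
    with open(ckpt_path, "w") as f:          # checkpoint for the next session
        json.dump(state, f)
    return state["step"]
```

Calling `run_steps(4000, 300)` repeatedly advances the saved step count by up to 300 per session until the 4,000-step budget is exhausted, mirroring the 15-session schedule above.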
### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 16
- eval_batch_size: 8
- training_steps: 4000
- gradient_accumulation_steps: 2
- save_steps: 100
- eval_steps: 100

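Note that with gradient accumulation, the effective optimizer batch size is `train_batch_size * gradient_accumulation_steps`. Assuming each training step is one optimizer update, the totals work out as follows:

```python
train_batch_size = 16
gradient_accumulation_steps = 2
training_steps = 4000

# Each optimizer update accumulates gradients over two forward/backward passes.
effective_batch_size = train_batch_size * gradient_accumulation_steps  # 32

# Total training examples consumed over the full run (with repetition possible).
examples_seen = effective_batch_size * training_steps  # 128000
```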
### Training results

| Run Number | Step | Training Loss | Validation Loss | Wer    |
|:----------:|:----:|:-------------:|:---------------:|:------:|
| 1          | 100  | 1.2396        | 1.2330          | 176.40 |
| 2          | 200  | 0.7389        | 0.8331          | 80.49  |
| 2          | 300  | 0.2951        | 0.4261          | 70.20  |
| 2          | 400  | 0.2703        | 0.4051          | 101.60 |
| 3          | 500  | 0.2491        | 0.3923          | 112.20 |
| 3          | 600  | 0.1700        | 0.3860          | 107.10 |
| 3          | 700  | 0.1603        | 0.3836          | 90.36  |
| 4          | 800  | 0.1607        | 0.3786          | 135.00 |
| 4          | 900  | 0.1540        | 0.3783          | 99.05  |
| 4          | 1000 | 0.1562        | 0.3667          | 98.32  |
| 4          | 1100 | 0.0723        | 0.3757          | 158.90 |
| 5          | 1200 | 0.0769        | 0.3789          | 215.20 |
| 5          | 1300 | 0.0814        | 0.3779          | 170.50 |
| 5          | 1400 | 0.0786        | 0.3770          | 140.60 |
| 5          | 1500 | 0.0673        | 0.3777          | 137.10 |
| 6          | 1600 | 0.0339        | 0.3892          | 166.50 |
| 7          | 1700 | 0.0324        | 0.3963          | 170.90 |
| 7          | 1800 | 0.0348        | 0.4004          | 163.40 |
| 8          | 1900 | 0.0345        | 0.4016          | 158.60 |
| 8          | 2000 | 0.0346        | 0.4020          | 176.10 |
| 8          | 2100 | 0.0317        | 0.4001          | 134.70 |
| 9          | 2200 | 0.0173        | 0.4141          | 189.30 |
| 9          | 2300 | 0.0174        | 0.4106          | 175.00 |
| 9          | 2400 | 0.0165        | 0.4204          | 179.60 |
| 10         | 2500 | 0.0172        | 0.4185          | 186.10 |
| 10         | 2600 | 0.0142        | 0.4175          | 181.10 |
| 11         | 2700 | 0.0090        | 0.4325          | 161.70 |
| 11         | 2800 | 0.0069        | 0.4362          | 161.20 |
| 11         | 2900 | 0.0093        | 0.4342          | 157.50 |
| 12         | 3000 | 0.0076        | 0.4352          | 154.50 |
| 12         | 3100 | 0.0089        | 0.4394          | 184.30 |
| 13         | 3200 | 0.0063        | 0.4454          | 166.00 |
| 13         | 3300 | 0.0059        | 0.4476          | 179.20 |
| 13         | 3400 | 0.0058        | 0.4490          | 189.60 |
| 14         | 3500 | 0.0051        | 0.4502          | 194.20 |
| 14         | 3600 | 0.0064        | 0.4512          | 187.40 |
| 14         | 3700 | 0.0053        | 0.4520          | 190.20 |
| 14         | 3800 | 0.0049        | 0.4545          | 194.90 |
| 15         | 3900 | 0.0052        | 0.4546          | 199.60 |
| 15         | 4000 | 0.0054        | 0.4549          | 200.40 |

### Framework versions

- Transformers 4.36.0.dev0
- Pytorch 2.1.0+cu118
- Datasets 2.15.0
- Tokenizers 0.15.0