# Background

The [original Whisper models](https://cdn.openai.com/papers/whisper.pdf) were trained on 680,000 hours of audio collected from the Internet, using large-scale weak supervision.
Human-curated datasets such as Common Voice or FLEURS were not used during training. The hypothesis is that by fine-tuning Whisper
on human-curated datasets, quality can improve for a particular language or domain.

These fine-tuned Whisper models for the Catalan language were created in early 2023 as part of the Hugging Face Whisper Sprint.

The models were fine-tuned using [these scripts](https://github.com/huggingface/community-events/tree/main/whisper-fine-tuning-event).

For fine-tuning we used the Common Voice dataset, version 11.
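
For reference, this is roughly how the Common Voice 11 Catalan data is loaded with the Hugging Face `datasets` library (a minimal sketch; the split choice and resampling follow the fine-tuning event defaults and may differ from the exact training setup):

```python
# Minimal sketch: loading the Common Voice 11 Catalan split used for
# fine-tuning. Accessing this dataset requires accepting its terms on the
# Hugging Face Hub first.
from datasets import load_dataset, Audio

common_voice = load_dataset(
    "mozilla-foundation/common_voice_11_0", "ca", split="train+validation"
)
# Whisper's feature extractor expects 16 kHz audio.
common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16_000))
print(common_voice[0]["sentence"])
```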

# Learnings on fine-tuning Whisper models

The goal of the rest of this document is simply to share knowledge in the spirit that it can benefit others. Things are shared as they are.

**1. The model improves when benchmarked against Common Voice**

The model improves on the WER evaluation metric when evaluated against the Common Voice test dataset. Taking the [small](https://huggingface.co/softcatala/whisper-small-ca) fine-tuned model as an example, the final WER is 8.5, starting from a WER of 13.
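
As an illustration, WER can be computed with the Hugging Face `evaluate` library (a minimal sketch; the Catalan sentence pairs are made-up examples, not real model output):

```python
# Minimal sketch of the WER metric used throughout this document.
import evaluate

wer_metric = evaluate.load("wer")

references = ["bon dia a tothom", "com esteu avui"]
predictions = ["bon dia a tothom", "com estàs avui"]

# WER = (substitutions + deletions + insertions) / reference word count.
wer = 100 * wer_metric.compute(predictions=predictions, references=references)
print(f"WER: {wer:.2f}")  # 1 substitution over 7 reference words -> 14.29
```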

**2. The model degrades according to human evaluation**

In human evaluation, the results for the fine-tuned Catalan language models were disappointing. The fine-tuned models clearly perform worse than the original OpenAI models, as reported by all the users (half a dozen) that tested them.

Our hypothesis is that the evaluation on Common Voice gives better results because the model is overfitted to it and has lost generalization capabilities.

**3. The model degrades when evaluated on other datasets**

A more extensive evaluation shows:

| Dataset | base | sc-base | small | sc-small | medium | sc-medium |
| ----------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- |
| 15GdH9-curt | 55.60 | 70.40 | 36.62 | 78.75 | 22.39 | 49.53 |
| Ona_catalan-balear | 71.28 | 71.01 | 44.68 | 49.20 | 28.72 | 56.65 |
| Son_Goku_catalan_valencian_voice | 51.90 | 85.44 | 39.87 | 65.19 | 18.99 | 71.52 |
| Universal_Declaration_of_Human_Rights | 47.12 | 36.45 | 39.14 | 75.59 | 44.37 | 27.79 |

As you can see, the fine-tuned (sc-) models score worse (higher WER) than the corresponding OpenAI models on most of these datasets.

Legend:
* "sc-" indicates a Softcatalà fine-tuned model
* The scores are WER metrics (lower is better)

**4. Whisper clients provide different quality for fine-tuned models**

Summary as of March 2023:

**a**. The OpenAI Whisper implementation does not support out-of-the-box inference on fine-tuned models, only on OpenAI models.

**b**. The HuggingFace Whisper implementation performs poorly. This can be really misleading when doing evaluations, since HuggingFace is the stack used for fine-tuning.

In our experiments:

| Whisper Client | WER |
| ----------- | ----------- |
| OpenAI | 27.32 |
| Whisper.cpp 1.2.1 | 38.89 |
| HuggingFace | 93.54 |
| CTranslate2 3.10.3 | 43.68 |

We strongly recommend using CTranslate2 as the inference client.
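
For example, a fine-tuned model can be converted and run with CTranslate2 through the `faster-whisper` wrapper (a minimal sketch; the converted model directory and the audio file name are placeholders):

```python
# Minimal sketch: running a fine-tuned model with CTranslate2 via
# faster-whisper. First convert the Hugging Face checkpoint, e.g.:
#   ct2-transformers-converter --model softcatala/whisper-small-ca \
#       --output_dir whisper-small-ca-ct2 --copy_files tokenizer.json
from faster_whisper import WhisperModel

model = WhisperModel("whisper-small-ca-ct2")  # path to the converted model

# Segments are yielded lazily as the audio is decoded.
segments, info = model.transcribe("audio.wav", language="ca")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```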

**5. Fine-tuning degrades timestamp prediction**

Whisper uses timestamp tokens to indicate the timestamps of the transcribed text.

The training scripts available for fine-tuning do not generate timestamp tokens, so timestamp prediction degrades.

This is important since many people use Whisper models to create video subtitles, where timestamp prediction matters.
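
The degradation can be observed by requesting timestamps at inference time, for instance with the `transformers` ASR pipeline (a minimal sketch; the audio file is a placeholder), and comparing the chunks produced by an original and a fine-tuned checkpoint:

```python
# Minimal sketch: surfacing Whisper's timestamp predictions with the
# transformers ASR pipeline. Run it once with the OpenAI checkpoint and
# once with the fine-tuned one to compare the predicted timestamps.
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="softcatala/whisper-small-ca",
    chunk_length_s=30,  # needed for audio longer than 30 seconds
)

result = pipe("audio.wav", return_timestamps=True)
for chunk in result["chunks"]:
    start, end = chunk["timestamp"]
    print(f"[{start} -> {end}] {chunk['text']}")
```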

**Next steps**

There is the possibility that fine-tuning cannot improve Whisper, or can only do so in certain scenarios (domain adaptation). See [Nickolay Shmyrev](https://alphacephei.com/nsh/2023/01/15/whisper-finetuning.html)'s analysis and tests.

Potential next steps:

* The training scripts have to be improved to include timestamp tokens
* New trainings should include more corpora than just Common Voice (a mixing sketch follows this list)
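
A minimal sketch of the second bullet, mixing Common Voice with an additional corpus via `datasets.interleave_datasets` (the second dataset id, the shared column names, and the 70/30 mixing weights are all assumptions for illustration):

```python
# Minimal sketch: mixing Common Voice with a second Catalan corpus for a
# future training run. Both datasets must expose the same features before
# they can be interleaved.
from datasets import load_dataset, interleave_datasets, Audio

cv = load_dataset("mozilla-foundation/common_voice_11_0", "ca", split="train")
cv = cv.select_columns(["audio", "sentence"])

# Placeholder dataset id; substitute any speech corpus with audio +
# transcription columns renamed to match Common Voice.
extra = load_dataset("some-org/catalan-speech-corpus", split="train")
extra = extra.select_columns(["audio", "sentence"])

# Resample both corpora to the 16 kHz Whisper expects, then mix 70/30.
cv = cv.cast_column("audio", Audio(sampling_rate=16_000))
extra = extra.cast_column("audio", Audio(sampling_rate=16_000))
mixed = interleave_datasets([cv, extra], probabilities=[0.7, 0.3], seed=42)
```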