|
# Background |
|
|
|
The [original Whisper models](https://cdn.openai.com/papers/whisper.pdf) were trained on 680,000 hours of audio collected from the Internet, using large-scale weak supervision.
|
Human-curated datasets like Common Voice or FLEURS were not used during training. The hypothesis is that by fine-tuning Whisper on human-curated datasets, quality can improve for a given language or a particular domain.
|
|
|
These fine-tuned Whisper models for the Catalan language were created in early 2023 as part of the Hugging Face Whisper Sprint.
|
|
|
The models were fine-tuned using [these scripts](https://github.com/huggingface/community-events/tree/main/whisper-fine-tuning-event).
|
|
|
For fine-tuning we used version 11 of the Common Voice dataset.
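
For reference, here is a minimal sketch of how the Catalan split of Common Voice 11 can be loaded with the Hugging Face `datasets` library (the dataset is gated, so its terms must be accepted on the Hub first; the preprocessing is simplified compared to the actual training scripts):

```python
# Minimal sketch: load the Catalan split of Common Voice 11 and resample
# the audio to the 16 kHz rate Whisper expects. Simplified compared to the
# actual fine-tuning scripts.
from datasets import load_dataset, Audio

common_voice = load_dataset(
    "mozilla-foundation/common_voice_11_0", "ca", split="train"
)
common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16_000))

sample = common_voice[0]
print(sample["sentence"])              # reference transcription
print(sample["audio"]["array"].shape)  # raw waveform resampled to 16 kHz
```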
|
|
|
# Learnings on fine-tuning Whisper models |
|
|
|
The goal of the rest of this document is to share knowledge in the spirit that it can benefit others. Things are shared as they are.
|
|
|
**1. Model improves when benchmarked against Common Voice**
|
|
|
The model improves on the WER evaluation metric when evaluated against the Common Voice test set. Taking the [small](https://huggingface.co/softcatala/whisper-small-ca) fine-tuned model as an example, the final WER is 8.5, down from a starting WER of 13.
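
For context, WER can be computed with the Hugging Face `evaluate` library; a minimal sketch, with illustrative strings rather than real evaluation data:

```python
# Minimal sketch: compute WER between reference transcriptions and model
# output. The example strings are illustrative only.
import evaluate

wer_metric = evaluate.load("wer")

references = ["bon dia a tothom"]
predictions = ["bon dia a tots"]

wer = 100 * wer_metric.compute(references=references, predictions=predictions)
print(f"WER: {wer:.1f}")  # 25.0: one substitution over four reference words
```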
|
|
|
**2. Model degrades according to human evaluation** |
|
|
|
When doing human evaluation, the results for the fine-tuned Catalan language models were disappointing. The fine-tuned models clearly perform worse than the original OpenAI models, as reported by all the users (half a dozen) who tested them.
|
|
|
Our hypothesis is that the evaluation on Common Voice gives better results because the model is overfitted to Common Voice and has lost generalization capability.
|
|
|
**3. Model degrades according to evaluation with other datasets**
|
|
|
Results of an evaluation with other datasets:
|
|
|
| | base | sc-base | small | sc-small | medium | sc-medium |
| ----------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- |
| 15GdH9-curt | 55.60 | 70.40 | 36.62 | 78.75 | 22.39 | 49.53 |
| Ona_catalan-balear | 71.28 | 71.01 | 44.68 | 49.20 | 28.72 | 56.65 |
| Son_Goku_catalan_valencian_voice | 51.90 | 85.44 | 39.87 | 65.19 | 18.99 | 71.52 |
| Universal_Declaration_of_Human_Rights | 47.12 | 36.45 | 39.14 | 75.59 | 44.37 | 27.79 |
|
|
|
As you can see, the fine-tuned models perform worse than the OpenAI models in most scenarios.
|
|
|
Legend: |
|
* "sc-" Indicates Softcatalà fine-tuned model |
|
* The scores are WER metrics |
|
|
|
|
|
**4. Whisper inference clients provide different quality with fine-tuned models**
|
|
|
Summary as of March 2023:
|
|
|
**a**. The OpenAI Whisper implementation does not support out-of-the-box inference on fine-tuned models, only on OpenAI's own models.
|
|
|
**b**. The HuggingFace Whisper implementation performs poorly. This can be really misleading when doing evaluations, since HuggingFace is the stack used for fine-tuning.
|
|
|
**c**. We have only been able to use the models reliably with the Whisper.cpp and CTranslate2 inference clients.
|
|
|
See how different clients can impact the WER when running inference on the same file:
|
|
|
| Whisper Client | WER |
| ----------- | ----------- |
| OpenAI | 27.32 |
| Whisper.cpp 1.2.1 | 38.89 |
| HuggingFace | 69.63 |
| CTranslate2 3.10.3 | 28.08 |
|
|
|
We strongly recommend using CTranslate2 as the inference client.
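
As an illustration, here is a hedged sketch of that setup using the `faster-whisper` wrapper around CTranslate2; the model directory and audio file names are placeholders, and the conversion step uses CTranslate2's standard converter tool:

```python
# Hedged sketch: run a fine-tuned model through CTranslate2 via faster-whisper.
# The Hugging Face checkpoint is first converted with CTranslate2's converter:
#   ct2-transformers-converter --model softcatala/whisper-small-ca \
#       --output_dir whisper-small-ca-ct2
from faster_whisper import WhisperModel

model = WhisperModel("whisper-small-ca-ct2")  # path to the converted model

segments, info = model.transcribe("audio.wav", language="ca")
for segment in segments:
    print(f"[{segment.start:.2f} -> {segment.end:.2f}] {segment.text}")
```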
|
|
|
**5. Fine-tuning degrades timestamp prediction**
|
|
|
Whisper uses special timestamp tokens to mark the timing of the transcribed text.
|
|
|
The training scripts available for fine-tuning do not generate timestamp tokens, and as a result timestamp prediction degrades.
|
|
|
This is important since many people use Whisper models to create video subtitles, where timestamp prediction matters.
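
For illustration, timestamp prediction can be exercised through the Hugging Face pipeline; a sketch assuming a reasonably recent `transformers` version and a placeholder audio file:

```python
# Sketch: request timestamps from the Hugging Face pipeline; this is the
# output that degrades when fine-tuning drops the timestamp tokens.
# "audio.wav" is a placeholder.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",
    chunk_length_s=30,
)
result = asr("audio.wav", return_timestamps=True)
for chunk in result["chunks"]:
    print(chunk["timestamp"], chunk["text"])
```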
|
|
|
|
|
# Next steps
|
|
|
It is possible that fine-tuning cannot improve Whisper, or can do so only in certain scenarios (e.g. domain adaptation). See [Nickolay Shmyrev](https://alphacephei.com/nsh/2023/01/15/whisper-finetuning.html)'s view and tests.
|
|
|
Potential next steps: |
|
|
|
* The training scripts should be improved to include timestamp tokens

* New trainings should include more corpora than just Common Voice
|
|
|
|