# Background

The [original Whisper models](https://cdn.openai.com/papers/whisper.pdf) were trained on 680,000 hours of audio collected from the Internet, using large-scale weak supervision.
Human-curated datasets such as Common Voice or FLEURS were not used during training. The hypothesis is that by fine-tuning Whisper
on human-curated datasets, quality can improve for a particular language or domain.

These fine-tuned Whisper models for the Catalan language were created in early 2023 as part of the Hugging Face Whisper Sprint.

The models were fine-tuned using [these scripts](https://github.com/huggingface/community-events/tree/main/whisper-fine-tuning-event).

For fine-tuning we used the Common Voice dataset, version 11.
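
For reference, this is roughly how the Common Voice 11 Catalan data is loaded with the Hugging Face `datasets` library (a minimal sketch; the split choice and resampling follow the fine-tuning event defaults and may differ from the exact training setup):

```python
# Minimal sketch: loading the Common Voice 11 Catalan split used for
# fine-tuning. Accessing this dataset requires accepting its terms on the
# Hugging Face Hub first.
from datasets import load_dataset, Audio

common_voice = load_dataset(
    "mozilla-foundation/common_voice_11_0", "ca", split="train+validation"
)
# Whisper's feature extractor expects 16 kHz audio.
common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16_000))
print(common_voice[0]["sentence"])
```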

# Learnings on fine-tuning Whisper models

The goal of the rest of this document is simply to share knowledge in the spirit that it can benefit others. Things are shared as they are.

**1. The model improves when benchmarked against Common Voice**

The model improves on the WER evaluation metric when evaluated against the Common Voice test dataset. Taking the [small](https://huggingface.co/softcatala/whisper-small-ca) fine-tuned model as an example, the final WER is 8.5, starting from a WER of 13.
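
As an illustration, WER can be computed with the Hugging Face `evaluate` library (a minimal sketch; the Catalan sentence pairs are made-up examples, not real model output):

```python
# Minimal sketch of the WER metric used throughout this document.
import evaluate

wer_metric = evaluate.load("wer")

references = ["bon dia a tothom", "com esteu avui"]
predictions = ["bon dia a tothom", "com estàs avui"]

# WER = (substitutions + deletions + insertions) / reference word count.
wer = 100 * wer_metric.compute(predictions=predictions, references=references)
print(f"WER: {wer:.2f}")  # 1 substitution over 7 reference words -> 14.29
```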

**2. The model degrades according to human evaluation**

In human evaluation, the results for the fine-tuned Catalan language models were disappointing. The fine-tuned models clearly perform worse than the original OpenAI models, as reported by all the users (half a dozen) that tested them.

Our hypothesis is that the evaluation on Common Voice gives better results because the model is overfitted to it and has lost generalization capabilities.

**3. The model degrades when evaluated on other datasets**

A more extensive evaluation shows:

| Dataset | base | sc-base | small | sc-small | medium | sc-medium |
| ----------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- |
| 15GdH9-curt | 55.60 | 70.40 | 36.62 | 78.75 | 22.39 | 49.53 |
| Ona_catalan-balear | 71.28 | 71.01 | 44.68 | 49.20 | 28.72 | 56.65 |
| Son_Goku_catalan_valencian_voice | 51.90 | 85.44 | 39.87 | 65.19 | 18.99 | 71.52 |
| Universal_Declaration_of_Human_Rights | 47.12 | 36.45 | 39.14 | 75.59 | 44.37 | 27.79 |

As you can see, the fine-tuned (sc-) models score worse (higher WER) than the corresponding OpenAI models on most of these datasets.

Legend:
* "sc-" indicates a Softcatalà fine-tuned model
* The scores are WER metrics (lower is better)

**4. Whisper clients provide different quality for fine-tuned models**

Summary as of March 2023:

**a**. The OpenAI Whisper implementation does not support out-of-the-box inference on fine-tuned models, only on OpenAI models.

**b**. The HuggingFace Whisper implementation performs poorly. This can be really misleading when doing evaluations, since HuggingFace is the stack used for fine-tuning.

In our experiments:

| Whisper Client | WER |
| ----------- | ----------- |
| OpenAI | 27.32 |
| Whisper.cpp 1.2.1 | 38.89 |
| HuggingFace | 93.54 |
| CTranslate2 3.10.3 | 43.68 |

We strongly recommend using CTranslate2 as the inference client.
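
For example, a fine-tuned model can be converted and run with CTranslate2 through the `faster-whisper` wrapper (a minimal sketch; the converted model directory and the audio file name are placeholders):

```python
# Minimal sketch: running a fine-tuned model with CTranslate2 via
# faster-whisper. First convert the Hugging Face checkpoint, e.g.:
#   ct2-transformers-converter --model softcatala/whisper-small-ca \
#       --output_dir whisper-small-ca-ct2 --copy_files tokenizer.json
from faster_whisper import WhisperModel

model = WhisperModel("whisper-small-ca-ct2")  # path to the converted model

# Segments are yielded lazily as the audio is decoded.
segments, info = model.transcribe("audio.wav", language="ca")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```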

**5. Fine-tuning degrades timestamp prediction**

Whisper uses timestamp tokens to indicate the timestamps of the transcribed text.

The training scripts available for fine-tuning do not generate timestamp tokens, so timestamp prediction degrades.

This is important since many people use Whisper models to create video subtitles, where timestamp prediction matters.
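
The degradation can be observed by requesting timestamps at inference time, for instance with the `transformers` ASR pipeline (a minimal sketch; the audio file is a placeholder), and comparing the chunks produced by an original and a fine-tuned checkpoint:

```python
# Minimal sketch: surfacing Whisper's timestamp predictions with the
# transformers ASR pipeline. Run it once with the OpenAI checkpoint and
# once with the fine-tuned one to compare the predicted timestamps.
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="softcatala/whisper-small-ca",
    chunk_length_s=30,  # needed for audio longer than 30 seconds
)

result = pipe("audio.wav", return_timestamps=True)
for chunk in result["chunks"]:
    start, end = chunk["timestamp"]
    print(f"[{start} -> {end}] {chunk['text']}")
```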

**Next steps**

There is the possibility that fine-tuning cannot improve Whisper, or can only do so in certain scenarios (domain adaptation). See [Nickolay Shmyrev](https://alphacephei.com/nsh/2023/01/15/whisper-finetuning.html)'s analysis and tests.

Potential next steps:

* The training scripts have to be improved to include timestamp tokens
* New trainings should include more corpora than just Common Voice (a mixing sketch follows this list)
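
A minimal sketch of the second bullet, mixing Common Voice with an additional corpus via `datasets.interleave_datasets` (the second dataset id, the shared column names, and the 70/30 mixing weights are all assumptions for illustration):

```python
# Minimal sketch: mixing Common Voice with a second Catalan corpus for a
# future training run. Both datasets must expose the same features before
# they can be interleaved.
from datasets import load_dataset, interleave_datasets, Audio

cv = load_dataset("mozilla-foundation/common_voice_11_0", "ca", split="train")
cv = cv.select_columns(["audio", "sentence"])

# Placeholder dataset id; substitute any speech corpus with audio +
# transcription columns renamed to match Common Voice.
extra = load_dataset("some-org/catalan-speech-corpus", split="train")
extra = extra.select_columns(["audio", "sentence"])

# Resample both corpora to the 16 kHz Whisper expects, then mix 70/30.
cv = cv.cast_column("audio", Audio(sampling_rate=16_000))
extra = extra.cast_column("audio", Audio(sampling_rate=16_000))
mixed = interleave_datasets([cv, extra], probabilities=[0.7, 0.3], seed=42)
```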