jordimas commited on
Commit
69769f1
·
1 Parent(s): 73215f9
Files changed (1)
  1. TRAINING.md +82 -0
TRAINING.md ADDED
@@ -0,0 +1,82 @@
# Background

The [original Whisper models](https://cdn.openai.com/papers/whisper.pdf) were trained on 680,000 hours of audio collected from the Internet, using large-scale weak supervision.
Human-curated datasets such as Common Voice or FLEURS were not used during training. The hypothesis is that fine-tuning Whisper
on human-curated datasets can improve quality for a particular language or domain.

These fine-tuned Whisper models for the Catalan language were created in early 2023 as part of the Hugging Face Whisper Sprint.

The models were fine-tuned using [these scripts](https://github.com/huggingface/community-events/tree/main/whisper-fine-tuning-event).

For fine-tuning we used version 11 of the Common Voice dataset.

# Learnings on fine-tuning Whisper models

The rest of this document shares what we learned, in the hope that it can benefit others. These notes are shared as is.

**1. Model improves when benchmarked against Common Voice**

The WER (word error rate) improves when the model is evaluated against the Common Voice test set. For example, the fine-tuned [small](https://huggingface.co/softcatala/whisper-small-ca) model started at a WER of 13 and finished at 8.5.
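
As a reminder of what these numbers measure, WER is the word-level edit distance (substitutions, deletions and insertions) between the reference transcript and the model output, divided by the number of reference words. The following is a minimal sketch in plain Python, not the evaluation code used by the fine-tuning scripts. The function name `wer` and the sample sentences are assumptions made for illustration, and no text normalization is applied.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words seen
    # so far (minus the current one) and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# Hypothetical example: one substituted word out of four reference words.
reference = "el gat dorm molt"
hypothesis = "el gat dormia molt"
print(f"WER: {100 * wer(reference, hypothesis):.2f}")  # prints "WER: 25.00"
```

The scores in this document are this ratio multiplied by 100, so going from 13 to 8.5 is a relative reduction of roughly 35%.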

**2. Model degrades according to human evaluation**

In human evaluation, the results for the fine-tuned Catalan models were disappointing. All the users who tested them (about half a dozen) reported that the fine-tuned models clearly perform worse than the original OpenAI models.

Our hypothesis is that the evaluation on Common Voice gives better results because the model has overfitted to that dataset and has lost generalization capabilities.

**3. Model degrades according to evaluation with other datasets**

A more extensive evaluation shows:

| Dataset | base | sc-base | small | sc-small | medium | sc-medium |
| ----------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- |
| 15GdH9-curt | 55.60 | 70.40 | 36.62 | 78.75 | 22.39 | 49.53 |
| Ona_catalan-balear | 71.28 | 71.01 | 44.68 | 49.20 | 28.72 | 56.65 |
| Son_Goku_catalan_valencian_voice | 51.90 | 85.44 | 39.87 | 65.19 | 18.99 | 71.52 |
| Universal_Declaration_of_Human_Rights | 47.12 | 36.45 | 39.14 | 75.59 | 44.37 | 27.79 |

As you can see, the fine-tuned models obtain a worse (higher) WER than the original models in most cases.

Legend:
* "sc-" indicates a Softcatalà fine-tuned model
* The scores are WER (lower is better)
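
To make the comparison concrete, the relative WER change of a fine-tuned model against its original counterpart can be computed as (fine-tuned - original) / original. The sketch below applies this to the base and sc-base scores from the table above. The helper name `relative_wer_change` is an assumption made for illustration.

```python
# WER scores copied from the first two columns of the table above:
# (original base model, Softcatalà fine-tuned base model).
scores = {
    "15GdH9-curt": (55.60, 70.40),
    "Ona_catalan-balear": (71.28, 71.01),
    "Son_Goku_catalan_valencian_voice": (51.90, 85.44),
    "Universal_Declaration_of_Human_Rights": (47.12, 36.45),
}


def relative_wer_change(original: float, fine_tuned: float) -> float:
    """Relative change in percent; positive means the fine-tuned model is worse."""
    return 100 * (fine_tuned - original) / original


for dataset, (base, sc_base) in scores.items():
    change = relative_wer_change(base, sc_base)
    print(f"{dataset}: {change:+.1f}%")
```

A positive percentage means the fine-tuned model makes more errors than the original one on that recording. The same calculation can be repeated for the small and medium columns.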


**4. Inference clients for fine-tuned Whisper models provide different quality**

Summary as of March 2023:

**a**. The OpenAI Whisper implementation does not support inference on fine-tuned models out of the box, only on the OpenAI models.

**b**. The Hugging Face Whisper implementation performs poorly. This can be really misleading when doing evaluations, since Hugging Face is the stack used for fine-tuning.

In our experiments:

| Whisper Client | WER |
| ----------- | ----------- |
| OpenAI | 27.32 |
| Whisper.cpp 1.2.1 | 38.89 |
| Hugging Face | 93.54 |
| CTranslate2 3.10.3 | 43.68 |

We strongly recommend using CTranslate2 as the inference client.


**5. Fine-tuning degrades timestamp prediction**

Whisper uses timestamp tokens to indicate the timestamps of the transcribed text.

The training scripts available for fine-tuning do not generate timestamp tokens, so timestamp prediction degrades.

This matters because many people use Whisper models to create video subtitles, where timestamp prediction is important.
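
For context, when timestamps are enabled Whisper emits special tokens of the form `<|0.00|>` (seconds from the start of the audio window) before and after each segment of text. The sketch below parses such a decoded string into (start, end, text) segments, which is the information a subtitle file needs. The sample string and the helper name `parse_segments` are hypothetical. This illustrates the token format only and is not code from the training scripts.

```python
import re

# A timestamp token looks like <|12.34|>.
TIMESTAMP = re.compile(r"<\|(\d+\.\d+)\|>")


def parse_segments(decoded: str) -> list[tuple[float, float, str]]:
    """Split text decoded with timestamp tokens into (start, end, text) segments."""
    # re.split with one capture group returns [text, time, text, time, ...],
    # so the odd indexes hold the timestamps.
    parts = TIMESTAMP.split(decoded)
    segments = []
    start = None
    for index in range(1, len(parts), 2):
        time = float(parts[index])
        text = parts[index - 1].strip()
        # Text enclosed between two timestamps forms one segment.
        if start is not None and text:
            segments.append((start, time, text))
        start = time
    return segments


# Hypothetical decoded output containing timestamp tokens.
sample = "<|0.00|> Bon dia a tothom.<|2.40|><|2.40|> Comencem.<|4.10|>"
for start, end, text in parse_segments(sample):
    print(f"{start:.2f} --> {end:.2f}  {text}")
```

If the model emits no timestamp tokens at all, `parse_segments` returns an empty list, and there is nothing left to align the subtitles with.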


**Next steps**

It is possible that fine-tuning cannot improve Whisper, or can only improve it in certain scenarios (such as domain adaptation). See [Nickolay Shmyrev](https://alphacephei.com/nsh/2023/01/15/whisper-finetuning.html)'s views and tests.

Potential next steps:

* The training scripts have to be improved to include timestamp tokens
* New training runs should include more corpora than just Common Voice
