---
language:
- en
tags:
- text2text-generation
license: mit
datasets:
- PeacefulData/HyPoradise-v0
library_name: transformers
pipeline_tag: text2text-generation
widget:
- text: |2-
    Generate the correct transcription for the following n-best list of ASR hypotheses:

    1. nebode also typically is symphons and an ankle surf leash 2. neboda is also typically is symphons and an ankle surf leash 3. nebode also typically is swim fins and an ankle surf leash 4. neboda also typically is symphons and an ankle surf leash 5. neboda is also typically is swim fins and an ankle surf leash
  example_title: Transcription correction
base_model:
- google/flan-t5-base
---

# FLANEC: Exploring FLAN-T5 for Post-ASR Error Correction

## Model Overview

FLANEC is an encoder-decoder model based on FLAN-T5, fine-tuned for post-Automatic Speech Recognition (ASR) error correction, also known as Generative Speech Error Correction (GenSEC). The model takes the n-best hypotheses produced by an ASR system and generates a single corrected transcription, improving the accuracy and grammaticality of the final output. FLANEC models are trained on diverse subsets of the [HyPoradise dataset](https://huggingface.co/datasets/PeacefulData/HyPoradise-v0), covering multiple ASR domains to provide robust, scalable error correction across different types of audio data.

FLANEC was developed for the **GenSEC Task 1 challenge at SLT 2024** (see the [challenge website](https://sites.google.com/view/gensec-challenge/home)).

## Model Checkpoints

**Cumulative Dataset (CD) models trained with full fine-tuning:**

- [FLANEC Base CD](https://huggingface.co/morenolq/flanec-base-cd): Base model with ~250 million parameters, fine-tuned for post-ASR correction on cumulative datasets.
- [FLANEC Large CD](https://huggingface.co/morenolq/flanec-large-cd): Large model with ~800 million parameters, fine-tuned for post-ASR correction on cumulative datasets.
- [FLANEC XL CD](https://huggingface.co/morenolq/flanec-xl-cd): Extra-large model with ~3 billion parameters, fine-tuned for post-ASR correction on cumulative datasets.

**Cumulative Dataset (CD) models trained with Low-Rank Adaptation (LoRA)** (see the loading sketch below):

- [FLANEC Base LoRA](https://huggingface.co/morenolq/flanec-base-cd-lora): Base model with ~250 million parameters, fine-tuned with LoRA on cumulative datasets.
- [FLANEC Large LoRA](https://huggingface.co/morenolq/flanec-large-cd-lora): Large model with ~800 million parameters, fine-tuned with LoRA on cumulative datasets.
- [FLANEC XL LoRA](https://huggingface.co/morenolq/flanec-xl-cd-lora): Extra-large model with ~3 billion parameters, fine-tuned with LoRA on cumulative datasets.
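
A minimal loading sketch for the LoRA checkpoints using the `peft` library, assuming the repositories store standard PEFT adapters on top of the corresponding `google/flan-t5-*` base model:

```python
# pip install transformers peft

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from peft import PeftModel

# Load the FLAN-T5 base model, then apply the LoRA adapter on top of it.
base = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")
model = PeftModel.from_pretrained(base, "morenolq/flanec-base-cd-lora")
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")

# Optionally merge the adapter weights into the base model for faster inference.
model = model.merge_and_unload()
```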

## Intended Use

FLANEC is designed for Generative Speech Error Correction (GenSEC): post-processing ASR outputs to correct grammatical and linguistic errors. The model supports **English** only.
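
A minimal inference sketch with `transformers`; the prompt mirrors the widget example above (the exact spacing of the n-best list is an assumption based on that example):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "morenolq/flanec-base-cd"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# n-best hypotheses from an ASR system, formatted as in the widget example.
hypotheses = [
    "nebode also typically is symphons and an ankle surf leash",
    "neboda is also typically is symphons and an ankle surf leash",
    "nebode also typically is swim fins and an ankle surf leash",
]
prompt = (
    "Generate the correct transcription for the following n-best list "
    "of ASR hypotheses:\n\n"
    + " ".join(f"{i + 1}. {h}" for i, h in enumerate(hypotheses))
)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```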

## Training Details

### Datasets

FLANEC is trained on the [HyPoradise dataset](https://huggingface.co/datasets/PeacefulData/HyPoradise-v0), which contains data from eight ASR domains:

1. **WSJ**: Business and financial news.
2. **ATIS**: Airline travel queries.
3. **CHiME-4**: Noisy speech.
4. **Tedlium-3**: TED talks.
5. **CV-accent**: Accented speech.
6. **SwitchBoard**: Conversational speech.
7. **LRS2**: BBC program audio.
8. **CORAAL**: African American English speech.

For more details, see the [HyPoradise paper](https://proceedings.neurips.cc/paper_files/paper/2023/hash/6492267465a7ac507be1f9fd1174e78d-Abstract-Datasets_and_Benchmarks.html).
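
A hedged sketch of pulling the dataset with the `datasets` library; whether the repository loads with the default configuration is an assumption, so check the dataset card for the actual file layout and splits:

```python
from datasets import load_dataset

# Assumption: the repository resolves with the default builder/configuration.
ds = load_dataset("PeacefulData/HyPoradise-v0")
print(ds)
```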

### Training Strategy

The model has been fine-tuned using both full fine-tuning and LoRA (Low-Rank Adaptation). Fine-tuning was performed at multiple model scales, ranging from 250M to 3B parameters, and both single-dataset (SD) and cumulative-dataset (CD) training approaches were employed to assess performance across different ASR domains.

For more information on the training strategy, refer to the SLT 2024 paper.
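
For illustration, a sketch of a LoRA setup with `peft`; the rank, alpha, dropout, and target modules below are placeholder assumptions, not the hyperparameters used in the paper:

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSeq2SeqLM

base = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

# Placeholder hyperparameters, for illustration only.
lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q", "v"],  # T5 attention query/value projections
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()
```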

## Citation

Please use the following citation to reference this work in your research:

**The citation will be updated once the SLT 2024 proceedings are published.**

```bibtex
@inproceedings{moreno2024flanec,
  title={FLANEC: Exploring FLAN-T5 for Post-ASR Error Correction},
  author={La Quatra, Moreno and Salerno, Valerio and Tsao, Yu and Siniscalchi, Sabato Marco},
  booktitle={Proceedings of the 2024 IEEE Workshop on Spoken Language Technology},
  year={2024}
}
```