---
tags:
- espnet
- audio
- automatic-speech-recognition
language: et
license: cc-by-4.0
---
# Estonian ESPnet2 ASR model
## Model description
This is a general-purpose Estonian ASR model trained in the Lab of Language Technology at TalTech.
## Intended uses & limitations
This model is intended for general-purpose speech recognition, such as broadcast conversations, interviews, talks, etc.
## How to use
```python
import soundfile

from espnet2.bin.asr_inference import Speech2Text

# Load the pretrained model together with its decoding parameters
model = Speech2Text.from_pretrained(
    "TalTechNLP/espnet2_estonian",
    lm_weight=0.6, ctc_weight=0.4, beam_size=60
)

# Read a sound file; the model expects 16 kHz audio
speech, rate = soundfile.read("speech.wav")
assert rate == 16000

# Decode: each n-best hypothesis is a (text, tokens, token_ids, hypothesis) tuple
nbests = model(speech)
text, *_ = nbests[0]
print(text)
```
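If your recordings are not already 16 kHz, they can be resampled before decoding. The sketch below uses librosa for resampling; librosa and the helper `load_16k_mono` are illustrative assumptions, not dependencies of this model.

```python
import librosa

# Hypothetical helper (assumes librosa is installed): load any audio file
# and resample it to the 16 kHz mono format the model expects.
def load_16k_mono(path: str):
    speech, _ = librosa.load(path, sr=16000, mono=True)
    return speech

speech = load_16k_mono("interview_44k.wav")
text, *_ = model(speech)[0]
print(text)
```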
#### Limitations and bias
Since this model was trained on mostly broadcast speech and texts from the web, it might have problems correctly decoding the following:
* Speech containing technical and other domain-specific terms
* Children's speech
* Non-native speech
* Speech recorded under very noisy conditions or with a microphone far from the speaker
* Very spontaneous and overlapping speech
## Training data
Acoustic training data:

| Type | Amount (h) |
|-----------------------|:------:|
| Broadcast speech | 591 |
| Spontaneous speech | 53 |
| Elderly speech corpus | 53 |
| Talks, lectures | 49 |
| Parliament speeches | 31 |
| *Total* | *761* |

Language model training data:
* Estonian National Corpus 2019
* OpenSubtitles
* Speech transcripts
## Training procedure
Standard ESPnet2 Conformer recipe.
## Evaluation results
### WER
|Dataset|Snt|Wrd|Corr (%)|Sub (%)|Del (%)|Ins (%)|Err (%)|S.Err (%)|
|---|---|---|---|---|---|---|---|---|
|decode_asr_lm_lm_large_valid.loss.ave_5best_asr_model_valid.acc.ave/aktuaalne2021.testset|2864|56575|93.1|4.5|2.4|2.0|8.9|63.4|
|decode_asr_lm_lm_large_valid.loss.ave_5best_asr_model_valid.acc.ave/jutusaated.devset|273|4677|93.9|3.6|2.4|1.2|7.3|46.5|
|decode_asr_lm_lm_large_valid.loss.ave_5best_asr_model_valid.acc.ave/jutusaated.testset|818|11093|94.7|2.7|2.5|0.9|6.2|45.0|
|decode_asr_lm_lm_large_valid.loss.ave_5best_asr_model_valid.acc.ave/www-trans.devset|1207|13865|82.3|8.5|9.3|3.4|21.2|74.1|
|decode_asr_lm_lm_large_valid.loss.ave_5best_asr_model_valid.acc.ave/www-trans.testset|1648|22707|86.4|7.6|6.0|2.5|16.1|75.7|
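For reference, the Err column is the word error rate: the sum of the substitution, deletion, and insertion rates relative to the number of reference words (Wrd), while S.Err is the share of sentences containing at least one error. A minimal check of that relationship on the aktuaalne2021.testset row:

```python
# Minimal sketch: WER is the sum of substitution, deletion, and insertion rates.
# The values below are copied from the aktuaalne2021.testset row above.
sub, dele, ins = 4.5, 2.4, 2.0   # percentages of reference words
wer = sub + dele + ins           # -> 8.9, matching the Err column
corr = 100.0 - sub - dele        # -> 93.1, matching the Corr column
print(f"WER: {wer:.1f}%  Corr: {corr:.1f}%")
```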
### BibTeX entry and citation info
#### Citing ESPnet
```bibtex
@inproceedings{watanabe2018espnet,
author={Shinji Watanabe and Takaaki Hori and Shigeki Karita and Tomoki Hayashi and Jiro Nishitoba and Yuya Unno and Nelson {Enrique Yalta Soplin} and Jahn Heymann and Matthew Wiesner and Nanxin Chen and Adithya Renduchintala and Tsubasa Ochiai},
title={{ESPnet}: End-to-End Speech Processing Toolkit},
year={2018},
booktitle={Proceedings of Interspeech},
pages={2207--2211},
doi={10.21437/Interspeech.2018-1456},
url={http://dx.doi.org/10.21437/Interspeech.2018-1456}
}
```