jlehecka committed
Commit 539eba1
1 Parent(s): e06d317

Update README.md

Files changed (1):
  1. README.md +35 -18

README.md CHANGED
@@ -9,36 +9,47 @@ license: "cc-by-nc-sa-4.0"

# wav2vec2-base-cs-50k
This is a monolingual Czech Wav2Vec 2.0 base model pre-trained on 50 thousand hours of Czech speech.
It was released along with the paper **A Comparative Analysis of Bilingual and Trilingual Wav2Vec Models for Automatic Speech Recognition in Multilingual Oral History Archives**, accepted to the INTERSPEECH 2024 conference.

## Paper
The pre-print of our paper is available at http://arxiv.org/abs/2407.17160.

### All pre-trained models released along with the paper
- [fav-kky/wav2vec2-base-cs-50k](https://huggingface.co/fav-kky/wav2vec2-base-cs-50k) (monolingual Czech)
- [fav-kky/wav2vec2-base-de-50k](https://huggingface.co/fav-kky/wav2vec2-base-de-50k) (monolingual German)
- [fav-kky/wav2vec2-base-cs-en-100k](https://huggingface.co/fav-kky/wav2vec2-base-cs-en-100k) (bilingual Czech+English)
- [fav-kky/wav2vec2-base-cs-de-100k](https://huggingface.co/fav-kky/wav2vec2-base-cs-de-100k) (bilingual Czech+German)
- [fav-kky/wav2vec2-base-en-de-100k](https://huggingface.co/fav-kky/wav2vec2-base-en-de-100k) (bilingual English+German)
- [fav-kky/wav2vec2-base-cs-en-de-150k](https://huggingface.co/fav-kky/wav2vec2-base-cs-en-de-150k) (trilingual Czech+English+German)
 
## Citation
If you find this model useful, please cite our paper:
```
@inproceedings{lehecka2024bitrilingual,
  title = {{A Comparative Analysis of Bilingual and Trilingual Wav2Vec Models for Automatic Speech Recognition in Multilingual Oral History Archives}},
  author = {Jan Lehe\v{c}ka and Josef V. Psutka and Lubo\v{s} \v{S}m\'{i}dl and Pavel Ircing and Josef Psutka},
  booktitle = {Proc. Interspeech 2024},
  note = {In Press},
  year = {2024},
  url = {https://arxiv.org/abs/2407.17160},
}
```

## Usage
This model does not have a tokenizer, as it was pre-trained on audio alone. To use it for speech recognition, a tokenizer must be created and the model [fine-tuned](https://huggingface.co/blog/fine-tune-wav2vec2-english) on labeled ASR data.
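Creating such a tokenizer typically starts from a character-level vocabulary extracted from the training transcripts, as in the linked fine-tuning tutorial. A minimal sketch (the transcripts are illustrative, not from the paper's data):

```python
import json

# Illustrative labeled transcripts (not from the paper's training data).
transcripts = ["dobry den", "jak se mas"]

# Collect the unique characters and assign integer ids.
chars = sorted(set("".join(transcripts)))
vocab = {c: i for i, c in enumerate(chars)}

# CTC tokenizers use "|" as the word delimiter instead of a space,
# plus dedicated unknown and padding tokens.
vocab["|"] = vocab.pop(" ")
vocab["[UNK]"] = len(vocab)
vocab["[PAD]"] = len(vocab)

# The resulting vocab.json can then back a Wav2Vec2CTCTokenizer.
with open("vocab.json", "w") as f:
    json.dump(vocab, f)
```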
Inputs must be 16 kHz mono audio files.

This model can be used, e.g., to extract per-frame contextual embeddings from audio:
```python
from transformers import Wav2Vec2Model, Wav2Vec2FeatureExtractor
import torchaudio

# (The middle of this snippet was elided in the diff; the lines below follow
# standard Hugging Face Wav2Vec2 usage and may differ from the original.)
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("fav-kky/wav2vec2-base-cs-50k")
model = Wav2Vec2Model.from_pretrained("fav-kky/wav2vec2-base-cs-50k")

# Load a 16 kHz mono audio file and turn it into model inputs of shape (1, num_samples).
speech_array, sampling_rate = torchaudio.load("/path/to/audio/file.wav")
inputs = feature_extractor(speech_array[0], sampling_rate=16_000, return_tensors="pt")["input_values"]

output = model(inputs)
embeddings = output.last_hidden_state.detach().numpy()[0]
```
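For orientation, the "per-frame" granularity is fixed by the base model's convolutional feature encoder, which downsamples 16 kHz audio by a factor of 320 (roughly one embedding per 20 ms). A small sketch, assuming the standard Wav2Vec2 base encoder's kernel/stride configuration, computes the expected number of frames:

```python
# Standard Wav2Vec2 base feature encoder: (kernel, stride) of each 1-D conv layer.
CONV_LAYERS = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]

def num_frames(num_samples: int) -> int:
    """How many embedding frames the encoder emits for a raw waveform."""
    n = num_samples
    for kernel, stride in CONV_LAYERS:
        n = (n - kernel) // stride + 1
    return n

print(num_frames(16_000))  # one second of 16 kHz audio -> 49 frames
```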

## Speech recognition results
After fine-tuning, the model scored the following results on public datasets:
- Czech portion of CommonVoice v16.0: **WER = 11.36%**

See our paper for details.
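WER (word error rate) is the word-level edit distance between the recognized hypothesis and the reference transcript, divided by the number of reference words. A minimal sketch (the function name and examples are illustrative):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[j] holds the edit distance between the first i reference words
    # and the first j hypothesis words, updated row by row.
    d = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, d[0] = d[0], i
        for j in range(1, len(hyp) + 1):
            cur = d[j]
            d[j] = min(
                prev + (ref[i - 1] != hyp[j - 1]),  # substitution / match
                d[j - 1] + 1,                       # insertion
                d[j] + 1,                           # deletion
            )
            prev = cur
    return d[-1] / len(ref)

print(wer("dobry den vsem", "dobry den"))  # one deleted word out of three -> 1/3
```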

## Related works