jlehecka committed
Commit 539eba1
1 Parent(s): e06d317

Update README.md

Files changed (1):
  1. README.md +35 -18

README.md CHANGED
@@ -9,36 +9,47 @@ license: "cc-by-nc-sa-4.0"

# wav2vec2-base-cs-50k
This is a monolingual Czech Wav2Vec 2.0 base model pre-trained on 50 thousand hours of Czech speech.
It was released along with the paper **A Comparative Analysis of Bilingual and Trilingual Wav2Vec Models for Automatic Speech Recognition in Multilingual Oral History Archives**, accepted to the INTERSPEECH 2024 conference.

## Paper
The pre-print of our paper is available at http://arxiv.org/abs/2407.17160.

### All pre-trained models released along with the paper
- [fav-kky/wav2vec2-base-cs-50k](https://huggingface.co/fav-kky/wav2vec2-base-cs-50k) (monolingual Czech)
- [fav-kky/wav2vec2-base-de-50k](https://huggingface.co/fav-kky/wav2vec2-base-de-50k) (monolingual German)
- [fav-kky/wav2vec2-base-cs-en-100k](https://huggingface.co/fav-kky/wav2vec2-base-cs-en-100k) (bilingual Czech+English)
- [fav-kky/wav2vec2-base-cs-de-100k](https://huggingface.co/fav-kky/wav2vec2-base-cs-de-100k) (bilingual Czech+German)
- [fav-kky/wav2vec2-base-en-de-100k](https://huggingface.co/fav-kky/wav2vec2-base-en-de-100k) (bilingual English+German)
- [fav-kky/wav2vec2-base-cs-en-de-150k](https://huggingface.co/fav-kky/wav2vec2-base-cs-en-de-150k) (trilingual Czech+English+German)
 
## Citation
If you find this model useful, please cite our paper:
```
@inproceedings{lehecka2024bitrilingual,
  title = {{A Comparative Analysis of Bilingual and Trilingual Wav2Vec Models for Automatic Speech Recognition in Multilingual Oral History Archives}},
  author = {Jan Lehe\v{c}ka and Josef V. Psutka and Lubo\v{s} \v{S}m\'{i}dl and Pavel Ircing and Josef Psutka},
  booktitle = {Proc. Interspeech 2024},
  note = {In Press},
  year = {2024},
  url = {https://arxiv.org/abs/2407.17160},
}
```

## Usage
This model does not have a tokenizer, as it was pre-trained on audio alone. To use it for speech recognition, a tokenizer must be created and the model [fine-tuned](https://huggingface.co/blog/fine-tune-wav2vec2-english) on labeled ASR data.
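Creating such a tokenizer typically starts from a character-level vocabulary extracted from the training transcripts, as in the linked fine-tuning tutorial. A minimal sketch (the transcripts are illustrative, not from the paper's data):

```python
import json

# Illustrative labeled transcripts (not from the paper's training data).
transcripts = ["dobry den", "jak se mas"]

# Collect the unique characters and assign integer ids.
chars = sorted(set("".join(transcripts)))
vocab = {c: i for i, c in enumerate(chars)}

# CTC tokenizers use "|" as the word delimiter instead of a space,
# plus dedicated unknown and padding tokens.
vocab["|"] = vocab.pop(" ")
vocab["[UNK]"] = len(vocab)
vocab["[PAD]"] = len(vocab)

# The resulting vocab.json can then back a Wav2Vec2CTCTokenizer.
with open("vocab.json", "w") as f:
    json.dump(vocab, f)
```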
Inputs must be 16 kHz mono audio files.

This model can be used, e.g., to extract per-frame contextual embeddings from audio:
```python
from transformers import Wav2Vec2Model, Wav2Vec2FeatureExtractor
import torchaudio

# (The middle of this snippet was elided in the diff; the lines below follow
# standard Hugging Face Wav2Vec2 usage and may differ from the original.)
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("fav-kky/wav2vec2-base-cs-50k")
model = Wav2Vec2Model.from_pretrained("fav-kky/wav2vec2-base-cs-50k")

# Load a 16 kHz mono audio file and turn it into model inputs of shape (1, num_samples).
speech_array, sampling_rate = torchaudio.load("/path/to/audio/file.wav")
inputs = feature_extractor(speech_array[0], sampling_rate=16_000, return_tensors="pt")["input_values"]

output = model(inputs)
embeddings = output.last_hidden_state.detach().numpy()[0]
```
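For orientation, the "per-frame" granularity is fixed by the base model's convolutional feature encoder, which downsamples 16 kHz audio by a factor of 320 (roughly one embedding per 20 ms). A small sketch, assuming the standard Wav2Vec2 base encoder's kernel/stride configuration, computes the expected number of frames:

```python
# Standard Wav2Vec2 base feature encoder: (kernel, stride) of each 1-D conv layer.
CONV_LAYERS = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]

def num_frames(num_samples: int) -> int:
    """How many embedding frames the encoder emits for a raw waveform."""
    n = num_samples
    for kernel, stride in CONV_LAYERS:
        n = (n - kernel) // stride + 1
    return n

print(num_frames(16_000))  # one second of 16 kHz audio -> 49 frames
```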

## Speech recognition results
After fine-tuning, the model scored the following results on public datasets:
- Czech portion of CommonVoice v16.0: **WER = 11.36%**

See our paper for details.
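WER (word error rate) is the word-level edit distance between the recognized hypothesis and the reference transcript, divided by the number of reference words. A minimal sketch (the function name and examples are illustrative):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[j] holds the edit distance between the first i reference words
    # and the first j hypothesis words, updated row by row.
    d = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, d[0] = d[0], i
        for j in range(1, len(hyp) + 1):
            cur = d[j]
            d[j] = min(
                prev + (ref[i - 1] != hyp[j - 1]),  # substitution / match
                d[j - 1] + 1,                       # insertion
                d[j] + 1,                           # deletion
            )
            prev = cur
    return d[-1] / len(ref)

print(wer("dobry den vsem", "dobry den"))  # one deleted word out of three -> 1/3
```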

## Related works