Finished finetuning 🎉


Files changed (4)

README.md +31 -186
language_model/3gram.bin +2 -2
language_model/unigrams.txt +2 -2
vocab.json +44 -1

README.md CHANGED

@@ -3,205 +3,50 @@ library_name: transformers
 language:
 - da
 license: openrail
-base_model: chcaa/xls-r-300m-danish
-datasets:
-- alexandrainst/coral
-metrics:
-- wer
-- cer
 model-index:
-- name: roest-315m
-  results:
-  - task:
-      name: Automatic Speech Recognition
-      type: automatic-speech-recognition
-    dataset:
-      name: CoRal read-aloud
-      type: alexandrainst/coral
-      split: test
-      args: read_aloud
-    metrics:
-    - name: CER
-      type: cer
-      value: 6.6% ± 0.2%
-    - name: WER
-      type: wer
-      value: 17.0% ± 0.4%
-  - task:
-      name: Automatic Speech Recognition
-      type: automatic-speech-recognition
-    dataset:
-      name: Danish Common Voice 17
-      type: mozilla-foundation/common_voice_17_0
-      split: test
-      args: da
-    metrics:
-    - name: CER
-      type: cer
-      value: 6.6% ± 0.6%
-    - name: WER
-      type: wer
-      value: 16.7% ± 0.8%
-pipeline_tag: automatic-speech-recognition
 ---
-# Røst-315m
-This is a Danish state-of-the-art speech recognition model, trained by [the Alexandra
-Institute](https://alexandra.dk/).
-Try it out in [our interactive demo](https://huggingface.co/spaces/alexandrainst/roest-demo)!
-## Quick Start
-Start by installing the required libraries:
-```shell
-$ pip install transformers kenlm pyctcdecode
-```
-Next you can use the model using the `transformers` Python package as follows:
-```python
->>> from transformers import pipeline
->>> audio = get_audio()  # 16kHz raw audio array
->>> transcriber = pipeline(model="alexandrainst/roest-315m")
->>> transcriber(audio)
-{'text': 'your transcription'}
-```
-## Evaluation Results
-We have evaluated both our and existing models on the CoRal test set as well as the
-Danish Common Voice 17 test set. To ensure as robust an evaluation as possible, we have
-bootstrapped the results 1000 times and report here the mean scores along with a 95%
-confidence interval (lower is better; best scores in **bold**, second-best in
-*italics*):
-| Model | Number of parameters | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) CER | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) WER | [Danish Common Voice 17](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0/viewer/da/test) CER | [Danish Common Voice 17](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0/viewer/da/test) WER |
-|:---|---:|---:|---:|---:|---:|
-| Røst-315m (this model) | 315M | **6.6%** | **17.0%** | 6.6% ± 0.6% | 16.7% ± 0.8% |
-| [chcaa/xls-r-300m-danish-nst-cv9](https://hf.co/chcaa/xls-r-300m-danish-nst-cv9) | 315M | 14.4% ± 0.3% | 36.5% ± 0.6% | **4.1% ± 0.5%** | **12.0% ± 0.8%** |
-| [mhenrichsen/hviske](https://hf.co/mhenrichsen/hviske) | 1540M | 14.2% ± 0.5% | 33.2% ± 0.7% | *5.2% ± 0.4%* | *14.2% ± 0.8%* |
-| [openai/whisper-large-v3](https://hf.co/openai/whisper-large-v3) | 1540M | *11.4% ± 0.3%* | *28.3% ± 0.6%* | *5.5% ± 0.4%* | *14.8% ± 0.8%* |
-| [openai/whisper-large-v2](https://hf.co/openai/whisper-large-v2) | 1540M | 13.9% ± 0.9% | 32.6% ± 1.2% | 7.2% ± 0.5% | 18.5% ± 0.9% |
-| [openai/whisper-large](https://hf.co/openai/whisper-large) | 1540M | 14.5% ± 0.3% | 35.4% ± 0.6% | 9.2% ± 0.5% | 22.9% ± 1.0% |
-| [openai/whisper-medium](https://hf.co/openai/whisper-medium) | 764M | 17.2% ± 1.3% | 40.5% ± 2.1% | 9.4% ± 0.5% | 24.0% ± 1.0% |
-| [openai/whisper-small](https://hf.co/openai/whisper-small) | 242M | 23.4% ± 1.2% | 55.2% ± 2.3% | 15.9% ± 1.0% | 38.9% ± 1.2% |
-| [openai/whisper-base](https://hf.co/openai/whisper-base) | 73M | 43.5% ± 3.1% | 89.3% ± 4.6% | 33.4% ± 4.7% | 71.4% ± 7.0% |
-| [openai/whisper-tiny](https://hf.co/openai/whisper-tiny) | 38M | 52.0% ± 2.5% | 103.7% ± 3.5% | 42.2% ± 3.9% | 83.6% ± 2.7% |
-### Detailed Evaluation Across Demographics on the CoRal Test Set
-![CER comparison plot](https://filedn.com/lRBwPhPxgV74tO0rDoe8SpH/coral/roest-xlsr-comparison-cer-plot.png)
-![WER comparison plot](https://filedn.com/lRBwPhPxgV74tO0rDoe8SpH/coral/roest-xlsr-comparison-wer-plot.png)
-## Training Data
-This model is the result of four different stages of training:
-  1. "Pretraining" on 436,000 hours of unlabelled multilingual publicly available data,
-     13,628 hours of which is Danish. Pretraining here means that the model learnt to
-     "fill in" gaps of raw audio - no transcriptions were used (or available) during
-     this process. The pretraining data is distributed as follows:
-     - 372,000 hours from [VoxPopuli](https://aclanthology.org/2021.acl-long.80/), being
-       speeches from the European Parliament in 23 European languages.
-       This includes 13,600 hours of Danish speech.
-     - 51,000 hours from [Multilingual
-       LibriSpeech](https://doi.org/10.21437/Interspeech.2020-2826), being audiobooks in
-       8 European languages. This does not include any Danish speech.
-     - 7,000 hours from [Common Voice 6](https://doi.org/10.48550/arXiv.1912.06670),
-       being read-aloud speech in 60 diverse languages. This does not include any Danish
-       speech.
-     - 6,600 hours from [VoxLingua107](https://doi.org/10.1109/SLT48900.2021.9383459),
-       being audio from YouTube videos in 107 languages. This includes 28 hours of
-       Danish speech.
-     - 1,000 hours from [BABEL](https://eprints.whiterose.ac.uk/152840/), being
-       conversational telephone speech in 17 African and Asian languages. This does not
-       include any Danish speech.
-  2. "Finetuning" on 373 hours of labelled Danish publicly available data. "Finetuning"
-     indicates that this stage of training was supervised, i.e. the model was trained on
-     both audio and transcriptions to perform the speech-to-text task (also known as
-     automatic speech recognition). The finetuning data is as follows:
-     - The read-aloud training split of the [CoRal
-       dataset](https://huggingface.co/datasets/alexandrainst/coral) (revision
-       fb20199b3966d3373e0d3a5ded2c5920c70de99c), consisting of 361 hours of Danish
-       read-aloud speech, diverse across dialects, accents, ages and genders.
-     - The Danish training split of the [Common Voice 17
-       dataset](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0),
-       consisting of 12 hours of Danish read-aloud speech.
-  3. An n-gram language model has been trained separately, and is used to guide the
-     transcription generation of the finetuned speech recognition model. This n-gram
-     language model has been trained on the following datasets:
-     - [Danish
-       Wikipedia](https://huggingface.co/datasets/alexandrainst/scandi-wiki/viewer/da)
-       (approximately 287,000 articles).
-     - [Danish Common Voice 17 training
-       split](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0/viewer/da)
-       (approximately 3,500 comments).
-     - [Danish
-       Reddit](https://huggingface.co/datasets/alexandrainst/scandi-reddit/viewer/da)
-       (approximately 5 million comments).
-     Note that all samples from the CoRal test dataset have been removed from all of
-     these datasets, to ensure that the n-gram model has not seen the test data.
-The first step was trained by [Babu et al.
-(2021)](https://doi.org/10.48550/arXiv.2111.09296) and the second and third step by
-[Nielsen et al. (2024)](https://huggingface.co/alexandrainst/roest-315m).
-The final product is then the combination of the finetuned model along with the n-gram
-model, and this is what is used when you use the model as mentioned in the Quick Start
-section above.
-## Intended use cases
-This model is intended to be used for Danish automatic speech recognition.
-Note that Biometric Identification is not allowed using the CoRal dataset and/or derived
-models. For more information, see addition 4 in our
-[license](https://huggingface.co/datasets/alexandrainst/roest-315m/blob/main/LICENSE).
-## Why the name Røst?
-Røst is both the [Danish word for the human
-voice](https://ordnet.dk/ddo/ordbog?query=r%C3%B8st) as well as being the name of [one
-of the cold-water coral reefs in
-Scandinavia](https://da.wikipedia.org/wiki/Koralrev#Koldtvandskoralrev).
-## License
-The dataset is licensed under a custom license, adapted from OpenRAIL-M, which allows
-commercial use with a few restrictions (speech synthesis and biometric identification).
-See
-[license](https://huggingface.co/datasets/alexandrainst/roest-315m/blob/main/LICENSE).
-## Creators and Funders
-The CoRal project is funded by the [Danish Innovation
-Fund](https://innovationsfonden.dk/) and consists of the following partners:
-- [Alexandra Institute](https://alexandra.dk/)
-- [University of Copenhagen](https://www.ku.dk/)
-- [Agency for Digital Government](https://digst.dk/)
-- [Alvenir](https://www.alvenir.ai/)
-- [Corti](https://www.corti.ai/)
-## Citation
-We will submit a research paper soon, but until then, if you use this model in your
-research or development, please cite it as follows:
-```bibtex
-@dataset{coral2024,
-  author    = {Dan Saattrup Nielsen, Sif Bernstorff Lehmann, Simon Leminen Madsen, Anders Jess Pedersen, Anna Katrine van Zee, Anders Søgaard and Torben Blach},
-  title     = {CoRal: A Diverse Danish ASR Dataset Covering Dialects, Accents, Genders, and Age Groups},
-  year      = {2024},
-  url       = {https://hf.co/datasets/alexandrainst/coral},
-}
-```

 language:
 - da
 license: openrail
+base_model: facebook/wav2vec2-xls-r-300m
+tags:
+- generated_from_trainer
 model-index:
+- name: roest-315m-xlsr
+  results: []
 ---
+<!-- This model card has been generated automatically according to the information the Trainer had access to. You
+should probably proofread and complete it, then remove this comment. -->
+# roest-315m-xlsr
+This model is a fine-tuned version of [facebook/wav2vec2-xls-r-300m](https://huggingface.co/facebook/wav2vec2-xls-r-300m) on an unknown dataset.
+## Model description
+More information needed
+## Intended uses & limitations
+More information needed
+## Training and evaluation data
+More information needed
+## Training procedure
+### Training hyperparameters
+The following hyperparameters were used during training:
+- learning_rate: 0.0001
+- train_batch_size: 256
+- eval_batch_size: 256
+- seed: 4242
+- optimizer: Adam with betas=(0.9,0.98) and epsilon=1e-08
+- lr_scheduler_type: cosine
+- lr_scheduler_warmup_steps: 1000
+- training_steps: 10000
+### Framework versions
+- Transformers 4.44.2
+- Pytorch 2.4.1+cu121
+- Datasets 3.0.0
+- Tokenizers 0.19.1
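
The hyperparameters in the new model card describe a cosine learning-rate schedule with 1,000 warmup steps over 10,000 training steps. As a rough illustration of what that implies, the sketch below computes the learning rate at a given step. It assumes linear warmup from zero followed by a half-cosine decay to zero, which is the usual behaviour of the `cosine` scheduler type in Transformers; `lr_at_step` is a hypothetical helper and not part of the training code in this commit.

```python
import math

# Values taken from the hyperparameter list in the model card.
PEAK_LR = 1e-4
WARMUP_STEPS = 1000
TOTAL_STEPS = 10000


def lr_at_step(step: int) -> float:
    """Hypothetical helper: learning rate after `step` optimizer updates."""
    if step < WARMUP_STEPS:
        # Linear warmup from 0 up to the peak learning rate (assumed).
        return PEAK_LR * step / WARMUP_STEPS
    # Half-cosine decay from the peak down to 0 over the remaining steps (assumed).
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 500, 1000, 5500, 10000):
    print(f"step {step:>5}: lr = {lr_at_step(step):.2e}")
```

Under these assumptions the rate peaks at step 1,000 and has fallen to half the peak by step 5,500, the midpoint of the decay phase.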

language_model/3gram.bin CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:321f15f840bab9e5ccb62d7a94174b03c85d01e9bc118d5fbe87dcd1e9a2270c
-size 1016037798

 version https://git-lfs.github.com/spec/v1
+oid sha256:3ec877a2f9dad4e51bfcbdd0e32884b64a7662f722c7f37c77ea91dc3dea65db
+size 750711338
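
The diff for `3gram.bin` only shows Git LFS pointer files, so the visible change is a new object hash and a new size. A minimal sketch of reading those pointers and comparing the sizes follows; `parse_lfs_pointer` is a hypothetical helper written for this illustration, and the pointer texts are copied from the diff above.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse a Git LFS pointer file into its key/value fields."""
    fields = {}
    for line in text.strip().splitlines():
        # Each pointer line has the form "<key> <value>".
        key, _, value = line.partition(" ")
        fields[key] = value
    fields["size"] = int(fields["size"])  # size is given in bytes
    return fields


OLD_3GRAM = """version https://git-lfs.github.com/spec/v1
oid sha256:321f15f840bab9e5ccb62d7a94174b03c85d01e9bc118d5fbe87dcd1e9a2270c
size 1016037798
"""

NEW_3GRAM = """version https://git-lfs.github.com/spec/v1
oid sha256:3ec877a2f9dad4e51bfcbdd0e32884b64a7662f722c7f37c77ea91dc3dea65db
size 750711338
"""

old = parse_lfs_pointer(OLD_3GRAM)
new = parse_lfs_pointer(NEW_3GRAM)
delta = new["size"] - old["size"]
print(f"3gram.bin: {old['size']} -> {new['size']} bytes ({delta / old['size']:+.1%})")
```

The pointer says nothing about why the file shrank; it only records that the stored object is different and smaller.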

language_model/unigrams.txt CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:51866dc0e5e69fa009540e723ca24499600b3767df241f7af9d8ae635128be22
-size 39660627

 version https://git-lfs.github.com/spec/v1
+oid sha256:683060ef402a6d88def5dc3ff15518b4d44e50ccb7ac12aad81a258d88fb5a72
+size 29668511

vocab.json CHANGED

@@ -1 +1,44 @@
-{"0": 0, "1": 1, "2": 2, "3": 3, "4": 4, "5": 5, "6": 6, "7": 7, "8": 8, "9": 9, "a": 10, "b": 11, "c": 12, "d": 13, "e": 14, "f": 15, "g": 16, "h": 17, "i": 18, "j": 19, "k": 20, "l": 21, "m": 22, "n": 23, "o": 24, "p": 25, "q": 26, "r": 27, "s": 28, "t": 29, "u": 30, "v": 31, "w": 32, "x": 33, "y": 34, "z": 35, "|": 36, "\u00e5": 37, "\u00e6": 38, "\u00e9": 39, "\u00f8": 40, "\u00fc": 41}

+{
+  "0": 0,
+  "1": 1,
+  "2": 2,
+  "3": 3,
+  "4": 4,
+  "5": 5,
+  "6": 6,
+  "7": 7,
+  "8": 8,
+  "9": 9,
+  "a": 10,
+  "b": 11,
+  "c": 12,
+  "d": 13,
+  "e": 14,
+  "f": 15,
+  "g": 16,
+  "h": 17,
+  "i": 18,
+  "j": 19,
+  "k": 20,
+  "l": 21,
+  "m": 22,
+  "n": 23,
+  "o": 24,
+  "p": 25,
+  "q": 26,
+  "r": 27,
+  "s": 28,
+  "t": 29,
+  "u": 30,
+  "v": 31,
+  "w": 32,
+  "x": 33,
+  "y": 34,
+  "z": 35,
+  "|": 36,
+  "å": 37,
+  "æ": 38,
+  "é": 39,
+  "ø": 40,
+  "ü": 41
+}
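
The `vocab.json` change appears to be formatting only: the single-line file with `\uXXXX` escapes becomes a pretty-printed file with literal characters, while the token-to-id mapping looks unchanged. A small sketch of how that could be checked is given below. It uses abbreviated excerpts of the two versions (an assumption made for brevity; the real files contain all 42 entries).

```python
import json

# Abbreviated excerpt of the old single-line file, with \uXXXX escapes.
OLD_TEXT = r'{"z": 35, "|": 36, "\u00e5": 37, "\u00e6": 38, "\u00e9": 39, "\u00f8": 40, "\u00fc": 41}'

# Abbreviated excerpt of the new pretty-printed file, with literal characters.
NEW_TEXT = """{
  "z": 35,
  "|": 36,
  "å": 37,
  "æ": 38,
  "é": 39,
  "ø": 40,
  "ü": 41
}"""

old_vocab = json.loads(OLD_TEXT)
new_vocab = json.loads(NEW_TEXT)

# Both spellings decode to the same token-to-id mapping.
print("identical mapping:", old_vocab == new_vocab)

# Token ids in the excerpt form a contiguous range.
ids = sorted(new_vocab.values())
print("contiguous ids:", ids == list(range(ids[0], ids[-1] + 1)))
```

Running the same comparison on the full files from both revisions would confirm whether any token id actually moved.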