Wav2Vec2-Large-XLSR-53-German-GPT2
This is an encoder-decoder model for automatic speech recognition trained on on the MOZILLA-FOUNDATION/COMMON_VOICE_7_0 - DE dataset. The encoder was initialized from jonatasgrosman/wav2vec2-large-xlsr-53-german and the decoder from dbmdz/german-gpt2.
It was trained using a two step process:
- fine-tuning only the cross-attention weights and the decoder using the pre-computed outputs of the Wav2Vec-Modell
- relatively fast training
- also works on small GPU (eg. 8 GB)
- but may take a lot of disk space
- should already yield decent results
- fine-tuning the model end-to-end
- much slower
- needs a bigger GPU
There is also one trick, which seemed to improve performance significantly: adding position embeddings to the
encoder outputs and initializing them with the pre-trained position embeddings of the GPT2 model (See eval.py
).
The training notebooks are still early drafts. Also results can probably improved a lot by using for example a learning rate schedule.
- Downloads last month
- 8
Inference Providers
NEW
This model is not currently available via any of the supported third-party Inference Providers, and
the model is not deployed on the HF Inference API.
Dataset used to train jsnfly/wav2vec2-large-xlsr-53-german-gpt2
Evaluation results
- Test WER on Common Voice 7self-reported10.020
- Test CER on Common Voice 7self-reported4.700