LuxemBERT is a BERT model for the Luxembourgish language.

It was trained using 6.1 million Luxembourgish sentences from various sources, including the Luxembourgish Wikipedia, the Leipzig Corpora Collection, and rtl.lu. In addition, we partially translated 6.1 million sentences from the German Wikipedia from German into Luxembourgish as a means of data augmentation. This gave us a dataset of 12.2 million sentences, which we used to train our LuxemBERT model.
If you would like to use our model, please cite our paper:
```
@InProceedings{lothritz-EtAl:2022:LREC,
  author    = {Lothritz, Cedric and Lebichot, Bertrand and Allix, Kevin and Veiber, Lisa and Bissyande, Tegawende and Klein, Jacques and Boytsov, Andrey and Lefebvre, Clément and Goujon, Anne},
  title     = {LuxemBERT: Simple and Practical Data Augmentation in Language Model Pre-Training for Luxembourgish},
  booktitle = {Proceedings of the Language Resources and Evaluation Conference},
  month     = {June},
  year      = {2022},
  address   = {Marseille, France},
  publisher = {European Language Resources Association},
  pages     = {5080--5089},
  abstract  = {Pre-trained Language Models such as BERT have become ubiquitous in NLP where they have achieved state-of-the-art performance in most NLP tasks. While these models are readily available for English and other widely spoken languages, they remain scarce for low-resource languages such as Luxembourgish. In this paper, we present LuxemBERT, a BERT model for the Luxembourgish language that we create using the following approach: we augment the pre-training dataset by considering text data from a closely related language that we partially translate using a simple and straightforward method. We are then able to produce the LuxemBERT model, which we show to be effective for various NLP tasks: it outperforms a simple baseline built with the available Luxembourgish text data as well the multilingual mBERT model, which is currently the only option for transformer-based language models in Luxembourgish. Furthermore, we present datasets for various downstream NLP tasks that we created for this study and will make available to researchers on request.},
  url       = {https://aclanthology.org/2022.lrec-1.543}
}
```