LuxemBERT is a BERT model for the Luxembourgish language.

It was trained using 6.1 million Luxembourgish sentences from various sources, including the Luxembourgish Wikipedia, the Leipzig Corpora Collection, and rtl.lu. In addition, we partially translated 6.1 million sentences from the German Wikipedia from German into Luxembourgish as a means of data augmentation. This gave us a dataset of 12.2 million sentences, which we used to train our LuxemBERT model.
If you would like to use our model, please cite our paper:
```
@InProceedings{lothritz-EtAl:2022:LREC,
  author    = {Lothritz, Cedric and Lebichot, Bertrand and Allix, Kevin and Veiber, Lisa and Bissyande, Tegawende and Klein, Jacques and Boytsov, Andrey and Lefebvre, Clément and Goujon, Anne},
  title     = {LuxemBERT: Simple and Practical Data Augmentation in Language Model Pre-Training for Luxembourgish},
  booktitle = {Proceedings of the Language Resources and Evaluation Conference},
  month     = {June},
  year      = {2022},
  address   = {Marseille, France},
  publisher = {European Language Resources Association},
  pages     = {5080--5089},
  abstract  = {Pre-trained Language Models such as BERT have become ubiquitous in NLP where they have achieved state-of-the-art performance in most NLP tasks. While these models are readily available for English and other widely spoken languages, they remain scarce for low-resource languages such as Luxembourgish. In this paper, we present LuxemBERT, a BERT model for the Luxembourgish language that we create using the following approach: we augment the pre-training dataset by considering text data from a closely related language that we partially translate using a simple and straightforward method. We are then able to produce the LuxemBERT model, which we show to be effective for various NLP tasks: it outperforms a simple baseline built with the available Luxembourgish text data as well the multilingual mBERT model, which is currently the only option for transformer-based language models in Luxembourgish. Furthermore, we present datasets for various downstream NLP tasks that we created for this study and will make available to researchers on request.},
  url       = {https://aclanthology.org/2022.lrec-1.543}
}
```