wissamantoun committed
Commit • 9f6d74c
Parent(s): 87071b6
added citation
README.md CHANGED
@@ -63,7 +63,7 @@ Follow the guide linked [here](https://towardsdatascience.com/fine-tuning-gpt2-o
 
 ## Finetuning using our code with TF 1.15.4:
 
-
+Create the Training TFRecords:
 ```bash
 python create_pretraining_data.py
  --input_file=<RAW TEXT FILE with documents/article sperated by an empty line>
@@ -71,7 +71,7 @@ python create_pretraining_data.py
  --tokenizer_dir=<Directory with the GPT2 Tokenizer files>
 ```
 
-
+Finetuning:
 ```bash
 python3 run_pretraining.py \
  --input_file="gs://<GS_BUCKET>/pretraining_data/*" \
@@ -119,7 +119,7 @@ The pretraining data used for the new AraGPT2 model is also used for **AraBERTv2
 
 The dataset consists of 77GB or 200,095,961 lines or 8,655,948,860 words or 82,232,988,358 chars (before applying Farasa Segmentation)
 
-For the new dataset we added the unshuffled OSCAR corpus
+For the new dataset we added the unshuffled OSCAR corpus after we thoroughly filter it, to the dataset used in AraBERTv1 but without the websites that we previously crawled:
 - OSCAR unshuffled and filtered.
 - [Arabic Wikipedia dump](https://archive.org/details/arwiki-20190201) from 2020/09/01
 - [The 1.5B words Arabic Corpus](https://www.semanticscholar.org/paper/1.5-billion-words-Arabic-Corpus-El-Khair/f3eeef4afb81223df96575adadf808fe7fe440b4)
@@ -133,13 +133,18 @@ The text generated by AraGPT2 is automatically generated by a neural network mod
 
 # If you used this model please cite us as :
 
 ```
-@
-
-
-
-
-
-
+@inproceedings{antoun-etal-2021-aragpt2,
+    title = "{A}ra{GPT}2: Pre-Trained Transformer for {A}rabic Language Generation",
+    author = "Antoun, Wissam and
+      Baly, Fady and
+      Hajj, Hazem",
+    booktitle = "Proceedings of the Sixth Arabic Natural Language Processing Workshop",
+    month = apr,
+    year = "2021",
+    address = "Kyiv, Ukraine (Virtual)",
+    publisher = "Association for Computational Linguistics",
+    url = "https://www.aclweb.org/anthology/2021.wanlp-1.21",
+    pages = "196--207",
 }
 ```
 
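For orientation, a minimal sketch of the two-step flow the edited README section describes: create the training TFRecords, then finetune with the TF 1.15.4 pretraining script. The file names and bucket name below are placeholders of my own, and both scripts take further arguments that fall outside the diff hunks shown above, so treat this as an illustration rather than the repository's full invocation.

```bash
# Sketch only: limited to the arguments visible in the diff above; both scripts
# require additional options documented in the repository's full README.

# Step 1 - turn raw text (documents/articles separated by an empty line) into training TFRecords.
python create_pretraining_data.py \
 --input_file=arabic_corpus.txt \
 --tokenizer_dir=./aragpt2_tokenizer

# Step 2 - finetune, reading the TFRecords from a GCS bucket.
python3 run_pretraining.py \
 --input_file="gs://my-bucket/pretraining_data/*"
```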