# AraT5-MSAizer
This model is a fine-tuned version of UBC-NLP/AraT5v2-base-1024 for translating five regional Arabic dialects into Modern Standard Arabic (MSA).
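The snippet below is a minimal inference sketch, assuming the standard Hugging Face seq2seq API and the model id `Murhaf/AraT5-MSAizer`; the beam size and the example sentence are illustrative assumptions, not settings documented here.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "Murhaf/AraT5-MSAizer"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# Example dialectal input ("How are you today?"); any of the five
# supported dialects can be used.
text = "شلونك اليوم؟"

# Length limits mirror the settings described under "Training procedure".
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
outputs = model.generate(**inputs, max_length=512, num_beams=5)  # beam size is an assumption
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```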
## Intended uses & limitations
This model was developed for participation in Task 2 (Dialect to MSA Machine Translation) at the 6th Workshop on Open-Source Arabic Corpora and Processing Tools. It has been evaluated only on the development and test sets provided by the task organizers.
## Training and evaluation data
The model was fine-tuned on a blend of four datasets. Three of them comprise 'gold' parallel MSA-dialect sentence pairs; the fourth, considered 'silver', was generated by back-translating MSA sentences into dialect.
### Gold parallel corpora
- The Multi-Arabic Dialects Application and Resources (MADAR)
- The North Levantine Corpus
- The Parallel Arabic DIalect Corpus (PADIC)
### Synthetic data
- A back-translated subset of the Arabic sentences in OPUS (see the sketch below)
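To make the back-translation step concrete, here is a hypothetical sketch of how 'silver' pairs can be produced; the MSA-to-dialect model id is a placeholder, not the system the authors actually used.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

bt_model_id = "org/msa-to-dialect"  # placeholder id, not the authors' model
tokenizer = AutoTokenizer.from_pretrained(bt_model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(bt_model_id)

def back_translate(msa_sentences):
    """Turn monolingual MSA sentences into synthetic (dialect, MSA) pairs."""
    inputs = tokenizer(msa_sentences, return_tensors="pt",
                       padding=True, truncation=True, max_length=1024)
    outputs = model.generate(**inputs, max_length=512)
    dialect = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    # The synthetic dialect side becomes the training source; the original
    # MSA sentence remains the target.
    return list(zip(dialect, msa_sentences))
```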
## Evaluation results
BLEU score on the development split of Task 2 (Dialect to MSA Machine Translation) at the 6th Workshop on Open-Source Arabic Corpora and Processing Tools:

| Model | BLEU |
|---|---|
| AraT5-MSAizer | 0.2302 |
Official evaluation results on the held-out test split:

| Model | BLEU | COMET-DA |
|---|---|---|
| AraT5-MSAizer | 0.2179 | 0.0016 |
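For reference, scoring with sacrebleu would look like the following sketch; the file names are placeholders, and note that sacrebleu reports BLEU on a 0-100 scale, so the 0-1 values in the tables above correspond to `score / 100`.

```python
import sacrebleu

# Placeholder files: one system output and one MSA reference per line.
with open("dev.hyp", encoding="utf-8") as f:
    hypotheses = f.read().splitlines()
with open("dev.ref", encoding="utf-8") as f:
    references = f.read().splitlines()

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(bleu.score / 100)  # rescale from 0-100 to the 0-1 scale used above
```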
## Training procedure
The model was trained by fully fine-tuning UBC-NLP/AraT5v2-base-1024 for a single epoch. The maximum input length was set to 1024 tokens (the same as in the original pre-trained model), while the maximum generation length was set to 512.
### Training hyperparameters
The following hyperparameters were used during training (a sketch of the corresponding training arguments follows the list):
- learning_rate: 2e-05
- train_batch_size: 32
- eval_batch_size: 32
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- warmup_ratio: 0.05
- num_epochs: 1
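A minimal sketch of these settings expressed as Hugging Face `Seq2SeqTrainingArguments`; the `output_dir` and the `predict_with_generate`/`generation_max_length` wiring are assumptions, while the optimizer and scheduler values match the list above.

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="arat5-msaizer",       # placeholder output path
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    seed=42,
    lr_scheduler_type="linear",
    warmup_ratio=0.05,
    num_train_epochs=1,
    predict_with_generate=True,       # decode during evaluation
    generation_max_length=512,        # matches the generation limit above
)
```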
The full training script and configuration can be found at https://github.com/Murhaf/AraT5-MSAizer.
### Framework versions
- Transformers 4.38.1
- PyTorch 2.0.1
- Datasets 2.17.1
- Tokenizers 0.15.2