IA-Transliterated

IA-Transliterated is a multilingual RoBERTa model pre-trained exclusively on the monolingual corpora of 11 Indian languages from the Indo-Aryan language family. Text in all of these languages is transliterated into the Devanagari script. The model is subsequently evaluated on a diverse set of tasks.

The 11 languages covered by IA-Transliterated are: Bhojpuri, Bengali, Gujarati, Hindi, Magahi, Marathi, Nepali, Oriya, Punjabi, Sanskrit, Urdu.
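
The Unicode blocks for the major Brahmi-derived scripts share a parallel layout, so one common way to map text into Devanagari is a fixed code-point offset. The sketch below illustrates that idea for Bengali, Gurmukhi (Punjabi), Gujarati, and Oriya. It is an assumption made for illustration, not necessarily the procedure used to build IA-Transliterated, and it does not cover Urdu, whose Perso-Arabic script needs a dedicated transliterator. The names `SOURCE_BLOCK_STARTS` and `to_devanagari` are hypothetical.

```python
# Rough script conversion by Unicode block offset (illustrative only).
DEVANAGARI_START = 0x0900
BLOCK_SIZE = 0x80  # each of these Indic script blocks spans 128 code points

# First code point of each source script block, keyed by language code.
SOURCE_BLOCK_STARTS = {
    "bn": 0x0980,  # Bengali
    "pa": 0x0A00,  # Gurmukhi (Punjabi)
    "gu": 0x0A80,  # Gujarati
    "or": 0x0B00,  # Oriya
}


def to_devanagari(text: str, lang: str) -> str:
    """Shift characters from the source script block into the Devanagari block.

    Characters outside the source block (spaces, digits, Latin text) are
    left unchanged. Script-specific characters without a true Devanagari
    counterpart are shifted blindly, so this is only an approximation.
    """
    start = SOURCE_BLOCK_STARTS[lang]
    out = []
    for ch in text:
        offset = ord(ch) - start
        if 0 <= offset < BLOCK_SIZE:
            out.append(chr(DEVANAGARI_START + offset))
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    # Gujarati spelling of "Gujarat", written with escapes for portability.
    print(to_devanagari("\u0a97\u0ac1\u0a9c\u0ab0\u0abe\u0aa4", "gu"))
```

A production pipeline would more likely rely on an established transliteration library, since a pure offset ignores script-specific letters and orthographic conventions.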

The code can be found here. For more information, check out our paper.

Pretraining Corpus

We pre-trained IA-Transliterated on publicly available monolingual corpora. The languages are distributed in the corpus as follows:

Language	# Sentences	# Tokens (Total)	# Tokens (Unique)
Hindi (hi)	1552.89	20,098.73	25.01
Bengali (bn)	353.44	4,021.30	6.5
Sanskrit (sa)	165.35	1,381.04	11.13
Urdu (ur)	153.27	2,465.48	4.61
Marathi (mr)	132.93	1,752.43	4.92
Gujarati (gu)	131.22	1,565.08	4.73
Nepali (ne)	84.21	1,139.54	3.43
Punjabi (pa)	68.02	945.68	2.00
Oriya (or)	17.88	274.99	1.10
Bhojpuri (bh)	10.25	134.37	1.13
Magahi (mag)	0.36	3.47	0.15
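
To make the imbalance in the table concrete, the following sketch computes each language's share of the total sentence count. The figures are copied from the "# Sentences" column; the variable names are illustrative.

```python
# Sentence counts per language, copied from the "# Sentences" column.
sentences = {
    "hi": 1552.89, "bn": 353.44, "sa": 165.35, "ur": 153.27,
    "mr": 132.93, "gu": 131.22, "ne": 84.21, "pa": 68.02,
    "or": 17.88, "bh": 10.25, "mag": 0.36,
}

total = sum(sentences.values())

# Fraction of all sentences contributed by each language.
shares = {lang: count / total for lang, count in sentences.items()}

# Print languages from largest to smallest share.
for lang, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{lang:>3}  {share:6.2%}")
```

Hindi accounts for roughly 58% of the sentences, while Magahi contributes well under 0.1%.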

Evaluation Results

IA-Transliterated is evaluated on IndicGLUE and some additional tasks. For more details about the tasks, refer to the paper.

Downloads

The model can be downloaded from Hugging Face.
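
A checkpoint published on the Hugging Face Hub can typically be loaded with the `transformers` library. The sketch below assumes the model is exposed as a standard masked-language-model checkpoint. The Hub identifier is deliberately left as a parameter because it is not stated here, and `load_masked_lm` is a hypothetical helper name.

```python
def load_masked_lm(model_id: str):
    """Return (tokenizer, model) for a masked-LM checkpoint on the Hub.

    `model_id` is the Hub repository name of the checkpoint; it must be
    supplied by the caller because it is not given in this document.
    """
    # Imported lazily so this module can be imported without transformers.
    from transformers import AutoModelForMaskedLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForMaskedLM.from_pretrained(model_id)
    return tokenizer, model


# Example (requires network access and the transformers package):
# tokenizer, model = load_masked_lm("<hub-model-id>")
```

Because the model was pre-trained on Devanagari text, input in other scripts would presumably need to be transliterated into Devanagari before tokenization.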

Citing

If you use any of these resources, please cite the following article:

@inproceedings{dhamecha-etal-2021-role,
    title = "Role of {L}anguage {R}elatedness in {M}ultilingual {F}ine-tuning of {L}anguage {M}odels: {A} {C}ase {S}tudy in {I}ndo-{A}ryan {L}anguages",
    author = "Dhamecha, Tejas  and
      Murthy, Rudra  and
      Bharadwaj, Samarth  and
      Sankaranarayanan, Karthik  and
      Bhattacharyya, Pushpak",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2021",
    address = "Online and Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.emnlp-main.675",
    doi = "10.18653/v1/2021.emnlp-main.675",
    pages = "8584--8595",
}

Contributors

Tejas Dhamecha
Rudra Murthy
Samarth Bharadwaj
Karthik Sankaranarayanan
Pushpak Bhattacharyya

Contact

Rudra Murthy (rmurthyv@in.ibm.com)