KuBERT: Central Kurdish BERT Model
Introduction
KuBERT-Central-Kurdish-BERT-Model applies the BERT architecture to the Central Kurdish language. The project responds to the scarcity of resources and computational models for Kurdish, a language with substantial linguistic diversity.
Data Acquisition for Model Training
Data collection is a significant hurdle in training deep learning models, especially for low-resource languages like Kurdish. Complex models such as BERT need large amounts of text to train well, and the scarcity of digital resources makes gathering Kurdish data harder than for many other languages. To build a comprehensive training dataset for Kurdish, text was compiled from several sources.
Corpus Compilation
Three main corpora were utilized to train the Kurdish BERT model, amounting to 296.5 million tokens:
- AsoSoft corpus: With 188 million tokens, it includes data from websites, textbooks, and magazines.
- AramRafeq and Muhammad Azizi corpus: A collection of over 60 million tokens gathered from Kurdish websites.
- Oscar 2019 corpus: Contributes a further 48.5 million tokens.
Together, these corpora give KuBERT training text drawn from websites, textbooks, and magazines, covering a broad range of written Kurdish.
Overview
The project applies BERT to Kurdish language data. Training incorporates a Kurdish-specific tokenizer and various classifiers, demonstrating that BERT can adapt to the linguistic intricacies of Kurdish. The tokenizer and model can be loaded with the transformers library:
from transformers import BertTokenizer, BertModel

# Load the Kurdish tokenizer and the KuBERT weights from the Hugging Face Hub
tokenizer = BertTokenizer.from_pretrained('asosoft/KuBERT-Central-Kurdish-BERT-Model')
model = BertModel.from_pretrained('asosoft/KuBERT-Central-Kurdish-BERT-Model')
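As a rough sketch of how the loaded tokenizer and model might be used, the function below turns sentences into fixed-size vectors by mean-pooling the final hidden states. The function name `embed_sentences`, the mean-pooling strategy, and the `max_length=256` default (borrowed from the Max Token Length setting in this README) are illustrative assumptions, not the authors' documented pipeline.

```python
MODEL_ID = "asosoft/KuBERT-Central-Kurdish-BERT-Model"


def embed_sentences(sentences, model_id=MODEL_ID, max_length=256):
    """Return one mean-pooled vector per sentence (illustrative helper, not from the paper)."""
    # Imported lazily so that defining the function does not require the heavy dependencies.
    import torch
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained(model_id)
    model = BertModel.from_pretrained(model_id)
    model.eval()  # disable dropout for inference

    batch = tokenizer(
        sentences,
        padding=True,
        truncation=True,
        max_length=max_length,
        return_tensors="pt",
    )
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # (batch, seq_len, hidden_size)

    # Zero out padding positions, then average over the real tokens only.
    mask = batch["attention_mask"].unsqueeze(-1).to(hidden.dtype)
    summed = (hidden * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1)
    return summed / counts  # (batch, hidden_size)
```

Calling `embed_sentences` with a list of Central Kurdish sentences downloads the weights on first use and returns a tensor with one row per sentence. Mean pooling is only one option; using the `[CLS]` vector or a task-specific classifier head are common alternatives.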
Contributions
The integration of BERT represents a significant step forward in computational linguistics for Kurdish, providing a much-needed benchmark for future NLP efforts in under-represented languages. By leveraging a large corpus of Kurdish text, this project addresses critical gaps in language processing tools for Kurdish.
Training Details
The BERT model is fine-tuned on the curated Kurdish dataset and then evaluated, so that it can be applied to a range of linguistic tasks such as the sentiment analysis application named in the accompanying paper.
Final Remarks
This README summarizes the KuBERT-Central-Kurdish-BERT-Model project, its data acquisition efforts, and its use of BERT for the Kurdish language. For the model's capabilities and complete training details, consult the full documentation and accompanying study materials.
Relevant Links and References
- Oscar 2019 corpus: https://oscar-corpus.com/post/oscar-2019/
- AsoSoft Kurdish Text Corpus: https://github.com/AsoSoft/AsoSoft-Text-Corpus
- Kurdish Resources by Muhammad Azizi and AramRafeq: https://github.com/DevelopersTree/KurdishResources/
Training hyperparameters:
- Epochs: 3
- Max Token Length: 256
- Learning Rate: 1.00E-05
- Dropout Rate: 0.3
- Batch Size: 8
- GPU Utilization: Yes
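The settings listed above can be captured in a small configuration object, as in the minimal sketch below. The class and helper names are hypothetical. The keyword names in `trainer_kwargs` follow `transformers.TrainingArguments` and those in `tokenizer_kwargs` follow the Hugging Face tokenizer call, but whether the authors used those APIs is an assumption. The README does not say which layers the 0.3 dropout applies to, so it is stored but not mapped to any library argument.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingSettings:
    """Values copied from the hyperparameter list in this README."""
    epochs: int = 3
    max_token_length: int = 256
    learning_rate: float = 1.00e-05
    dropout_rate: float = 0.3  # target layers are not stated in the README
    batch_size: int = 8
    use_gpu: bool = True


def tokenizer_kwargs(cfg):
    # Keyword arguments accepted when calling a Hugging Face tokenizer.
    return {"truncation": True, "padding": "max_length", "max_length": cfg.max_token_length}


def trainer_kwargs(cfg):
    # Keyword names follow transformers.TrainingArguments (assumed tooling).
    return {
        "num_train_epochs": cfg.epochs,
        "learning_rate": cfg.learning_rate,
        "per_device_train_batch_size": cfg.batch_size,
    }


def total_optimizer_steps(num_examples, cfg):
    # One optimizer step per batch; no gradient accumulation is assumed.
    return math.ceil(num_examples / cfg.batch_size) * cfg.epochs


settings = TrainingSettings()
# HYPOTHETICAL dataset size, chosen only to illustrate the arithmetic.
hypothetical_examples = 10_000
print(trainer_kwargs(settings))
print(total_optimizer_steps(hypothetical_examples, settings))  # 1250 batches x 3 epochs = 3750
```

Keeping the values in one frozen dataclass makes it harder for the tokenizer length and the trainer settings to drift apart between experiments.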
The corpus data tables and the detailed methodology can be found in the full research paper and are summarized here for quick reference:
Corpus Data Tables Summary
Table 1: AsoSoft Kurdish Text Corpus
Source | Number of Tokens |
---|---|
Crawled From Websites | 95M |
Text Books | 45M |
Magazines | 48M |
Sum | 188M |
Table 2: Muhammad Azizi and AramRafeq Text Corpus
Source | Number of Tokens |
---|---|
Wikipedia | 13.5M |
Wishe Website | 11M |
Speemedia Website | 6.5M |
Kurdiu Website | 19M |
Dengiamerika Website | 2M |
Chawg Website | 8M |
Sum | 60M |
Table 3: The Kurdish Text Corpus Used to Train BERT
Corpus Name | Number of Tokens |
---|---|
Oscar 2019 corpus | 48.5M |
AsoSoft corpus | 188M |
Muhammad Azizi and AramRafeq corpus | 60M |
Sum | 296.5M |
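The totals in Tables 1 to 3 can be cross-checked with a few lines of Python. This is only a sanity-check sketch: the numbers are copied from the tables above, in millions of tokens, and the variable names are illustrative.

```python
# Token counts in millions, copied from Tables 1 to 3 of this README.
asosoft = {"Crawled From Websites": 95, "Text Books": 45, "Magazines": 48}
azizi_rafeq = {
    "Wikipedia": 13.5,
    "Wishe Website": 11,
    "Speemedia Website": 6.5,
    "Kurdiu Website": 19,
    "Dengiamerika Website": 2,
    "Chawg Website": 8,
}
oscar_2019 = 48.5

corpora = {
    "Oscar 2019 corpus": oscar_2019,
    "AsoSoft corpus": sum(asosoft.values()),
    "Muhammad Azizi and AramRafeq corpus": sum(azizi_rafeq.values()),
}
total = sum(corpora.values())

# Report each corpus with its share of the combined training text.
for name, tokens in corpora.items():
    print(f"{name}: {tokens}M tokens ({tokens / total:.1%} of the total)")
print(f"Total: {total}M tokens")
```

The per-table sums (188M and 60M) and the combined total (296.5M) match the figures reported in the tables.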
Cite
If you use our text corpus, please cite us.
Hadi Veisi, Kozhin Muhealddin Awlla, Abdulhady Abas Abdullah; KuBERT: Central Kurdish BERT Model and Its Application for Sentiment Analysis, fqy074, https://doi.org/10.1093/llc/fqy074
@article{veisi2020toward,
title={KuBERT: Central Kurdish BERT Model and Its Application for Sentiment Analysis},
author={Veisi, Hadi and muhealddin, Kozhin and Abas, Abdulhady},
journal={},
volume={35},
number={1},
pages={},
year={2024},
publisher={}
}