KuBERT: Central Kurdish BERT Model

Introduction

KuBERT-Central-Kurdish-BERT-Model harnesses the BERT framework to enhance computational linguistics for the Central Kurdish language. This initiative is a response to the scarcity of resources and computational models for Kurdish, which is a language with substantial linguistic diversity.

Data Acquisition for Model Training

Data collection is a significant hurdle in training deep learning models, especially for low-resource languages such as Kurdish. Complex models such as BERT need large amounts of text to perform well, and the scarcity of digital resources makes gathering Kurdish data harder than for many other languages. To build a sufficiently large Kurdish training dataset, text was compiled from several sources.

Corpus Compilation

Three main corpora were utilized to train the Kurdish BERT model, amounting to 296.5 million tokens:

  • AsoSoft corpus: With 188 million tokens, it includes data from websites, textbooks, and magazines.
  • AramRafeq and Muhammad Azizi corpus: A collection of over 60 million tokens gathered from Kurdish websites.
  • Oscar 2019 corpus: Comprising 48.5 million tokens, it further enriches the dataset.

Together, these corpora give the KuBERT model broad coverage of written Central Kurdish, spanning websites, textbooks, and magazines.
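As a quick check on the figures above, the following sketch sums the three corpus sizes and reports each corpus's share of the total. The dictionary and function names are illustrative and are not part of the project's code; the second corpus is described as "over 60 million" tokens, and 60M is assumed here.

```python
# Corpus sizes in millions, as listed above.
# Assumption: "over 60 million" is taken as exactly 60.0.
CORPUS_TOKENS_M = {
    "AsoSoft": 188.0,
    "AramRafeq and Muhammad Azizi": 60.0,
    "Oscar 2019": 48.5,
}


def corpus_shares(tokens_by_corpus):
    """Return (total, {name: percentage of total}) for a size mapping."""
    total = sum(tokens_by_corpus.values())
    shares = {name: 100.0 * n / total for name, n in tokens_by_corpus.items()}
    return total, shares


if __name__ == "__main__":
    total, shares = corpus_shares(CORPUS_TOKENS_M)
    print(f"Total: {total}M tokens")
    for name, pct in shares.items():
        print(f"{name}: {pct:.1f}%")
```

Under these numbers the AsoSoft corpus accounts for a little under two thirds of the training text.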

Overview

The project applies the BERT architecture to Central Kurdish language data. Model training incorporates a Kurdish-specific tokenizer and various classifiers, demonstrating BERT's adaptability to the language's linguistic intricacies. The tokenizer and model can be loaded with the transformers library:

from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('asosoft/KuBERT-Central-Kurdish-BERT-Model')
model = BertModel.from_pretrained('asosoft/KuBERT-Central-Kurdish-BERT-Model')
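Once the tokenizer and model are loaded, one common way to use them is to turn a sentence into a single vector by averaging the token embeddings. The sketch below is one possible approach, not the authors' documented method: the function names `masked_mean` and `embed_sentence` are hypothetical, mean pooling is an assumed choice, and the 256-token limit is borrowed from the "Max Token Length" value listed elsewhere in this card.

```python
MODEL_ID = "asosoft/KuBERT-Central-Kurdish-BERT-Model"


def masked_mean(token_vectors, attention_mask):
    """Average the token vectors whose attention-mask entry is 1.

    Both arguments are plain Python lists, so this helper has no
    dependency on torch or transformers.
    """
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def embed_sentence(text, model_name=MODEL_ID, max_length=256):
    """Return a mean-pooled sentence vector (hypothetical helper)."""
    # Imported lazily so masked_mean can be used without these packages.
    import torch
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained(model_name)
    model = BertModel.from_pretrained(model_name)
    model.eval()  # disable dropout for inference

    inputs = tokenizer(
        text, return_tensors="pt", truncation=True, max_length=max_length
    )
    with torch.no_grad():
        outputs = model(**inputs)

    # last_hidden_state has shape (batch, seq_len, hidden); take the one sentence.
    hidden = outputs.last_hidden_state[0].tolist()
    mask = inputs["attention_mask"][0].tolist()
    return masked_mean(hidden, mask)
```

Calling `embed_sentence` with a Central Kurdish sentence downloads the model on first use and returns a list of floats whose length equals the model's hidden size.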

Contributions

The integration of BERT represents a significant step forward in computational linguistics for Kurdish, providing a much-needed benchmark for future NLP efforts in under-represented languages. By leveraging a large corpus of Kurdish text, this project addresses critical gaps in language processing tools for Kurdish.

Training Details

The BERT model is fine-tuned on the curated Kurdish dataset. Training and evaluation prepare the model for a variety of downstream linguistic tasks.

Final Remarks

This README encapsulates the essence of the KuBERT-Central-Kurdish-BERT-Model project, its data acquisition efforts, and the innovative use of BERT for the Kurdish language. For a full understanding of the model's capabilities and comprehensive training details, the full documentation and accompanying study materials should be consulted.

Relevant Links and References


Training hyperparameters:

  • Epochs: 3
  • Max Token Length: 256
  • Learning Rate: 1.00E-05
  • Dropout Rate: 0.3
  • Batch Size: 8
  • GPU Utilization: Yes
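The hyperparameters listed above can be captured in a small configuration object, which also makes it easy to estimate how many optimizer steps a run takes. This is a minimal sketch: the class name `TrainingConfig` and its methods are hypothetical, and the card does not state which training stage these values apply to or how large the training set is.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    # Default values are taken from the hyperparameter list above.
    epochs: int = 3
    max_token_length: int = 256
    learning_rate: float = 1e-5
    dropout_rate: float = 0.3
    batch_size: int = 8
    use_gpu: bool = True

    def steps_per_epoch(self, num_examples: int) -> int:
        """Batches per epoch, counting a final partial batch."""
        return math.ceil(num_examples / self.batch_size)

    def total_steps(self, num_examples: int) -> int:
        """Optimizer steps over all epochs (no gradient accumulation assumed)."""
        return self.epochs * self.steps_per_epoch(num_examples)


if __name__ == "__main__":
    cfg = TrainingConfig()
    # 10_000 is a hypothetical dataset size used only for illustration.
    print(cfg.total_steps(10_000))
```

With a batch size of 8, the number of steps grows quickly with dataset size, which is worth keeping in mind when setting up a learning-rate schedule.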


The corpus data tables and the detailed methodology can be found in the full research paper and are summarized here for quick reference:

Corpus Data Tables Summary

Table 1: AsoSoft Kurdish Text Corpus

| Source                | Number of Tokens |
|-----------------------|------------------|
| Crawled From Websites | 95M              |
| Text Books            | 45M              |
| Magazines             | 48M              |
| Sum                   | 188M             |

Table 2: Muhammad Azizi and AramRafeq Text Corpus

| Source               | Number of Tokens |
|----------------------|------------------|
| Wikipedia            | 13.5M            |
| Wishe Website        | 11M              |
| Speemedia Website    | 6.5M             |
| Kurdiu Website       | 19M              |
| Dengiamerika Website | 2M               |
| Chawg Website        | 8M               |
| Sum                  | 60M              |

Table 3: The Kurdish Text Corpus Used to Train BERT

| Corpus Name                         | Number of Tokens |
|-------------------------------------|------------------|
| Oscar 2019 corpus                   | 48.5M            |
| AsoSoft corpus                      | 188M             |
| Muhammad Azizi and AramRafeq corpus | 60M              |
| Sum                                 | 296.5M           |

Cite

If you use our text corpus, please cite us.

Hadi Veisi, Kozhin muhealddin Awlla, Abdulhady Abas Abdullah; KuBERT: Central Kurdish BERT Model and Its Application for Sentiment Analysis, fqy074, https://doi.org/10.1093/llc/fqy074

@article{veisi2020toward,
  title={KuBERT: Central Kurdish BERT Model and Its Application for Sentiment Analysis},
  author={Veisi, Hadi and muhealddin, Kozhin and Abas, Abdulhady},
  journal={},
  volume={35},
  number={1},
  pages={},
  year={2024},
  publisher={}
}

Model size: 81.9M params (Safetensors, tensor type F32)