Kurmanji Tokenizer
This repository contains the Kurmanji Tokenizer, trained on a 50-million-token text corpus. It was developed specifically for the Kurmanji dialect of Kurdish, to support accurate and efficient tokenization in natural language processing tasks for this language.
Model Details
- Model Name: Kurmanji Tokenizer
- Language: Kurmanji Kurdish (kmr)
- Corpus Size: 50 million tokens
- Vocabulary Size: 52,000 tokens
- Tokenizer Type: Byte-Pair Encoding (BPE)
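To make the tokenizer type concrete, here is a minimal, character-level sketch of how Byte-Pair Encoding learns merge rules. It is a toy illustration only, not the training code used for this tokenizer: the function names (`merge_pair`, `learn_bpe_merges`, `apply_merges`), the four-word corpus, and the tie-breaking rule are all assumptions made for the example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` in `symbols` with the joined symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy version)."""
    # Each word starts as a tuple of single characters, counted by frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Most frequent pair wins; ties are broken lexicographically
        # (an arbitrary choice for this sketch, made for determinism).
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def apply_merges(word, merges):
    """Segment a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical toy corpus, chosen only for illustration.
toy_words = ["nav", "nav", "navê", "min"]
merges = learn_bpe_merges(toy_words, num_merges=2)
print(merges)                        # [('n', 'a'), ('na', 'v')]
print(apply_merges("navê", merges))  # ['nav', 'ê']
```

A production BPE tokenizer repeats this process until the target vocabulary size is reached (52,000 tokens here) and typically works on bytes or pre-tokenized text rather than raw characters.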
Training Data
The tokenizer was trained on a corpus of 50 million tokens of Kurmanji Kurdish text. The data covers a range of text types, which is intended to help the tokenizer handle diverse linguistic contexts.
Sources of the Corpus
- Text crawled from Kurmanji Kurdish websites
Usage
You can use this tokenizer with the Hugging Face transformers library:
from transformers import PreTrainedTokenizerFast
# Load the tokenizer
tokenizer = PreTrainedTokenizerFast.from_pretrained("asosoft/KurmanjiTokenizer-Whisper")
# Example usage: encode() returns a list of integer token IDs
text = "Navê min Ali ye."
token_ids = tokenizer.encode(text)
print(token_ids)
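Since `encode` returns integer IDs, it can help to view the subword pieces and check that decoding recovers the input. The helper below is a small sketch of that; `inspect_tokenization` is a hypothetical name introduced here, and it assumes only that the object passed in offers the standard `encode`, `convert_ids_to_tokens`, and `decode` methods of a Hugging Face tokenizer.

```python
def inspect_tokenization(tokenizer, text):
    """Return the token IDs, subword pieces, and decoded text for one string."""
    ids = tokenizer.encode(text)
    # Map each ID back to its vocabulary entry to see where the text was split.
    pieces = tokenizer.convert_ids_to_tokens(ids)
    # Decode without special tokens so the result can be compared to the input.
    decoded = tokenizer.decode(ids, skip_special_tokens=True)
    return {"ids": ids, "pieces": pieces, "decoded": decoded}
```

With the tokenizer loaded as shown above, `inspect_tokenization(tokenizer, "Navê min Ali ye.")` returns a dictionary with the three views of the same sentence. The exact pieces depend on the trained vocabulary, so no output is shown here.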