Kurmanji Tokenizer

This repository contains the Kurmanji Tokenizer, trained on a text corpus of 50 million tokens. It was developed specifically for the Kurmanji dialect of Kurdish, with the aim of accurate and efficient tokenization for natural language processing tasks in this language.

Model Details

  • Model Name: Kurmanji Tokenizer
  • Language: Kurmanji Kurdish (kmr)
  • Corpus Size: 50 million tokens
  • Vocabulary Size: 52,000 tokens
  • Tokenizer Type: Byte-Pair Encoding (BPE)
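
For readers unfamiliar with Byte-Pair Encoding, the following is a minimal, character-level sketch of how BPE merge rules are learned and applied. It is an illustration only and not the training code used for this tokenizer: the toy corpus and the function names (learn_bpe, apply_merge, encode) are hypothetical, and a production tokenizer with a 52,000-entry vocabulary would be trained with a dedicated library on the full corpus.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Fuse every adjacent occurrence of `pair` in a tuple of symbols."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Each distinct word becomes a tuple of characters, weighted by frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs in the corpus.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite the corpus with the most frequent pair fused into one symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[apply_merge(symbols, best)] += freq
        vocab = new_vocab
    return merges


def encode(word, merges):
    """Split a word into subword units by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return list(symbols)


# Hypothetical toy corpus; the real corpus has 50 million tokens.
corpus = ["nav", "navê", "navê", "min"]
merges = learn_bpe(corpus, num_merges=3)
print(encode("navê", merges))
print(encode("min", merges))
```

Frequent character sequences such as "navê" collapse into a single unit, while rarer words such as "min" stay split into smaller pieces. The vocabulary size listed above bounds how many such units the trained tokenizer keeps.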

Training Data

The tokenizer was trained on a corpus of 50 million tokens of Kurmanji Kurdish text. The corpus covers a range of text types so that the tokenizer can handle diverse linguistic contexts.

Sources of the Corpus

  • Text crawled from Kurmanji Kurdish websites

Usage

You can load this tokenizer with the Hugging Face transformers library:

from transformers import PreTrainedTokenizerFast

# Load the tokenizer
tokenizer = PreTrainedTokenizerFast.from_pretrained("asosoft/KurmanjiTokenizer-Whisper")

# Example usage: encode() returns a list of integer token IDs
text = "Navê min Ali ye."
token_ids = tokenizer.encode(text)
print(token_ids)