BPE Tokenizer for Nepali LLM
- This repository contains a Byte Pair Encoding (BPE) tokenizer trained using the Hugging Face transformers package on a ~30GB Nepali LLM dataset (IRIISNEPAL/Nepali-Text-Corpus + nepberta-dataset).
- The tokenizer is optimized for handling Nepali text and is intended for use in language modeling and other natural language processing tasks.
Overview
- Tokenizer Type: Byte Pair Encoding (BPE)
- Vocabulary Size: 50,006
- Dataset Used: Nepali LLM Datasets
Special tokens
<id> <token>
0: <|endoftext|>
1: <|unk|>
50000: <|begin_of_text|>
50001: <|end_of_text|>
50002: <|start_header_id|>
50003: <|end_header_id|>
50004: <|eot_id|>
50005: '\n\n'
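As a rough illustration of how the header and turn tokens might be combined, the sketch below builds a chat-style prompt string. The layout follows the Llama 3 chat convention, which is an assumption: this repository does not document a chat template, and `SPECIAL_TOKENS` and `build_prompt` are hypothetical names used only for this example.

```python
# Special tokens and their IDs, copied from the table above.
SPECIAL_TOKENS = {
    "<|endoftext|>": 0,
    "<|unk|>": 1,
    "<|begin_of_text|>": 50000,
    "<|end_of_text|>": 50001,
    "<|start_header_id|>": 50002,
    "<|end_header_id|>": 50003,
    "<|eot_id|>": 50004,
    "\n\n": 50005,
}


def build_prompt(messages):
    """Format (role, text) pairs in an ASSUMED Llama-3-style layout."""
    parts = ["<|begin_of_text|>"]
    for role, text in messages:
        # Role header, blank line, message body, then the end-of-turn marker.
        parts.append(
            f"<|start_header_id|>{role}<|end_header_id|>\n\n{text}<|eot_id|>"
        )
    return "".join(parts)


prompt = build_prompt([("user", "नमस्ते")])
print(prompt)
```

The highest ID in the table, 50005, is consistent with the stated vocabulary size of 50,006.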
Installation
To use the tokenizer, install the transformers library via pip:
pip install transformers
Usage
You can easily load the tokenizer using the following code:
from transformers import PreTrainedTokenizerFast
# Load the tokenizer
tokenizer = PreTrainedTokenizerFast.from_pretrained("Aananda-giri/NepaliBPE")
# Example usage
tokenizer.tokenize('राम ले भात खायो ।')
# ['राम</w>', 'ले</w>', 'भात</w>', 'खायो</w>', '।</w>']
tokenizer.encode('राम ले भात खायो ।')
# [1621, 285, 14413, 27675, 251]
tokenizer.decode([1621, 285, 14413, 27675, 251])
# राम ले भात खायो ।
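The `</w>` suffix in the tokenize output marks the end of a word. Here is a minimal sketch of how such tokens can be joined back into text, assuming every word-final token carries `</w>` and words are separated by single spaces. `join_bpe_tokens` is a hypothetical helper, not part of transformers. In practice, `tokenizer.decode` handles this, as the example above shows.

```python
def join_bpe_tokens(tokens):
    """Concatenate BPE tokens, turning each '</w>' marker into a space."""
    text = "".join(tokens).replace("</w>", " ")
    # The last word also ends in '</w>', so drop the trailing space.
    return text.strip()


tokens = ['राम</w>', 'ले</w>', 'भात</w>', 'खायो</w>', '।</w>']
print(join_bpe_tokens(tokens))
# राम ले भात खायो ।
```

Tokens without the suffix are word-internal pieces, so they are concatenated directly with the piece that follows.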