BPE Tokenizer for Nepali LLM

This repository contains a Byte Pair Encoding (BPE) tokenizer trained with the Hugging Face transformers package on ~30 GB of the Nepali LLM dataset (IRIISNEPAL/Nepali-Text-Corpus + nepberta-dataset).
The tokenizer is optimized for Nepali text and is intended for language modeling and other natural language processing tasks.

Overview

Tokenizer Type: Byte Pair Encoding (BPE)
Vocabulary Size: 50,006
Dataset Used: Nepali LLM Datasets

Special tokens

<id>        <token>

0:        <|endoftext|>
1:        <|unk|>
50000:    <|begin_of_text|>
50001:    <|end_of_text|>
50002:    <|start_header_id|>
50003:    <|end_header_id|>
50004:    <|eot_id|>
50005:    '\n\n'
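
The table above can be mirrored as a small lookup for filtering special IDs out of an encoded sequence. This is only a sketch: the helper names (`SPECIAL_TOKENS`, `is_special`, `strip_special`) are hypothetical and not part of the tokenizer, and it assumes that the `'\n\n'` entry denotes a literal double newline.

```python
# Special-token table copied from the list above.
# Assumption: the '\n\n' entry is a literal double-newline token.
SPECIAL_TOKENS = {
    0: "<|endoftext|>",
    1: "<|unk|>",
    50000: "<|begin_of_text|>",
    50001: "<|end_of_text|>",
    50002: "<|start_header_id|>",
    50003: "<|end_header_id|>",
    50004: "<|eot_id|>",
    50005: "\n\n",
}

VOCAB_SIZE = 50_006  # value stated in the Overview

# Reverse mapping: token string -> ID.
TOKEN_TO_ID = {token: token_id for token_id, token in SPECIAL_TOKENS.items()}


def is_special(token_id: int) -> bool:
    """Return True if the ID belongs to one of the listed special tokens."""
    return token_id in SPECIAL_TOKENS


def strip_special(ids: list[int]) -> list[int]:
    """Drop every listed special-token ID from an encoded sequence."""
    return [i for i in ids if not is_special(i)]


# IDs are zero-based, so the highest ID is one less than the vocabulary size.
print(max(SPECIAL_TOKENS) == VOCAB_SIZE - 1)
print(strip_special([50000, 1621, 285, 50004]))
```

The six tokens at IDs 50000 to 50005 sit directly after the first 50,000 entries, which is consistent with the stated vocabulary size of 50,006.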

Installation

To use the tokenizer, you need to install the transformers library. You can install it via pip:

pip install transformers

Usage

You can easily load the tokenizer using the following code:

from transformers import PreTrainedTokenizerFast

# Load the tokenizer
tokenizer = PreTrainedTokenizerFast.from_pretrained("Aananda-giri/NepaliBPE")

# Example usage
tokenizer.tokenize('राम ले भात खायो ।')
# ['राम</w>', 'ले</w>', 'भात</w>', 'खायो</w>', '।</w>']

tokenizer.encode('राम ले भात खायो ।')
# [1621, 285, 14413, 27675, 251]

tokenizer.decode([1621, 285, 14413, 27675, 251])
# राम ले भात खायो ।
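
The `</w>` suffix in the `tokenize` output appears to mark the end of a word, which is a common BPE convention. The sketch below shows how such tokens can be joined back into text under that assumption. The `detokenize` helper is hypothetical and is not the decoder the library actually uses, and the sub-word split in the second call is made up for illustration.

```python
END_OF_WORD = "</w>"  # suffix seen in the tokenize() output above


def detokenize(tokens: list[str]) -> str:
    """Join BPE tokens, turning each end-of-word marker into a space.

    Assumption: a token without the suffix is a word-internal piece and
    is glued directly onto the token that follows it.
    """
    text = "".join(tokens).replace(END_OF_WORD, " ")
    return text.strip()


# Tokens copied from the tokenize() example above.
tokens = ['राम</w>', 'ले</w>', 'भात</w>', 'खायो</w>', '।</w>']
print(detokenize(tokens))

# Hypothetical sub-word split, for illustration only: pieces of one word
# are concatenated without a space between them.
print(detokenize(['खा', 'यो</w>']))
```

In practice `tokenizer.decode` handles this step, so a helper like this is only useful for inspecting raw token strings.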
