Assamese Tokenizer (50K Vocabulary)
Model Details
This repository contains a custom tokenizer for the Assamese language with a vocabulary size of 50,000 tokens. The tokenizer was trained on the Assamese language subset of the CC-100 multilingual dataset. This tokenizer can be used for various Natural Language Processing (NLP) tasks involving the Assamese language.
Repository Details
- Repository Name: tamang0000/assamese-tokenizer-50k
- Tokenizer Vocabulary Size: 50,000 tokens
- Training Dataset: CC-100 Multilingual Dataset (Assamese Language Subset)
- Model Type: Tokenizer
- Framework: Hugging Face Transformers
- License: MIT License
Tokenizer Usage
You can load and use this tokenizer with the Hugging Face transformers library in your own projects.
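A minimal sketch of loading the tokenizer and round-tripping a sentence follows. It assumes the repository can be loaded through `AutoTokenizer.from_pretrained`, which the card does not state explicitly; the helper names `load_tokenizer`, `roundtrip`, and `main` and the sample sentence are illustrative only.

```python
REPO_ID = "tamang0000/assamese-tokenizer-50k"  # repository name from this card


def load_tokenizer(repo_id: str = REPO_ID):
    """Download (or read from cache) the tokenizer files for repo_id.

    Assumption: the repo is compatible with AutoTokenizer.
    """
    # Imported lazily so the module can be imported without transformers installed.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(repo_id)


def roundtrip(tokenizer, text: str):
    """Return (tokens, ids, decoded_text) for a single input string."""
    ids = tokenizer.encode(text)
    tokens = tokenizer.convert_ids_to_tokens(ids)
    decoded = tokenizer.decode(ids, skip_special_tokens=True)
    return tokens, ids, decoded


def main() -> None:
    tokenizer = load_tokenizer()
    # Illustrative sample text ("Assamese language"); replace with your own input.
    tokens, ids, decoded = roundtrip(tokenizer, "অসমীয়া ভাষা")
    print("tokens: ", tokens)
    print("ids:    ", ids)
    print("decoded:", decoded)
```

Calling `main()` needs the transformers package and network access on first use. The decoded text may differ from the input if the tokenizer's normalizer alters it.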
Training Details
- Dataset: The tokenizer was trained exclusively on the Assamese language subset of the CC-100 multilingual dataset.
- Vocabulary Size: 50,000 tokens.
- Normalization: Includes normalization steps such as lowercasing and stripping accents.
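The normalization bullet above can be illustrated with a small standard-library sketch. It assumes the common recipe of lowercasing, NFD decomposition, and dropping nonspacing marks (Unicode category Mn); the card does not confirm that this tokenizer's normalizer is implemented exactly this way, and the function name is hypothetical.

```python
import unicodedata


def lowercase_and_strip_accents(text: str) -> str:
    """Approximate 'lowercase + strip accents' normalization.

    Assumption: accents are removed by decomposing to NFD and dropping
    characters in Unicode category Mn (nonspacing marks).
    """
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(lowercase_and_strip_accents("Café"))  # prints "cafe"
```

Lowercasing has no effect on Assamese script itself, since it has no letter case. Several Assamese combining signs, such as the vowel sign ু (U+09C1), also belong to category Mn, so it may be worth checking how the published tokenizer treats them on real text.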