Assamese Tokenizer (50K Vocabulary)

Model Details

This repository contains a custom tokenizer for the Assamese language with a vocabulary size of 50,000 tokens. The tokenizer was trained on the Assamese language subset of the CC-100 multilingual dataset. This tokenizer can be used for various Natural Language Processing (NLP) tasks involving the Assamese language.

Repository Details

Repository Name: tamang0000/assamese-tokenizer-50k
Tokenizer Vocabulary Size: 50,000 tokens
Training Dataset: CC-100 Multilingual Dataset (Assamese Language Subset)
Model Type: Tokenizer
Framework: Hugging Face Transformers
License: MIT License

Tokenizer Usage

You can load and use this tokenizer with the Hugging Face transformers library.
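The card does not include the loading steps themselves, so the following is only a minimal sketch. It assumes the repository ships files that `AutoTokenizer` can read and that `transformers` is installed; the helper names `load_tokenizer` and `tokenize` are illustrative and not part of the repository.

```python
# Minimal sketch; helper names are illustrative, not defined by the repository.
REPO_ID = "tamang0000/assamese-tokenizer-50k"


def load_tokenizer(repo_id: str = REPO_ID):
    """Fetch the tokenizer files from the Hub (or the local cache) and load them."""
    # Imported lazily so this module can be imported without transformers installed.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(repo_id)


def tokenize(tokenizer, text: str):
    """Return the token strings and their vocabulary ids for `text`."""
    tokens = tokenizer.tokenize(text)
    ids = tokenizer.convert_tokens_to_ids(tokens)
    return tokens, ids
```

With network access, `tokenize(load_tokenizer(), "অসমীয়া ভাষা")` would return the subword pieces and ids for that phrase. The exact split depends on the trained vocabulary, so no output is shown here.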

Training Details

Dataset: The tokenizer was trained exclusively on the Assamese language subset of the CC-100 multilingual dataset.
Vocabulary Size: 50,000 tokens.
Normalization: Includes steps such as lowercasing and stripping accents.
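To illustrate what lowercasing and accent stripping usually mean, here is a standard-library sketch. It is an assumption about the general technique (decompose with NFD, then drop nonspacing marks), not the tokenizer's actual normalizer configuration, which this card does not show; the function name is hypothetical.

```python
import unicodedata


def lowercase_and_strip_accents(text: str) -> str:
    """Lowercase `text`, decompose it (NFD), and drop nonspacing marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(lowercase_and_strip_accents("Café"))  # cafe
```

Unicode category Mn also includes some Assamese signs, such as the virama (U+09CD), so it is worth checking how the published tokenizer treats Assamese text before assuming this sketch matches its behavior.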
