ArlowGPT Tokenizer
Overview
The ArlowGPT Tokenizer is a byte pair encoding (BPE) tokenizer developed from scratch, optimized for large-scale language modeling and text generation tasks. It features a vocabulary size of 59,575 tokens and supports a maximum context length of 131,072 tokens, making it suitable for handling extremely long documents and sequences.
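Assuming the tokenizer is published on the Hugging Face Hub in the standard `tokenizers`/`transformers`-compatible format under yuchenxie/ArlowGPT-Tokenizer (the repository linked in the citation below), loading and basic encoding might look roughly like this; treat it as a sketch rather than a verified snippet:

```python
from transformers import AutoTokenizer

# Repository name taken from the citation below; assumes a standard
# Hugging Face tokenizer export is available there.
tokenizer = AutoTokenizer.from_pretrained("yuchenxie/ArlowGPT-Tokenizer")

# These should reflect the 59,575-token vocabulary and the 131,072-token
# context window described in this card.
print(tokenizer.vocab_size)
print(tokenizer.model_max_length)

# Encode a short piece of text and round-trip it back to a string.
ids = tokenizer.encode("ArlowGPT is a byte pair encoding tokenizer.")
print(ids)
print(tokenizer.decode(ids))
```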
Key Features
- Vocabulary Size: 59,575 tokens
- Maximum Context Length: 131,072 tokens
- Tokenizer Type: Byte Pair Encoding (BPE)
- Special Tokens (see the usage sketch after this list):
  - <pad>: Padding token used for sequence alignment.
  - <mask>: Special token for masked language modeling tasks.
  - <eos>: End-of-sequence token.
  - <bos>: Beginning-of-sequence token.
- Trained From Scratch: Trained on a large corpus of English and multilingual text rather than adapted from an existing tokenizer.
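As a rough illustration of the special tokens listed above, the sketch below assumes the tokenizer configuration maps <bos>, <eos>, <pad>, and <mask> to the standard Hugging Face attributes (this mapping is an assumption, not something stated elsewhere in this card):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("yuchenxie/ArlowGPT-Tokenizer")

# Special tokens should surface through the standard attributes,
# assuming the config assigns <bos>/<eos>/<pad>/<mask> to these roles.
print(tokenizer.bos_token, tokenizer.eos_token, tokenizer.pad_token, tokenizer.mask_token)

# Padding a small batch to a common length uses <pad> for sequence alignment.
batch = tokenizer(["short prompt", "a somewhat longer prompt"], padding=True)
print(batch["input_ids"])
print(batch["attention_mask"])
```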
Training Data
The tokenizer was trained on Wikipedia, giving it broad coverage of general knowledge and domain-specific terminology. Although primarily optimized for English, it retains some multilingual capability because the training data includes non-English text.
Intended Use Cases
This tokenizer is designed for general-purpose language modeling and is suitable for tasks such as:
- Autoregressive text generation
- Long-context summarization
- Conversational AI
- Information retrieval over large documents
- General NLP tasks requiring long context processing (see the sketch below)
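For long-context use cases in particular, a minimal sketch of counting tokens in a large document and truncating it to the 131,072-token window (again assuming the Hugging Face-compatible packaging described above) could look like this:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("yuchenxie/ArlowGPT-Tokenizer")

# Placeholder: substitute any large text you want to process.
long_document = "..."

# Tokenize, truncating to the 131,072-token context window if necessary.
ids = tokenizer(long_document, truncation=True, max_length=131072)["input_ids"]
print(f"Document occupies {len(ids)} tokens (limit: 131,072).")
```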
Supported Languages
- Primary Language: English
- Secondary Support: Some multilingual content
Performance & Benchmarks
No formal benchmarks have been conducted yet, but the tokenizer has been designed for efficiency in both tokenization speed and memory usage, with a focus on handling extremely long contexts up to 131,072 tokens.
Limitations
- Multilingual Coverage: While the tokenizer includes some multilingual tokens, it is primarily optimized for English text, and performance on non-English languages may vary.
- No Benchmarked Metrics: The tokenizer has not undergone formal benchmarking for speed or performance across various tasks.
Citation
If you use the ArlowGPT Tokenizer in your work, please cite it as:
@misc{arlowgpt_tokenizer,
  title={ArlowGPT Tokenizer},
  author={yuchenxie},
  year={2025},
  howpublished={\url{https://huggingface.co/yuchenxie/ArlowGPT-Tokenizer}}
}