---
license: apache-2.0
language:
- en
library_name: transformers
---

# **ArlowGPT Tokenizer**

### Overview

The **ArlowGPT Tokenizer** is a byte pair encoding (BPE) tokenizer developed from scratch and optimized for large-scale language modeling and text generation tasks. It features a vocabulary size of **59,575 tokens** and supports a maximum context length of **131,072 tokens**, making it suitable for handling extremely long documents and sequences.

### Key Features

- **Vocabulary Size**: 59,575 tokens
- **Maximum Context Length**: 131,072 tokens
- **Tokenizer Type**: Byte Pair Encoding (BPE)
- **Special Tokens**:
  - Padding token used for sequence alignment.
  - Mask token used for masked language modeling tasks.
  - End-of-sequence token.
  - Beginning-of-sequence token.
- **Trained From Scratch**: The tokenizer was trained from scratch on a large corpus of English and multilingual text.

### Training Data

The tokenizer was trained on **Wikipedia**, providing broad coverage of general knowledge and domain-specific terminology. Although primarily optimized for English, it retains some multilingual capability due to the nature of the training data.

### Intended Use Cases

This tokenizer is designed for **general-purpose language modeling** and is suitable for tasks such as:

- Autoregressive text generation
- Long-context summarization
- Conversational AI
- Information retrieval over large documents
- General NLP tasks requiring long-context processing

### Supported Languages

- **Primary Language**: English
- **Secondary Support**: Some multilingual content

### Performance & Benchmarks

No formal benchmarks have been conducted yet, but the tokenizer is designed for efficiency in both tokenization speed and memory usage, with a focus on handling extremely long contexts of up to **131,072 tokens**.

### Limitations

- **Multilingual Coverage**: While the tokenizer includes some multilingual tokens, it is primarily optimized for English text, and performance on non-English languages may vary.
- **No Benchmarked Metrics**: The tokenizer has not undergone formal benchmarking for speed or performance across tasks.

### Citation

If you use the **ArlowGPT Tokenizer** in your work, please cite it as:

```
@misc{arlowgpt_tokenizer,
  title={ArlowGPT Tokenizer},
  author={yuchenxie},
  year={2025},
  howpublished={\url{https://huggingface.co/yuchenxie/ArlowGPT-Tokenizer}}
}
```
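
### Usage Example

As a quick illustration, the sketch below shows how a tokenizer published with `library_name: transformers` is typically loaded and applied. The repository id `yuchenxie/ArlowGPT-Tokenizer` is taken from the citation URL above; the sample text and the printed checks are illustrative, and the expected values in the comments simply restate the figures from this card.

```python
from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub.
# Repository id taken from the citation URL above; adjust if your copy lives elsewhere.
tokenizer = AutoTokenizer.from_pretrained("yuchenxie/ArlowGPT-Tokenizer")

# Sanity checks against the figures reported in this card (assumed, not benchmarked here).
print(tokenizer.vocab_size)        # expected to be around 59,575
print(tokenizer.model_max_length)  # expected to be 131,072

# Encode a sample passage and inspect the result.
text = "ArlowGPT uses byte pair encoding to split text into subword tokens."
encoding = tokenizer(text)

print(encoding["input_ids"])                                    # token ids
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))   # subword pieces
print(tokenizer.decode(encoding["input_ids"]))                  # round-trip back to text
```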