QiTianTokenizer-Large
QiTianTokenizer is a universal multilingual tokenizer primarily optimized for Chinese–English mixed text,
offering consistent and reversible tokenization across diverse languages and scripts.
It is designed as a general-purpose tokenizer, not tied to any specific model,
and fully compatible with the 🤗 Transformers ecosystem.
✨ Overview
| Property | Value |
|---|---|
| Name | QiTianTokenizer-Large |
| Type | Tokenizer-only repository |
| Purpose | General multilingual tokenization |
| Primary Languages | Chinese, English |
| Extended Support | Multilingual (Unicode-complete) |
| Architecture | Byte-level BPE |
| Vocabulary Size | 96,000 tokens |
| Fast Implementation | ✅ Available (QiTianTokenizerFast) |
| Framework | 🤗 transformers |
| License | Apache 2.0 |
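Because the vocabulary is byte-level BPE and Unicode-complete, any input string falls back to byte pieces instead of an unknown token. A minimal sketch illustrating this (the printed token counts are illustrative, not guaranteed):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Morton-Li/QiTianTokenizer-Large", trust_remote_code=True)

# Emoji and mixed scripts are covered via byte fallback, so no <unk>-style token is needed.
for sample in ["数学 × mathématiques", "🚀 QiTian", "Ελληνικά + 한국어"]:
    ids = tokenizer(sample)["input_ids"]
    print(sample, "->", len(ids), "tokens")
```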
🧩 QiTian Tokenizer Series
| Variant | Vocabulary Size | Description | Recommended Use |
|---|---|---|---|
| QiTianTokenizer-Tiny | 12k | Lightweight tokenizer designed for compact or embedded models. | On-device or low-resource tasks |
| QiTianTokenizer-Base | 32k | Balanced vocabulary offering solid coverage and efficiency for most multilingual use cases. | Recommended for general use |
| QiTianTokenizer-Medium | 64k | Optimal balance in language coverage — broad enough to capture fine-grained linguistic diversity while maintaining reasonable model complexity. | Recommended for multilingual and high-quality general-purpose models |
| QiTianTokenizer-Large | 96k | Extended multilingual vocabulary designed for diverse cross-lingual pretraining and high-capacity language models. | High-resource training |
| QiTianTokenizer-XLarge | 128k | Full-script and domain-extensive vocabulary for comprehensive multilingual modeling. | Research & large-scale pretraining |
All variants share consistent token definitions, special tokens, and compatible configurations.
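Because every variant shares the same special tokens and configuration layout, switching vocabulary sizes should not require code changes. A minimal sketch, assuming the sibling repositories follow the same `Morton-Li/QiTianTokenizer-<Variant>` naming pattern (only the -Large repository is referenced in this card):

```python
from transformers import AutoTokenizer

# Assumed repo IDs for sibling variants; only QiTianTokenizer-Large is confirmed by this card.
variant_repos = [
    "Morton-Li/QiTianTokenizer-Base",
    "Morton-Li/QiTianTokenizer-Large",
]

for repo_id in variant_repos:
    tok = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
    # Special tokens are expected to be identical across variants; only the vocabulary size differs.
    print(repo_id, len(tok), tok.all_special_tokens)
```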
⚙️ Usage
You can load this tokenizer directly with `AutoTokenizer`:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Morton-Li/QiTianTokenizer-Large", trust_remote_code=True)

# Tokenize a single Chinese–English mixed string
text = "你好,QiTian!"
tokens = tokenizer(text)
print(tokens["input_ids"])
```
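Since the tokenizer is described as reversible, a quick decode round-trip (continuing from the snippet above, with `tokenizer`, `text`, and `tokens` already defined) should recover the original string:

```python
# Decode back to text; byte-level BPE is expected to reproduce the input exactly.
decoded = tokenizer.decode(tokens["input_ids"], skip_special_tokens=True)
print(decoded)  # expected to equal `text`
```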
➕ Batch Example
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Morton-Li/QiTianTokenizer-Large", trust_remote_code=True)

# Tokenize a batch with padding and return PyTorch tensors
texts = ["Hello, 世界!", "QiTian is multilingual."]
batch_tokens = tokenizer(texts, padding=True, return_tensors="pt")
print(batch_tokens["input_ids"])
```
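For fixed-length batches, truncation and padding can be combined in the same call (continuing from the snippet above; `max_length=32` is just an illustrative value, and padding uses the `<\|pad\|>` token listed under Special Tokens):

```python
# Pad/truncate every sequence to exactly 32 tokens and return PyTorch tensors.
batch = tokenizer(
    texts,
    padding="max_length",
    truncation=True,
    max_length=32,
    return_tensors="pt",
)
print(batch["input_ids"].shape)       # torch.Size([2, 32])
print(batch["attention_mask"].shape)  # torch.Size([2, 32])
```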
📦 Files Included
| File | Description |
|---|---|
| `tokenizer.json` | Serialized fast tokenizer definition |
| `tokenizer_config.json` | Configuration (max length, padding side, etc.) |
| `special_tokens_map.json` | Special token mappings |
| `tokenizer.py` | Tokenizer implementation |
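As a sketch only, the serialized definition in `tokenizer.json` can also be loaded through the generic fast-tokenizer class, assuming the file has been downloaded locally. Note that this bypasses `tokenizer.py` and the special-token configuration, so `AutoTokenizer.from_pretrained(..., trust_remote_code=True)` remains the recommended path:

```python
from transformers import PreTrainedTokenizerFast

# Load only the serialized vocabulary and merges; special-token settings from
# tokenizer_config.json and special_tokens_map.json are NOT applied here.
fast_tok = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")
print(fast_tok("你好,QiTian!")["input_ids"])
```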
🔍 Special Tokens
| Token | Example | Purpose |
|---|---|---|
| `<\|bos\|>` | `<\|bos\|>` | Beginning of sequence (BOS) |
| `<\|eos\|>` | `<\|eos\|>` | End of sequence (EOS) |
| `<\|pad\|>` | `<\|pad\|>` | Padding token for batch alignment |
| `<\|mask\|>` | `<\|mask\|>` | Masked token for MLM-style objectives |
| `<\|user\|>` | `<\|user\|>` | Marks user message boundary in conversational data |
| `<\|assistant\|>` | `<\|assistant\|>` | Marks assistant message boundary |
| `<\|system\|>` | `<\|system\|>` | Defines system or meta-instruction context |
| `<\|think\|>` | `<\|think\|>` | Reasoning-phase delimiter marking the model's internal reasoning or structured thinking segment during inference |
All of these tokens are integrated into the tokenizer vocabulary and appear in `additional_special_tokens`.
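The role tokens suggest a simple conversational layout. This card does not define an official chat template, so the arrangement below is only an assumed sketch of how the role and EOS tokens might be combined:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Morton-Li/QiTianTokenizer-Large", trust_remote_code=True)

# Hypothetical prompt layout; the actual chat format is not specified in this card.
prompt = (
    "<|system|>You are a helpful assistant.<|eos|>"
    "<|user|>你好,请介绍一下QiTian。<|eos|>"
    "<|assistant|>"
)
ids = tokenizer(prompt)["input_ids"]
# Registered special tokens are kept as single IDs rather than being split into byte pieces.
print(ids)
```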
🔖 License
This tokenizer and its vocabulary are released under the Apache License 2.0. You are free to use, modify, and redistribute them under the same license terms.
📚 Citation
If you use QiTianTokenizer in your research or project, please cite it as:
```bibtex
@misc{QiTianTokenizer,
  title  = {QiTianTokenizer: A Universal Multilingual Tokenizer with Chinese–English Optimization},
  author = {Morton Li},
  year   = {2025},
}
```