# QiTianTokenizer-Large

QiTianTokenizer is a universal multilingual tokenizer primarily optimized for Chinese–English mixed text,
offering consistent and reversible tokenization across diverse languages and scripts.
It is designed as a general-purpose tokenizer: it is not tied to any specific model
and is fully compatible with the 🤗 Transformers ecosystem.


## ✨ Overview

| Property | Value |
|---|---|
| Name | QiTianTokenizer-Large |
| Type | Tokenizer-only repository |
| Purpose | General multilingual tokenization |
| Primary Languages | Chinese, English |
| Extended Support | Multilingual (Unicode-complete) |
| Architecture | Byte-level BPE |
| Vocabulary Size | 96,000 tokens |
| Fast Implementation | ✅ Available (`QiTianTokenizerFast`) |
| Framework | 🤗 transformers |
| License | Apache 2.0 |

## 🧩 QiTian Tokenizer Series

| Variant | Vocabulary Size | Description | Recommended Use |
|---|---|---|---|
| QiTianTokenizer-Tiny | 12k | Lightweight tokenizer designed for compact or embedded models. | On-device or low-resource tasks |
| QiTianTokenizer-Base | 32k | Balanced vocabulary offering solid coverage and efficiency for most multilingual use cases. | Recommended for general use |
| QiTianTokenizer-Medium | 64k | Balanced language coverage: broad enough to capture fine-grained linguistic diversity while keeping model complexity reasonable. | Recommended for multilingual and high-quality general-purpose models |
| QiTianTokenizer-Large | 96k | Extended multilingual vocabulary designed for diverse cross-lingual pretraining and high-capacity language models. | High-resource training |
| QiTianTokenizer-XLarge | 128k | Full-script and domain-extensive vocabulary for comprehensive multilingual modeling. | Research & large-scale pretraining |

All variants share consistent token definitions, special tokens, and compatible configurations.
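
If the other variants are published under matching repository IDs (an assumption based on this card's naming), you can check the shared definitions directly. A minimal sketch:

```python
from transformers import AutoTokenizer

# The -Base repo ID follows the naming pattern of this card (an assumption).
large = AutoTokenizer.from_pretrained("Morton-Li/QiTianTokenizer-Large", trust_remote_code=True)
base = AutoTokenizer.from_pretrained("Morton-Li/QiTianTokenizer-Base", trust_remote_code=True)

# Shared special-token definitions should agree across variants.
assert large.special_tokens_map == base.special_tokens_map
```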


## ⚙️ Usage

You can load this tokenizer directly with `AutoTokenizer`:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Morton-Li/QiTianTokenizer-Large", trust_remote_code=True)

# Example
text = "你好,QiTian!"
tokens = tokenizer(text)
print(tokens["input_ids"])
```

## ➕ Batch Example

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Morton-Li/QiTianTokenizer-Large", trust_remote_code=True)

# Example
texts = ["Hello, 世界!", "QiTian is multilingual."]
batch_tokens = tokenizer(texts, padding=True, return_tensors="pt")
print(batch_tokens["input_ids"])
```
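
When batching variable-length inputs, the returned `attention_mask` separates real tokens from `<|pad|>` positions, and truncation bounds sequence length. A small sketch continuing from the batch example (the `max_length` value here is an arbitrary illustration, not a limit of this tokenizer):

```python
# Padding yields an attention_mask; truncation caps long sequences.
# max_length=16 is arbitrary for illustration, not a tokenizer limit.
batch = tokenizer(texts, padding=True, truncation=True, max_length=16, return_tensors="pt")
print(batch["attention_mask"])  # 1 = real token, 0 = padding
```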

## 📦 Files Included

| File | Description |
|---|---|
| `tokenizer.json` | Serialized fast tokenizer definition |
| `tokenizer_config.json` | Configuration (max length, padding side, etc.) |
| `special_tokens_map.json` | Special token mappings |
| `tokenizer.py` | Tokenizer implementation |

## 🔍 Special Tokens

| Token | Purpose |
|---|---|
| `<\|bos\|>` | Beginning of sequence (BOS) |
| `<\|eos\|>` | End of sequence (EOS) |
| `<\|pad\|>` | Padding token for batch alignment |
| `<\|mask\|>` | Masked token for MLM-style objectives |
| `<\|user\|>` | Marks a user message boundary in conversational data |
| `<\|assistant\|>` | Marks an assistant message boundary |
| `<\|system\|>` | Defines system or meta-instruction context |
| `<\|think\|>` | Reasoning-phase delimiter; marks the model's internal reasoning or structured thinking segment during inference |

All tokens are integrated into the tokenizer vocabulary and appear in `additional_special_tokens`.
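
If you format conversational data yourself, the role tokens can delimit turns. The layout below is a hypothetical illustration, not an official chat template shipped with this tokenizer; it continues from the usage snippet above:

```python
# Hypothetical turn layout; the tokenizer does not necessarily prescribe this format.
conversation = (
    "<|system|>You are a helpful assistant.<|eos|>"
    "<|user|>你好!<|eos|>"
    "<|assistant|>"
)
ids = tokenizer(conversation)["input_ids"]

# Each special token maps to a single vocabulary ID.
print(tokenizer.convert_tokens_to_ids("<|user|>"))
```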


## 🔖 License

This tokenizer and vocabulary are released under the Apache License 2.0. You are free to use, modify, and redistribute it under the same license terms.


## 📚 Citation

If you use QiTianTokenizer in your research or project, please cite it as:

```bibtex
@misc{QiTianTokenizer,
  title  = {QiTianTokenizer: A Universal Multilingual Tokenizer with Chinese–English Optimization},
  author = {Morton Li},
  year   = {2025},
}
```