# QiTianTokenizer-Large

QiTianTokenizer is a universal multilingual tokenizer primarily optimized for Chinese–English mixed text,
offering consistent and reversible tokenization across diverse languages and scripts.
It is designed as a general-purpose tokenizer: it is not tied to any specific model
and is fully compatible with the 🤗 Transformers ecosystem.


## ✨ Overview

| Property | Value |
|---|---|
| Name | QiTianTokenizer-Large |
| Type | Tokenizer-only repository |
| Purpose | General multilingual tokenization |
| Primary Languages | Chinese, English |
| Extended Support | Multilingual (Unicode-complete) |
| Architecture | Byte-level BPE |
| Vocabulary Size | 96,000 tokens |
| Fast Implementation | ✅ Available (`QiTianTokenizerFast`) |
| Framework | 🤗 transformers |
| License | Apache 2.0 |

## 🧩 QiTian Tokenizer Series

| Variant | Vocabulary Size | Description | Recommended Use |
|---|---|---|---|
| QiTianTokenizer-Tiny | 12k | Lightweight tokenizer designed for compact or embedded models. | On-device or low-resource tasks |
| QiTianTokenizer-Base | 32k | Balanced vocabulary offering solid coverage and efficiency for most multilingual use cases. | Recommended for general use |
| QiTianTokenizer-Medium | 64k | Balanced language coverage: broad enough to capture fine-grained linguistic diversity while keeping model complexity reasonable. | Recommended for multilingual and high-quality general-purpose models |
| QiTianTokenizer-Large | 96k | Extended multilingual vocabulary designed for diverse cross-lingual pretraining and high-capacity language models. | High-resource training |
| QiTianTokenizer-XLarge | 128k | Full-script and domain-extensive vocabulary for comprehensive multilingual modeling. | Research & large-scale pretraining |

All variants share consistent token definitions, special tokens, and compatible configurations.
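
If the other variants are published under matching repository IDs (an assumption based on this card's naming), you can check the shared definitions directly. A minimal sketch:

```python
from transformers import AutoTokenizer

# The -Base repo ID follows the naming pattern of this card (an assumption).
large = AutoTokenizer.from_pretrained("Morton-Li/QiTianTokenizer-Large", trust_remote_code=True)
base = AutoTokenizer.from_pretrained("Morton-Li/QiTianTokenizer-Base", trust_remote_code=True)

# Shared special-token definitions should agree across variants.
assert large.special_tokens_map == base.special_tokens_map
```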


## ⚙️ Usage

You can load this tokenizer directly with `AutoTokenizer`:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Morton-Li/QiTianTokenizer-Large", trust_remote_code=True)

# Example
text = "你好,QiTian!"
tokens = tokenizer(text)
print(tokens["input_ids"])
```

## ➕ Batch Example

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Morton-Li/QiTianTokenizer-Large", trust_remote_code=True)

# Example
texts = ["Hello, 世界!", "QiTian is multilingual."]
batch_tokens = tokenizer(texts, padding=True, return_tensors="pt")
print(batch_tokens["input_ids"])
```
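
When batching variable-length inputs, the returned `attention_mask` separates real tokens from `<|pad|>` positions, and truncation bounds sequence length. A small sketch continuing from the batch example (the `max_length` value here is an arbitrary illustration, not a limit of this tokenizer):

```python
# Padding yields an attention_mask; truncation caps long sequences.
# max_length=16 is arbitrary for illustration, not a tokenizer limit.
batch = tokenizer(texts, padding=True, truncation=True, max_length=16, return_tensors="pt")
print(batch["attention_mask"])  # 1 = real token, 0 = padding
```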

## 📦 Files Included

| File | Description |
|---|---|
| `tokenizer.json` | Serialized fast tokenizer definition |
| `tokenizer_config.json` | Configuration (max length, padding side, etc.) |
| `special_tokens_map.json` | Special token mappings |
| `tokenizer.py` | Tokenizer implementation |

## 🔍 Special Tokens

| Token | Purpose |
|---|---|
| `<\|bos\|>` | Beginning of sequence (BOS) |
| `<\|eos\|>` | End of sequence (EOS) |
| `<\|pad\|>` | Padding token for batch alignment |
| `<\|mask\|>` | Masked token for MLM-style objectives |
| `<\|user\|>` | Marks a user message boundary in conversational data |
| `<\|assistant\|>` | Marks an assistant message boundary |
| `<\|system\|>` | Defines system or meta-instruction context |
| `<\|think\|>` | Reasoning-phase delimiter; marks the model's internal reasoning or structured thinking segment during inference |

All tokens are integrated into the tokenizer vocabulary and appear in `additional_special_tokens`.
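
If you format conversational data yourself, the role tokens can delimit turns. The layout below is a hypothetical illustration, not an official chat template shipped with this tokenizer; it continues from the usage snippet above:

```python
# Hypothetical turn layout; the tokenizer does not necessarily prescribe this format.
conversation = (
    "<|system|>You are a helpful assistant.<|eos|>"
    "<|user|>你好!<|eos|>"
    "<|assistant|>"
)
ids = tokenizer(conversation)["input_ids"]

# Each special token maps to a single vocabulary ID.
print(tokenizer.convert_tokens_to_ids("<|user|>"))
```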


## 🔖 License

This tokenizer and vocabulary are released under the Apache License 2.0. You are free to use, modify, and redistribute it under the same license terms.


## 📚 Citation

If you use QiTianTokenizer in your research or project, please cite it as:

```bibtex
@misc{QiTianTokenizer,
  title  = {QiTianTokenizer: A Universal Multilingual Tokenizer with Chinese–English Optimization},
  author = {Morton Li},
  year   = {2025},
}
```