Tiktoken and interaction with Transformers
🤗 Transformers supports tiktoken model files: when from_pretrained loads from a Hub repository that contains a tokenizer.model file in tiktoken format, the file is automatically converted into our fast tokenizer.
Known models that were released with a tiktoken tokenizer.model file:
- gpt2
- llama3
Example usage
To load tiktoken files in transformers, make sure the tokenizer.model file is a tiktoken file; it is then loaded automatically by from_pretrained. Here is how to load the tokenizer; the model can be loaded from the same repository:
from transformers import AutoTokenizer
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id, subfolder="original")
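One way to check that a local tokenizer.model is a tiktoken file before loading it is sketched below. It assumes the plain-text layout used by tiktoken BPE files, where each line holds a base64-encoded token followed by its integer rank. The helper name looks_like_tiktoken_file and the max_lines parameter are hypothetical and not part of transformers or tiktoken; this is a heuristic, not the check the library performs.

```python
import base64
import binascii
from pathlib import Path


def looks_like_tiktoken_file(path, max_lines=100):
    """Heuristic (hypothetical helper): True if the first non-empty lines
    all look like '<base64 token> <integer rank>' pairs."""
    checked = 0
    with Path(path).open("rb") as f:
        for raw in f:
            line = raw.strip()
            if not line:
                continue  # ignore blank lines
            parts = line.split()
            if len(parts) != 2:
                return False
            token_b64, rank = parts
            try:
                base64.b64decode(token_b64, validate=True)  # token bytes
                int(rank)  # merge rank / token id
            except (binascii.Error, ValueError):
                return False
            checked += 1
            if checked >= max_lines:
                break
    # An empty file is not a usable tiktoken file
    return checked > 0
```

A file in another format, such as a binary SentencePiece model, would normally fail this check on its first lines, which makes the sketch useful as a quick guard before calling from_pretrained.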
Create tiktoken tokenizer
The tokenizer.model file contains no information about additional tokens or pattern strings. If these are important, convert the tokenizer to tokenizer.json, the appropriate format for PreTrainedTokenizerFast.
Generate the tokenizer.model file with tiktoken.get_encoding and then convert it to tokenizer.json with convert_tiktoken_to_fast.
from transformers.integrations.tiktoken import convert_tiktoken_to_fast
from tiktoken import get_encoding
# You can load your custom encoding or the one provided by OpenAI
encoding = get_encoding("gpt2")
convert_tiktoken_to_fast(encoding, "config/save/dir")
The resulting tokenizer.json file is saved to the specified directory and can be loaded with PreTrainedTokenizerFast.
from transformers import PreTrainedTokenizerFast
tokenizer = PreTrainedTokenizerFast.from_pretrained("config/save/dir")