TODO: we probably need to add the spaces to the end of the vocab, even though they are useless.
This is https://huggingface.co/NeelNanda/gpt-neox-tokenizer-digits but with a fix that makes it work on tokenizers == 0.14 after a breaking change involving added_tokens: https://github.com/huggingface/tokenizers/issues/1358
The two changes from NeelNanda/gpt-neox-tokenizer-digits are:
- (Important) we remove the space tokens from the "added_tokens" key in tokenizer.json here: https://huggingface.co/ArthurConmy/alternative-neel-tokenizer/blob/main/tokenizer.json . These caused the breakage, together with the tokenizers change linked above.
- (Not important) we use GPTNeoXTokenizer rather than PretrainedTokenizerFast in tokenizer_config.json, as this seemed to match what GPT-NeoX did.
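The first change can be sketched roughly as below. This is a minimal illustration, not the exact script used for this repo: the helper name `strip_whitespace_added_tokens` is hypothetical, the toy ids are made up, and it assumes a "space token" means an entry in "added_tokens" whose content is only whitespace.

```python
import json


def strip_whitespace_added_tokens(tokenizer_json: dict) -> dict:
    """Return a copy of a tokenizer.json dict without whitespace-only added tokens."""
    cleaned = dict(tokenizer_json)  # shallow copy; the "added_tokens" list is rebuilt below
    cleaned["added_tokens"] = [
        tok
        for tok in tokenizer_json.get("added_tokens", [])
        # Assumption: a "space token" is one whose content is whitespace only.
        if tok["content"].strip() != ""
    ]
    return cleaned


# Toy stand-in for the parsed contents of tokenizer.json (ids are hypothetical).
example = {
    "added_tokens": [
        {"id": 0, "content": "<|PAD|>", "special": True},
        {"id": 5, "content": "  ", "special": False},
        {"id": 6, "content": "    ", "special": False},
    ],
    "model": {"vocab": {}},
}

cleaned = strip_whitespace_added_tokens(example)
print(json.dumps(cleaned["added_tokens"]))
```

On a real file you would `json.load` tokenizer.json, apply the helper, and `json.dump` the result back. The sketch only touches "added_tokens" and leaves the main vocab alone.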
Neel's README
This is a fork of the GPT NeoX 20B tokenizer, edited to split every numerical digit into a separate token. This has the goal of making it easier for the model to learn arithmetic capabilities and to hopefully be more interpretable, and copies the idea from the PaLM tokenizer.
This was done, extremely hackily, by just removing every token that contained "\d\d" (e.g. "2013"). All remaining digit-containing tokens are "0" ... "9" and " 0" ... " 9".
This comes at the cost of making modelling normal text harder, since e.g. dates like 2013, which naturally should be a single token, are now 2|0|1|3.
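The digit filtering described above might look something like the following. This is a hedged sketch on a toy vocabulary, not the original script: the function name `drop_multi_digit_tokens` is hypothetical, renumbering the surviving ids is an assumption, and a real BPE tokenizer would also need its merges filtered, which is not shown here.

```python
import re
from typing import Dict

# Matches any token containing two consecutive digits, e.g. "2013" or " 19".
MULTI_DIGIT = re.compile(r"\d\d")


def drop_multi_digit_tokens(vocab: Dict[str, int]) -> Dict[str, int]:
    """Remove tokens with two adjacent digits and renumber the rest in order."""
    ordered = sorted(vocab.items(), key=lambda kv: kv[1])  # keep original id order
    kept = [tok for tok, _ in ordered if not MULTI_DIGIT.search(tok)]
    # Assumption: surviving tokens get contiguous new ids.
    return {tok: new_id for new_id, tok in enumerate(kept)}


# Toy vocabulary for illustration only.
toy_vocab = {"the": 0, "2013": 1, "7": 2, " 7": 3, " 19": 4, "x42": 5}
filtered = drop_multi_digit_tokens(toy_vocab)
print(filtered)
```

With only single-digit tokens left, a string like "2013" can no longer be a single token and has to be built from "2", "0", "1" and "3".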
This has a reduced vocab size of 48252 (several of the tokens towards the end are special whitespace tokens copied in from GPT-NeoX to make tokenizing code easier - some of these are duplicated in the vocabulary and thus may not actually show up at train time).
It includes a padding token (<|PAD|>), an End-Of-String token (<|EOS|>), and a Beginning-Of-String token (<|BOS|>).