Is the code for building the tokenizer open sourced?
#3 · by Akirami · opened
I want to know how the tokenizer was built and, if possible, the whole training process.
I'm almost certain it's a variation of o200k tokenization.
I thought they developed it from the Llama tokenizer.
It is a vanilla sentencepiece tokenizer trained on a subset of our training data. No fancy stuff.
Cool, thanks for letting me know.
Akirami
changed discussion status to closed