Is the code for building the tokenizer open sourced?
#3 · by Akirami · opened
I want to know how the tokenizer was built and, if possible, the whole training process.
I'm almost certain it's a variation of o200k tokenization.
I thought they developed it from the Llama tokenizer.
It is a vanilla sentencepiece tokenizer trained on a subset of our training data. No fancy stuff.
Cool, thanks for letting me know.
Akirami
changed discussion status to closed