
Is it possible to extend tokens to models?

#8
by badrabbitt - opened

Can I extend this model's tokenizer vocabulary so that it supports languages beyond English?

Technology Innovation Institute org

Hi there!

We use the same tokenizer as Falcon-7B/11B, which supports English (en), German (de), Spanish (es), French (fr), Italian (it), Dutch (nl), Polish (pl), Portuguese (pt), Czech (cs), Romanian (ro) and Swedish (sv).

For the languages above, you can simply continue pretraining on data in the target language to enable multilingual capabilities. For other languages, you may need to extend the vocabulary to cover the target language first; Chinese-LLaMA is a simple example of this approach.
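
To make the vocabulary extension step a bit more concrete, here is a minimal, library-free sketch of the idea: new tokens get new ids at the end of the vocabulary, and the embedding table gets one new row per added token. The function name `extend_vocab`, the toy vocabulary, and the mean-plus-noise initialization are all assumptions for illustration (a common heuristic, not necessarily what Chinese-LLaMA or our training setup does). With Transformers, the corresponding calls are typically `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`.

```python
import random


def extend_vocab(vocab, embeddings, new_tokens, seed=0, noise=0.01):
    """Return copies of (vocab, embeddings) extended with unseen tokens.

    vocab:      dict mapping token -> row index in `embeddings`
    embeddings: list of rows, each a list of floats of equal length
    new_tokens: iterable of candidate tokens; ones already in vocab are skipped

    Hypothetical helper for illustration only. New rows start at the mean of
    the ORIGINAL rows plus small Gaussian noise (an assumed heuristic).
    """
    rng = random.Random(seed)
    vocab = dict(vocab)                          # do not mutate the inputs
    embeddings = [row[:] for row in embeddings]
    dim = len(embeddings[0])

    # Mean of the original embedding rows, computed once before appending.
    mean = [sum(row[d] for row in embeddings) / len(embeddings) for d in range(dim)]

    for tok in new_tokens:
        if tok in vocab:
            continue                             # existing tokens keep their id
        vocab[tok] = len(embeddings)             # next free id
        embeddings.append([m + rng.gauss(0.0, noise) for m in mean])
    return vocab, embeddings


# Toy example: a 3-token vocabulary with 2-dimensional embeddings.
vocab = {"<unk>": 0, "hello": 1, "world": 2}
emb = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

new_vocab, new_emb = extend_vocab(vocab, emb, ["你好", "世界", "hello"])
print(len(new_vocab), len(new_emb))  # 5 5 ("hello" was already present)
```

The new rows are untrained, so the extended model still needs continued pretraining on target-language data before the added tokens become useful.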

We will put more details in our technical report.

Stay tuned! :)
