tiiuae/falcon-mamba-7b · Is it possible to extend the model's vocabulary with new tokens?

Aug 15, 2024

Can I extend this model's token vocabulary so that it supports languages beyond English?

Technology Innovation Institute org Sep 1, 2024

Hi there!

Basically, we are using the same tokenizer as Falcon-7B/11B, which supports English (en), German (de), Spanish (es), French (fr), Italian (it), Dutch (nl), Polish (pl), Portuguese (pt), Czech (cs), Romanian (ro) and Swedish (sv).

For the above languages, you can simply continue the pretraining to enable multilingual capabilities. For other languages, you may need to extend the vocabulary to cover the target language. A simple example of this approach is Chinese-LLaMA.
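For anyone who wants to try the vocabulary extension, here is a minimal sketch of one common way to do it with the transformers library. It is not necessarily the recipe Chinese-LLaMA or the Falcon team used: as far as I know, Chinese-LLaMA trained a separate tokenizer and merged it, while this sketch takes the simpler route of `add_tokens` plus `resize_token_embeddings`. The helper names `find_new_tokens` and `extend_vocabulary` and the candidate token list are made up for illustration.

```python
from typing import Dict, Iterable, List


def find_new_tokens(candidates: Iterable[str], vocab: Dict[str, int]) -> List[str]:
    """Return candidates missing from vocab, de-duplicated, in input order.

    This is only a rough pre-filter: byte-level BPE vocabularies store tokens
    in an encoded form, and add_tokens() skips existing tokens on its own.
    """
    seen = set()
    new_tokens = []
    for tok in candidates:
        if tok and tok not in vocab and tok not in seen:
            seen.add(tok)
            new_tokens.append(tok)
    return new_tokens


def extend_vocabulary(model_id: str, candidates: Iterable[str]):
    """Hypothetical helper: add tokens to the tokenizer and resize the model."""
    # Imported lazily so the helper above works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    new_tokens = find_new_tokens(candidates, tokenizer.get_vocab())
    num_added = tokenizer.add_tokens(new_tokens)

    # Resize the model's token embeddings to match the new vocabulary size.
    model.resize_token_embeddings(len(tokenizer))
    return tokenizer, model, num_added


if __name__ == "__main__":
    # The candidate tokens are placeholders; in practice they would come from
    # a tokenizer trained on a corpus in the target language.
    tokenizer, model, n = extend_vocabulary("tiiuae/falcon-mamba-7b", ["你好", "世界"])
    print(f"added {n} tokens; vocab size is now {len(tokenizer)}")
```

The embedding rows for the added tokens start out untrained, so the model still needs continued pretraining on target-language text before the new tokens become useful.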

We will put more details in our technical report.

Stay tuned! :)

badrabbitt

Sep 14, 2024

Thanksss