How to convert without hitting tokeniser errors?
#2 opened by anujn
Hey Michael, first thank you for all your converted models and HF hub tool. Both are super super helpful!!
I have been trying to convert the following model to CT2 for inference: "ehartford/WizardLM-13B-V1.0-Uncensored".
I keep hitting vocab size mismatch errors. I think it may be because CT2 is adding an unk token? Not sure, I'm very new to CT2!
Just wondered if you could point me in the right direction or give me the steps to convert this particular model to CT2 for inference.
Thanks again!
Anuj
Please open an issue at CTranslate2 :) https://github.com/OpenNMT/CTranslate2/blob/9885fad95f8ce24809d1ab64b418ac9f99c75562/python/ctranslate2/converters/transformers.py#L1182
You might be able to add:
```python
def get_vocabulary(self, model, tokenizer):
    tokens = super().get_vocabulary(model, tokenizer)
    # Pad the tokenizer vocabulary up to the model's vocab size with
    # placeholder tokens (fix for additional vocab, see the GPT-NeoX converter).
    extra_ids = model.config.vocab_size - len(tokens)
    for i in range(extra_ids):
        tokens.append("<extra_id_%d>" % i)
    return tokens
```
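For reference, a rough sketch of how that override could be applied without editing the installed package, by monkey-patching the loader before converting. This assumes the Llama loader class in transformers.py is named LlamaLoader and uses an arbitrary example output directory; double-check the linked file for the exact class name.

```python
# Rough sketch (untested): monkey-patch the Llama loader's get_vocabulary so the
# tokenizer vocabulary is padded up to model.config.vocab_size, then convert.
# LlamaLoader is an assumption -- verify the class name in the linked file.
from ctranslate2.converters import TransformersConverter
from ctranslate2.converters import transformers as ct2_transformers

_original_get_vocabulary = ct2_transformers.LlamaLoader.get_vocabulary

def _patched_get_vocabulary(self, model, tokenizer):
    tokens = _original_get_vocabulary(self, model, tokenizer)
    # Append placeholder tokens for the missing ids (same trick as GPT-NeoX).
    extra_ids = model.config.vocab_size - len(tokens)
    for i in range(extra_ids):
        tokens.append("<extra_id_%d>" % i)
    return tokens

ct2_transformers.LlamaLoader.get_vocabulary = _patched_get_vocabulary

# "WizardLM-13B-V1.0-Uncensored-ct2" is just an example output directory name.
converter = TransformersConverter("ehartford/WizardLM-13B-V1.0-Uncensored")
converter.convert("WizardLM-13B-V1.0-Uncensored-ct2", quantization="float16")
```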
michaelfeil changed discussion status to closed