How to convert without hitting tokeniser errors?
#2 opened by anujn
Hey Michael, first thank you for all your converted models and HF hub tool. Both are super super helpful!!
I have been trying to convert the following model to CT2 for inference: "ehartford/WizardLM-13B-V1.0-Uncensored".
I keep hitting vocab size mismatch errors. I think it may be because CT2 is adding an unk token? Not sure, I'm very new to CT2!
Just wondered if you could point me in the right direction or give me the steps to convert this particular model to CT2 for inference.
Thanks again!
Anuj
Please open an issue at CTranslate2 :) https://github.com/OpenNMT/CTranslate2/blob/9885fad95f8ce24809d1ab64b418ac9f99c75562/python/ctranslate2/converters/transformers.py#L1182
You might be able to add:
```python
def get_vocabulary(self, model, tokenizer):
    tokens = super().get_vocabulary(model, tokenizer)
    # Pad the tokenizer vocabulary up to the model's vocab size with
    # placeholder tokens (fix for additional vocab, see the GPT-NeoX converter).
    extra_ids = model.config.vocab_size - len(tokens)
    for i in range(extra_ids):
        tokens.append("<extra_id_%d>" % i)
    return tokens
```
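For reference, a rough sketch of how that override could be applied without editing the installed package, by monkey-patching the loader before converting. This assumes the Llama loader class in transformers.py is named LlamaLoader and uses an arbitrary example output directory; double-check the linked file for the exact class name.

```python
# Rough sketch (untested): monkey-patch the Llama loader's get_vocabulary so the
# tokenizer vocabulary is padded up to model.config.vocab_size, then convert.
# LlamaLoader is an assumption -- verify the class name in the linked file.
from ctranslate2.converters import TransformersConverter
from ctranslate2.converters import transformers as ct2_transformers

_original_get_vocabulary = ct2_transformers.LlamaLoader.get_vocabulary

def _patched_get_vocabulary(self, model, tokenizer):
    tokens = _original_get_vocabulary(self, model, tokenizer)
    # Append placeholder tokens for the missing ids (same trick as GPT-NeoX).
    extra_ids = model.config.vocab_size - len(tokens)
    for i in range(extra_ids):
        tokens.append("<extra_id_%d>" % i)
    return tokens

ct2_transformers.LlamaLoader.get_vocabulary = _patched_get_vocabulary

# "WizardLM-13B-V1.0-Uncensored-ct2" is just an example output directory name.
converter = TransformersConverter("ehartford/WizardLM-13B-V1.0-Uncensored")
converter.convert("WizardLM-13B-V1.0-Uncensored-ct2", quantization="float16")
```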
michaelfeil changed discussion status to closed