Getting error while loading the tokenizer after fine-tuning
Hi, after fine-tuning the "ai4bharat/indictrans2-en-indic-1B" model, I tried to load the tokenizer with

```python
tokenizer = AutoTokenizer.from_pretrained(finetuned_model_dir, trust_remote_code=True)
```

but I get the following error from `tokenization_indictrans.py`, line 120:

```
TypeError: transformers.tokenization_utils.PreTrainedTokenizer.__init__() got multiple values for keyword argument 'src_vocab_file'
```
I’m assuming you didn’t modify the vocabulary or tokenizer and just used the existing tokenizer to preprocess your data and fine-tune the model.
If that's the case, you can load the tokenizer directly from the hub:

```python
tokenizer = AutoTokenizer.from_pretrained("ai4bharat/indictrans2-en-indic-1B", trust_remote_code=True)
```

This loads the original tokenizer, which should work as expected and remains fully compatible with your fine-tuned model, since fine-tuning did not change the vocabulary.
The error occurs because `tokenizer.save_pretrained` writes some additional fields to the saved tokenizer config, including the `src_vocab_file` argument mentioned above, and these fields are then passed in through `**kwargs` when loading. The original config doesn't contain these extra arguments, but the saved one does; since the tokenization script already passes the appropriate vocabulary paths explicitly, the same keyword argument ends up receiving two values.
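If you'd rather load from your fine-tuned directory anyway, a possible workaround (a minimal sketch, not an official fix) is to strip the conflicting keys from the saved `tokenizer_config.json` before calling `from_pretrained`. Only `src_vocab_file` is confirmed by the error message; `tgt_vocab_file` and the sample file names below are assumed for illustration:

```python
import json
import tempfile
from pathlib import Path

# Simulate a tokenizer_config.json written by save_pretrained that contains
# the duplicated field (key/file names other than 'src_vocab_file' are assumed).
tmpdir = Path(tempfile.mkdtemp())
config_path = tmpdir / "tokenizer_config.json"
config_path.write_text(json.dumps({
    "tokenizer_class": "IndicTransTokenizer",
    "src_vocab_file": "dict.SRC.json",
    "do_lower_case": False,
}))

# Drop the keys that the tokenization script already passes explicitly,
# so from_pretrained no longer receives them twice via **kwargs.
config = json.loads(config_path.read_text())
for key in ("src_vocab_file", "tgt_vocab_file"):
    config.pop(key, None)
config_path.write_text(json.dumps(config, indent=2))

cleaned = json.loads(config_path.read_text())
print("src_vocab_file" in cleaned)  # → False
```

After cleaning the config, `AutoTokenizer.from_pretrained(finetuned_model_dir, trust_remote_code=True)` should no longer hit the duplicate-keyword error. Back up the original file first in case your setup differs.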
It's working. Thanks a lot!