metadata

license: apache-2.0

donut-base-ascii

This is "naver-clova-ix/donut-base" with all non-ASCII tokens removed from the vocabulary. It is intended for basic English use cases where the text is primarily a-z, A-Z, 0-9 and basic punctuation.

The original model, "naver-clova-ix/donut-base", did not have a token for "1", so that has also been added. The notebook remove-donut-tokens.ipynb details the whole process.
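As a rough illustration of the filtering idea (not the notebook's actual code), the sketch below splits a token-to-id vocabulary into ASCII and non-ASCII tokens. It assumes a SentencePiece-style vocabulary in which the word-boundary marker `▁` (U+2581) should be ignored when deciding whether a token is ASCII; `is_ascii_token`, `split_vocab` and the toy vocabulary are hypothetical names and data made up for the example.

```python
# SentencePiece-style word-boundary marker. It is non-ASCII itself,
# so it is stripped before the check (an assumption of this sketch).
SPIECE_UNDERLINE = "\u2581"


def is_ascii_token(token: str) -> bool:
    """Return True if the token is pure ASCII once the boundary marker is ignored."""
    return token.replace(SPIECE_UNDERLINE, "").isascii()


def split_vocab(vocab):
    """Split a token -> id mapping into (kept, removed) dictionaries."""
    kept, removed = {}, {}
    for token, idx in vocab.items():
        target = kept if is_ascii_token(token) else removed
        target[token] = idx
    return kept, removed


# Toy vocabulary for illustration only; not taken from the real tokenizer.
toy_vocab = {"<s>": 0, "▁the": 1, "▁café": 2, "日本": 3, "7": 4, "▁": 5}
kept, removed = split_vocab(toy_vocab)
print(sorted(removed))  # tokens that would be dropped
```

With a Transformers tokenizer, `tokenizer.get_vocab()` returns a mapping of this shape. Removing tokens from a real model also means shrinking the matching embedding and output rows, which this sketch does not attempt.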

The model has not been trained any further than the original.

I made a whole video about it: https://youtu.be/Uzr553x1gdM

I did a quick generation speed test comparing this model against the default model, with and without `bad_words_ids`. The `bad_words_ids` list contained only 12k tokens rather than the full 30k that were removed here, and generation was still noticeably slower.
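For readers who want to try a similar comparison, here is a minimal sketch that is independent of the linked speed script. It builds a `bad_words_ids` list in the list-of-lists shape that `generate` accepts and times an arbitrary callable. The helper names, the token ids, and the variables in the commented usage (`model`, `pixel_values`, `decoder_input_ids`) are assumptions for illustration.

```python
import time


def build_bad_words_ids(token_ids):
    """Wrap each banned token id in its own list: one single-token sequence per id."""
    return [[int(i)] for i in sorted(set(token_ids))]


def mean_latency_ms(fn, repeats=5, warmup=1):
    """Average wall-clock time of fn() in milliseconds, after a few warm-up calls."""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) * 1000 / repeats


# Hypothetical token ids; in practice these would be the ids of the non-ASCII tokens.
banned = build_bad_words_ids([17, 3, 17, 42])

# Usage with Transformers, assuming model, pixel_values and decoder_input_ids exist:
# run = lambda: model.generate(pixel_values, decoder_input_ids=decoder_input_ids,
#                              max_new_tokens=10, bad_words_ids=banned)
# print(f"{mean_latency_ms(run):.0f} ms")
```

On a GPU, the timed callable should also synchronize the device before returning, otherwise the measurement may only capture kernel launch time.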

Speed script here
Launched with this

| Approach | Time to generate 10 tokens |
| --- | --- |
| "naver-clova-ix/donut-base" | 205ms |
| "naver-clova-ix/donut-base" + 12k `bad_words_ids` | 280ms |
| "donut-base-ascii" | 195ms |