---
license: apache-2.0
---

# donut-base-ascii

This is ["naver-clova-ix/donut-base"](https://huggingface.co/naver-clova-ix/donut-base) but with all non-ascii tokens removed. This means the model is good for basic English use cases where the text is primarily a-zA-Z0-9 and basic punctuation.


The original model, `"naver-clova-ix/donut-base"`, did not have a token for `"1"`, so one has been added. The notebook [remove-donut-tokens.ipynb](remove-donut-tokens.ipynb) details the whole process.
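
For a rough idea of the first step, this sketch (not the notebook itself) counts the non-ASCII tokens in the original tokenizer:

```python
from transformers import DonutProcessor

# Sketch of identifying non-ASCII tokens; the full removal procedure
# is in remove-donut-tokens.ipynb.
processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base")
vocab = processor.tokenizer.get_vocab()

def is_ascii_token(token: str) -> bool:
    # "▁" is the sentencepiece word-boundary marker, not real text.
    return token.replace("▁", "").isascii()

non_ascii = {tok: tok_id for tok, tok_id in vocab.items() if not is_ascii_token(tok)}
print(f"{len(non_ascii)} of {len(vocab)} tokens contain non-ASCII characters")
```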


The model has not received any additional training beyond the original.
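
It should load like any other Donut checkpoint. A minimal sketch, with `"donut-base-ascii"` standing in as a placeholder for the actual repo id:

```python
from transformers import DonutProcessor, VisionEncoderDecoderModel

# "donut-base-ascii" is a placeholder; substitute the real repo id.
processor = DonutProcessor.from_pretrained("donut-base-ascii")
model = VisionEncoderDecoderModel.from_pretrained("donut-base-ascii")

print(len(processor.tokenizer))  # smaller vocabulary than the original donut-base
```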

I made a whole video about it: https://youtu.be/Uzr553x1gdM


I ran a quick generation speed test against the default model, both as-is and with `bad_words_ids`. The `bad_words_ids` list contained only 12k tokens (versus the ~30k tokens removed here) and was still noticeably slower.
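
For context, the `bad_words_ids` route keeps the full vocabulary and bans token ids at generation time. A rough sketch of that approach (not the actual benchmark script, which is linked below):

```python
import torch
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base")
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base")

# Ban every token containing a non-ASCII character ("▁" is the
# sentencepiece word-boundary marker, so it doesn't count).
bad_words_ids = [
    [tok_id]
    for tok, tok_id in processor.tokenizer.get_vocab().items()
    if not tok.replace("▁", "").isascii()
]

image = Image.new("RGB", (1920, 2560), "white")  # placeholder input image
pixel_values = processor(image, return_tensors="pt").pixel_values
decoder_input_ids = processor.tokenizer(
    "<s>", add_special_tokens=False, return_tensors="pt"
).input_ids

with torch.no_grad():
    output = model.generate(
        pixel_values,
        decoder_input_ids=decoder_input_ids,
        bad_words_ids=bad_words_ids,
        max_new_tokens=10,
    )
```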

Speed script [here](speed_test.py)  
Launched with [this](run_speed_tests.sh)


| approach | time to generate 10 tokens |
| - | - |
| `"naver-clova-ix/donut-base"` | 205 ms |
| `"naver-clova-ix/donut-base"` + 12k `bad_words_ids` | 280 ms |
| `"donut-base-ascii"` | 195 ms |