```myocr.py``` is responsible for scraping all the writings of Mahatma Gandhi.


```data_preprocessing.py``` cleans the data and prepares a file that is ready to be fed into the GPT-2 fine-tuning pipeline. The code uses a threshold of 200: any paragraph with more than 200 token_ids is split in half, recursively, until each piece is at or below the threshold.
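
The recursive halving described above can be sketched roughly as follows. This is not the code from ```data_preprocessing.py```; it is a minimal illustration under a few assumptions: the function names `split_paragraph` and `whitespace_tokenize` are hypothetical, paragraphs are cut at the middle word boundary, and a whitespace tokenizer stands in for the GPT-2 tokenizer so the example runs without extra dependencies.

```python
from typing import Callable, List, Sequence

MAX_TOKENS = 200  # threshold mentioned in this README


def whitespace_tokenize(text: str) -> List[str]:
    # Stand-in tokenizer (assumption): the real pipeline counts GPT-2 token ids.
    return text.split()


def split_paragraph(
    text: str,
    max_tokens: int = MAX_TOKENS,
    tokenize: Callable[[str], Sequence] = whitespace_tokenize,
) -> List[str]:
    """Recursively halve a paragraph until each piece has <= max_tokens tokens."""
    words = text.split()
    # Stop when the piece is short enough, or when it cannot be halved further.
    if len(tokenize(text)) <= max_tokens or len(words) < 2:
        return [" ".join(words)] if words else []
    mid = len(words) // 2  # cut at the middle word boundary (assumption)
    left = " ".join(words[:mid])
    right = " ".join(words[mid:])
    return (
        split_paragraph(left, max_tokens, tokenize)
        + split_paragraph(right, max_tokens, tokenize)
    )


if __name__ == "__main__":
    paragraph = " ".join(f"w{i}" for i in range(450))
    pieces = split_paragraph(paragraph)
    print([len(p.split()) for p in pieces])
```

To count real GPT-2 tokens instead, one option would be to pass the `encode` method of a Hugging Face `transformers` GPT-2 tokenizer as the `tokenize` argument; whether the original script does this is not stated here.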