---
license: gpl
---

# Introduction

We scraped the complete collected works of Mohandas Karamchand Gandhi (aka Mahatma Gandhi) from [here](http://www.gandhiashramsevagram.org/gandhi-literature/collected-works-of-mahatma-gandhi-volume-1-to-98.php) and cleaned the text so that it contains only Gandhi's own writings, without footnotes, titles, and other extraneous text.


After cleaning, the corpus contains 755,468 sentences written by Gandhi.


We first fine-tuned GPT-2 for 1 epoch on the English corpus (after cleaning*) of [AI4Bharat](https://ai4bharat.iitm.ac.in/corpora).

Since the above dataset contains news about the Indian subcontinent, we expected this fine-tuning to familiarize the model with India-specific terms.

We then further fine-tuned this model on the sentences written by Gandhi (for 3 epochs).
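
For reference, a single fine-tuning stage with Hugging Face Transformers looks roughly like the sketch below. This is a minimal illustration, not the exact training script used here; the file name, output directory, and hyperparameters other than the epoch count are placeholders.

```python
# Minimal sketch of one causal-LM fine-tuning stage with Hugging Face Transformers.
# "gandhi_sentences.txt" and "gpt2-gandhi" are placeholder names.
from datasets import load_dataset
from transformers import (GPT2LMHeadModel, GPT2TokenizerFast,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token          # GPT-2 has no pad token by default
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Plain-text corpus, one sentence per line (e.g. the cleaned Gandhi sentences).
dataset = load_dataset("text", data_files={"train": "gandhi_sentences.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
args = TrainingArguments(output_dir="gpt2-gandhi",
                         num_train_epochs=3,
                         per_device_train_batch_size=8)

Trainer(model=model, args=args,
        train_dataset=tokenized, data_collator=collator).train()
```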

[Here](https://colab.research.google.com/drive/1NAUCRoIf3fXu2wGDjiRhIrRe8sB80h81?usp=sharing) is a Colab notebook with a working example.
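
Generating text from the fine-tuned checkpoint is a one-liner with the `pipeline` API; the sketch below is hypothetical, with `"gpt2-gandhi"` standing in for the actual model directory or Hub ID.

```python
# Hypothetical generation example; replace "gpt2-gandhi" with the real checkpoint path or Hub ID.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2-gandhi")
print(generator("Non-violence is", max_length=50)[0]["generated_text"])
```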

*Before cleaning, the corpus contained 54M sentences; after cleaning, 42M. We simply kept those English sentences that end with a full stop.
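
The full-stop filter amounts to something like the following sketch (file names are placeholders, not the actual pipeline):

```python
# Rough sketch of the cleaning rule: keep only sentences that end with a full stop.
with open("raw_sentences.txt", encoding="utf-8") as src, \
     open("clean_sentences.txt", "w", encoding="utf-8") as dst:
    for line in src:
        sentence = line.strip()
        if sentence.endswith("."):
            dst.write(sentence + "\n")
```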