# Introduction

We scraped the collected works of Mohandas Karamchand Gandhi (aka Mahatma Gandhi) from [here](http://www.gandhiashramsevagram.org/gandhi-literature/collected-works-of-mahatma-gandhi-volume-1-to-98.php) and cleaned the text so that it contains only Gandhi's own writings, without footnotes, titles, and other extraneous text.

We observed that, after cleaning, the corpus contains 755,468 sentences written by Gandhi.
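
A minimal sketch of how such a sentence count could be obtained (the splitting rule below is an illustration, not necessarily the one we used):

```python
import re

def count_sentences(text: str) -> int:
    """Count sentences by splitting on ., ! or ? followed by whitespace."""
    # A naive splitter for illustration; real corpora need a proper
    # sentence tokenizer (e.g. NLTK's punkt) to handle abbreviations,
    # initials, and quotations correctly.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return len(sentences)

sample = "Truth is God. Non-violence is the means. What is the end?"
print(count_sentences(sample))  # → 3
```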

We first fine-tuned GPT-2 for 1 epoch on the English corpus (after cleaning*) of [AI4Bharat](https://ai4bharat.iitm.ac.in/corpora).

Since that dataset contains news about the Indian subcontinent, we expected this first fine-tuning stage to familiarize the model with India-specific terms.

We then further fine-tuned this model on the sentences written by Gandhi.

*Before cleaning, the corpus had 54M sentences; after cleaning, 42M. We simply kept those English sentences that end with a full stop.
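
The full-stop filter described above could look roughly like this (a sketch; identifying which sentences are English is a separate step not shown here):

```python
def keep_full_stop_sentences(sentences):
    """Keep only sentences that end with a full stop, as described above."""
    # Strip surrounding whitespace first so trailing spaces don't hide the '.'
    return [s.strip() for s in sentences if s.strip().endswith(".")]

raw = [
    "I shall pass through this world but once.",
    "Is non-violence possible?",                      # dropped: ends with '?'
    "Strength does not come from physical capacity.  ",
]
print(keep_full_stop_sentences(raw))
```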