ritwikm committed on
Commit 3b98dbc · 1 Parent(s): bb630ca

updated README

  1. README.md +12 -0
README.md CHANGED
@@ -5,3 +5,15 @@ license: gpl
  # Introduction
 
  We scraped the collected works of Mohandas Karamchand Gandhi (aka Mahatma Gandhi) from [here](http://www.gandhiashramsevagram.org/gandhi-literature/collected-works-of-mahatma-gandhi-volume-1-to-98.php). We then cleaned the text so that it contains only Gandhi's own writings, without footnotes, titles, or other extraneous text.
+
+
+ After cleaning, we counted 755468 sentences written by Gandhi.
+
+
+ We first fine-tuned GPT-2 for 1 epoch on the English corpus (after cleaning*) of [Ai4Bharat](https://ai4bharat.iitm.ac.in/corpora).
+
+ Since this dataset contains news about the Indian subcontinent, we expected this fine-tuning to familiarize the model with India-specific terms.
+
+ We then further fine-tuned this model on the sentences written by Gandhi.
+
+ *Before cleaning, the corpus had 54M sentences; after cleaning, 42M. We simply kept the English sentences that end with a full stop.
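
The cleaning rule in the footnote (keep only sentences that end with a full stop) could be sketched roughly as below. This is a minimal illustration, not the script used for this commit: the function names `keep_full_stop_sentences` and `retention_rate` are hypothetical, and trimming surrounding whitespace before the check is an assumption the README does not state.

```python
def keep_full_stop_sentences(sentences):
    """Keep only sentences that end with a full stop.

    Assumption (not stated in the README): surrounding whitespace is
    trimmed before the check, and the trimmed sentence is what is kept.
    """
    kept = []
    for sentence in sentences:
        stripped = sentence.strip()
        if stripped.endswith("."):
            kept.append(stripped)
    return kept


def retention_rate(before, after):
    """Fraction of sentences that survive cleaning."""
    return after / before


# Hypothetical sample input, for illustration only.
sample = [
    "Gandhi returned to India in 1915.",
    "Breaking news!",
    "  The march began at Sabarmati.  ",
    "",
]
cleaned = keep_full_stop_sentences(sample)

# Counts reported in the footnote: 54M sentences before, 42M after.
rate = retention_rate(54_000_000, 42_000_000)
```

With the counts from the footnote, this rule retains roughly 78% of the sentences (42M of 54M).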