Introduction

We have scraped all the collected works of Mohandas Karamchand Gandhi (aka Mahatma Gandhi) from here. We cleaned the text so that it contains only Gandhi's own writing, without footnotes, titles, and other extraneous text.

We observed that, after cleaning, the corpus contains 755,468 sentences written by Gandhi.

We first fine-tuned GPT-2 for 1 epoch on the English corpus (after cleaning*) of AI4Bharat.

Since the above dataset contains news about the Indian subcontinent, we expected this fine-tuning to familiarize the model with India-specific terms.

We then further fine-tuned this model on the sentences written by Gandhi, for 3 epochs.

Here is the Colab link with a working example.

*Before cleaning, the corpus had 54M sentences; after cleaning, 42M. We simply kept those English sentences that end with a full stop.
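The full-stop filter described above can be sketched in a few lines. This is a minimal illustration, not the exact cleaning script; the corpus format (a list of sentence strings) and the function name are assumptions.

```python
# Hypothetical sketch of the cleaning step: keep only sentences that
# end with a full stop, ignoring trailing whitespace.
def keep_full_stop_sentences(sentences):
    """Return only the sentences whose stripped text ends with '.'."""
    return [s for s in sentences if s.strip().endswith(".")]

corpus = [
    "India must learn the lesson of self-reliance.",
    "Breaking: headline without terminal punctuation",
    "Non-violence is the greatest force at the disposal of mankind.",
]
print(keep_full_stop_sentences(corpus))  # keeps the 1st and 3rd sentences
```

A stricter pipeline might also drop non-English sentences before this step, as the note above implies; that language-identification stage is omitted here.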

Model size: 137M params (Safetensors; tensor types F32, U8).