---
license: mit
tags:
- generated_from_trainer
model-index:
- name: gpt2-shakespeare
  results: []
pipeline_tag: text-generation
---
# gpt2-shakespeare
This model is a fine-tuned version of [gpt2](https://huggingface.co/gpt2) on a [dataset](https://github.com/sadia-sust/dataset-finetune-gpt2) of Shakespeare's works.
It achieves the following results on the evaluation set:
- Loss: 2.5738
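For a causal language model trained with the Trainer, the reported loss is normally the mean per-token cross-entropy in nats, so it can be turned into a perplexity by exponentiation. The sketch below does that conversion under this assumption; the variable names are illustrative only.

```python
import math

# Evaluation loss reported above (assumed: mean per-token cross-entropy in nats).
eval_loss = 2.5738

# Perplexity is the exponential of the mean cross-entropy.
perplexity = math.exp(eval_loss)
print(f"perplexity: {perplexity:.2f}")
```

Under this assumption the loss corresponds to a perplexity of about 13.1. Note that this is measured per GPT-2 subword token, not per word.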
## Model description
This is the GPT-2 model fine-tuned on a text corpus built from Shakespeare's plays and sonnets.
## Intended uses & limitations
This model is intended for generating new text in Shakespeare's style. It is not suited to writing in the style of other authors.
## Datasets Description
The text corpus was built for fine-tuning the GPT-2 model. The books were downloaded from [Project Gutenberg](http://www.gutenberg.org/) as plain-text files.
A large corpus was needed so that the model could learn to write in Shakespeare's style.
The following books were used to build the corpus:
- Macbeth, word count: 38197
- THE TRAGEDY OF TITUS ANDRONICUS, word count: 40413
- King Richard II, word count: 48423
- Shakespeare's Tragedy of Romeo and Juliet, word count: 144935
- A MIDSUMMER NIGHT’S DREAM, word count: 36597
- ALL’S WELL THAT ENDS WELL, word count: 49363
- THE TRAGEDY OF HAMLET, PRINCE OF DENMARK, word count: 57471
- THE TRAGEDY OF JULIUS CAESAR, word count: 37391
- THE TRAGEDY OF KING LEAR, word count: 54101
- THE LIFE AND DEATH OF KING RICHARD III, word count: 55985
- Romeo and Juliet, word count: 51417
- Measure for Measure, word count: 62703
- Much Ado about Nothing, word count: 45577
- Othello, the Moor of Venice, word count: 53967
- THE WINTER’S TALE, word count: 52911
- The Comedy of Errors, word count: 43179
- The Merchant of Venice, word count: 45903
- The Taming of the Shrew, word count: 44777
- The Tempest, word count: 32323
- TWELFTH NIGHT: OR, WHAT YOU WILL, word count: 42907
- The Sonnets, word count: 39849
The corpus has a total of 1078389 word tokens.
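The card does not say how the word counts were computed. One simple possibility, shown here only as an assumption, is to count whitespace-separated tokens in each plain-text file. `count_words` and `corpus_word_count` are hypothetical helper names, and the paths passed to them would be placeholders for the downloaded book files.

```python
from pathlib import Path


def count_words(text: str) -> int:
    """Count whitespace-separated tokens (assumed definition of a word token)."""
    return len(text.split())


def corpus_word_count(paths) -> dict:
    """Map each file name to its word count; paths stand in for the book files."""
    return {Path(p).name: count_words(Path(p).read_text(encoding="utf-8")) for p in paths}


sample = "Shall I compare thee to a summer's day?"
print(count_words(sample))  # 8
```

Summing the values returned by `corpus_word_count` would give a corpus total comparable to the figure above, although a different tokenization rule would yield slightly different counts.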
## Datasets Preprocessing
- Header text was removed manually.
- Extra spaces and newlines were removed programmatically, using the `sent_tokenize()` function from the NLTK Python library.
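The programmatic step above can be sketched roughly as follows. This is not the author's script: `clean_text` is a hypothetical helper, and the whitespace-collapsing regex and the way sentences are rejoined are assumptions. The sentence splitter is passed in as a parameter and defaults to NLTK's `sent_tokenize`.

```python
import re


def clean_text(raw: str, sentence_splitter=None) -> str:
    """Split raw text into sentences and collapse extra spaces and newlines."""
    if sentence_splitter is None:
        # Imported lazily so the function can be exercised without NLTK installed.
        from nltk.tokenize import sent_tokenize
        sentence_splitter = sent_tokenize
    sentences = sentence_splitter(raw)
    # Collapse any run of whitespace (spaces, tabs, newlines) inside each sentence.
    cleaned = [re.sub(r"\s+", " ", s).strip() for s in sentences]
    # Drop empty results and rejoin with single spaces.
    return " ".join(s for s in cleaned if s)
```

Calling `clean_text(raw)` without a splitter requires NLTK and its Punkt tokenizer data to be installed.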
## Training and evaluation data
The training dataset has 880447 word tokens and the test dataset has 197913 word tokens.
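As a quick check of the split proportions, the two counts above can be turned into fractions. The snippet only does arithmetic on the numbers stated in this card.

```python
train_tokens = 880447
test_tokens = 197913

total = train_tokens + test_tokens
train_fraction = train_tokens / total
test_fraction = test_tokens / total

print(f"total: {total}, train: {train_fraction:.1%}, test: {test_fraction:.1%}")
```

This works out to roughly an 82/18 split. The two parts sum to 1078360, which is 29 fewer than the corpus total given earlier; the card does not explain the difference.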
## Training procedure
The model was trained with the `Trainer` API from the Transformers library.
### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 32
- eval_batch_size: 64
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 350
- num_epochs: 3
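The settings above map onto `TrainingArguments` keyword arguments roughly as shown below. This is a sketch, not the original training script: the output directory is an assumption, and the batch sizes are written as per-device values on the assumption that a single device was used.

```python
# Keyword names are TrainingArguments parameters; values come from the list above.
training_kwargs = {
    "output_dir": "gpt2-shakespeare",   # assumption: not stated in the card
    "learning_rate": 5e-5,
    "per_device_train_batch_size": 32,  # assumes a single device
    "per_device_eval_batch_size": 64,
    "seed": 42,
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
    "lr_scheduler_type": "linear",
    "warmup_steps": 350,
    "num_train_epochs": 3,
}


def build_training_args():
    """Create TrainingArguments from the settings above (needs transformers and PyTorch)."""
    from transformers import TrainingArguments
    return TrainingArguments(**training_kwargs)
```

The object returned by `build_training_args()` could then be passed to `Trainer` together with the model and the tokenized datasets.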
### Training results
| Training Loss | Epoch | Step | Validation Loss |
|:-------------:|:-----:|:----:|:---------------:|
| No log | 0.63 | 250 | 2.7133 |
| 2.8492 | 1.25 | 500 | 2.6239 |
| 2.8492 | 1.88 | 750 | 2.5851 |
| 2.3842 | 2.51 | 1000 | 2.5738 |
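The step and epoch columns imply roughly 398 optimizer steps per epoch (1000 / 2.51), or about 1195 steps over 3 epochs. Combined with the linear scheduler and 350 warmup steps listed under the hyperparameters, the learning rate at each evaluation step can be approximated as below. The total step count is an estimate rather than a logged value, and `linear_warmup_lr` is a hypothetical helper that mirrors the usual linear warmup and decay rule.

```python
def linear_warmup_lr(step: int, base_lr: float = 5e-5,
                     warmup_steps: int = 350, total_steps: int = 1195) -> float:
    """Learning rate under linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay: fall linearly from base_lr to 0 at total_steps (estimated).
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (250, 500, 750, 1000):
    print(step, f"{linear_warmup_lr(step):.2e}")
```

Under these assumptions the first evaluation at step 250 falls inside the warmup phase, while the later ones fall in the decay phase.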
## Sample Code Using Transformers Pipeline
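```python
from transformers import pipeline

# Load the fine-tuned model from a local directory, with the stock GPT-2 tokenizer.
story = pipeline('text-generation', model='./gpt2-shakespeare', tokenizer='gpt2', max_length=300)
print(story("how art thou")[0]["generated_text"])
```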
### Framework versions
- Transformers 4.26.1
- PyTorch 1.13.1+cu116
- Datasets 2.10.0
- Tokenizers 0.13.2