---
license: mit
tags:
- generated_from_trainer
model-index:
- name: gpt2-shakespeare
  results: []
pipeline_tag: text-generation
---


# gpt2-shakespeare

This model is a fine-tuned version of [gpt2](https://huggingface.co/gpt2) on a [dataset](https://github.com/sadia-sust/dataset-finetune-gpt2) of Shakespeare's works.
It achieves the following results on the evaluation set:
- Loss: 2.5738
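
If the reported loss is the mean per-token cross-entropy in nats, as is usual for causal language models (an assumption; the card does not state the units), it can be converted to perplexity by exponentiation. A minimal sketch of that conversion, which comes out at roughly 13.1:

```python
import math

# Evaluation loss reported above; assumed to be mean per-token cross-entropy in nats.
eval_loss = 2.5738

# Perplexity is the exponential of the mean cross-entropy.
perplexity = math.exp(eval_loss)
print(f"perplexity ~ {perplexity:.2f}")
```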

## Model description

This is a GPT-2 model fine-tuned on a text corpus of Shakespeare's plays and sonnets.

## Intended uses & limitations

This model is intended for writing novels in Shakespeare's style. It is not suited to writing in the style of other authors.

## Datasets Description

A text corpus was developed for fine-tuning the GPT-2 model. The books were downloaded from [Project Gutenberg](http://www.gutenberg.org/) as plain text files.
A large text corpus was needed to train the model to write in Shakespeare's style.


The following books were used to build the text corpus:

- Macbeth, word count: 38197
- THE TRAGEDY OF TITUS ANDRONICUS, word count: 40413
- King Richard II, word count: 48423
- Shakespeare's Tragedy of Romeo and Juliet, word count: 144935
- A MIDSUMMER NIGHT’S DREAM, word count: 36597
- ALL’S WELL THAT ENDS WELL, word count: 49363
- THE TRAGEDY OF HAMLET, PRINCE OF DENMARK, word count: 57471
- THE TRAGEDY OF JULIUS CAESAR, word count: 37391
- THE TRAGEDY OF KING LEAR, word count: 54101
- THE LIFE AND DEATH OF KING RICHARD III, word count: 55985
- Romeo and Juliet, word count: 51417
- Measure for Measure, word count: 62703
- Much Ado about Nothing, word count: 45577
- Othello, the Moor of Venice, word count: 53967
- THE WINTER’S TALE, word count: 52911
- The Comedy of Errors, word count: 43179
- The Merchant of Venice, word count: 45903
- The Taming of the Shrew, word count: 44777
- The Tempest, word count: 32323
- TWELFTH NIGHT: OR, WHAT YOU WILL, word count: 42907
- The Sonnets, word count: 39849

The corpus has a total of 1078389 word tokens.
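
The stated total can be checked by summing the per-book counts listed above. The short sketch below does only that arithmetic; the variable names are illustrative.

```python
# Per-book word counts, copied from the list above in the same order.
word_counts = [
    38197, 40413, 48423, 144935, 36597, 49363, 57471,
    37391, 54101, 55985, 51417, 62703, 45577, 53967,
    52911, 43179, 45903, 44777, 32323, 42907, 39849,
]

# Sum of all 21 entries; this should match the corpus total given above.
total = sum(word_counts)
print(len(word_counts), total)
```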

## Datasets Preprocessing

- Header text was removed manually.
- Extra spaces and newlines were removed programmatically using the `sent_tokenize()` function from the NLTK Python library.
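
The preprocessing script itself is not shown here, so the following is only a sketch of one way the second step might look: split the text into sentences, then collapse runs of whitespace inside each sentence. `clean_text` and its `split_sentences` parameter are hypothetical names introduced for illustration. NLTK's `sent_tokenize` needs the Punkt tokenizer data to be downloaded first (the resource name depends on the NLTK version).

```python
import re


def clean_text(raw_text, split_sentences=None):
    """Split text into sentences and collapse extra spaces and newlines.

    `clean_text` and `split_sentences` are illustrative names, not taken
    from the original preprocessing script.
    """
    if split_sentences is None:
        # Imported lazily so the helper can be tried with another splitter
        # when NLTK or its Punkt data is not installed.
        from nltk.tokenize import sent_tokenize
        split_sentences = sent_tokenize
    sentences = split_sentences(raw_text)
    # Replace any run of whitespace (spaces, tabs, newlines) with one space.
    cleaned = [re.sub(r"\s+", " ", s).strip() for s in sentences]
    # Drop sentences that are empty after cleaning; emit one sentence per line.
    return "\n".join(s for s in cleaned if s)
```

Calling `clean_text(raw_text)` with no second argument uses `sent_tokenize`; writing one sentence per line is a design choice of this sketch, not something the card specifies.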


## Training and evaluation data

The training dataset has 880447 word tokens and the test dataset has 197913 word tokens.
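
As a quick illustration, the split proportions implied by these two figures can be computed directly; they work out to roughly 82% training and 18% test. The variable names below are illustrative.

```python
# Word-token counts for each split, as stated above.
train_words = 880447
test_words = 197913

split_total = train_words + test_words
train_share = train_words / split_total
test_share = test_words / split_total
print(f"train: {train_share:.1%}, test: {test_share:.1%}")
```

The two figures sum to 1078360, which is 29 fewer than the corpus total stated earlier; the card does not explain the difference.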

## Training procedure

The model was trained with the `Trainer` API from the Transformers library.

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 32
- eval_batch_size: 64
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 350
- num_epochs: 3
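
A `linear` scheduler with warmup ramps the learning rate from 0 up to the base rate over the warmup steps, then decays it linearly to 0 by the end of training. The sketch below reproduces that shape in plain Python for the values listed above. `linear_lr` is an illustrative helper, and `total_steps=1200` is an assumed value, since the card does not state the total number of optimizer steps.

```python
def linear_lr(step, base_lr=5e-05, warmup_steps=350, total_steps=1200):
    """Learning rate at `step` for linear warmup followed by linear decay.

    total_steps=1200 is an assumed value, not taken from the model card.
    """
    if step < warmup_steps:
        # Warmup: scale linearly from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay: scale linearly from base_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Print the rate at a few points along the schedule.
for step in (0, 175, 350, 775, 1200):
    print(step, linear_lr(step))
```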

### Training results

| Training Loss | Epoch | Step | Validation Loss |
|:-------------:|:-----:|:----:|:---------------:|
| No log        | 0.63  | 250  | 2.7133          |
| 2.8492        | 1.25  | 500  | 2.6239          |
| 2.8492        | 1.88  | 750  | 2.5851          |
| 2.3842        | 2.51  | 1000 | 2.5738          |


## Sample Code Using Transformers Pipeline

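```python
from transformers import pipeline

# Load the fine-tuned model from a local directory, using the stock GPT-2 tokenizer.
story = pipeline("text-generation", model="./gpt2-shakespeare", tokenizer="gpt2", max_length=300)

# Generate a continuation of the prompt; the result is a list of dicts with a "generated_text" key.
story("how art thou")
```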


### Framework versions

- Transformers 4.26.1
- Pytorch 1.13.1+cu116
- Datasets 2.10.0
- Tokenizers 0.13.2