---
license: mit
tags:
- generated_from_trainer
model-index:
- name: gpt2-shakespeare
  results: []
pipeline_tag: text-generation
---

# gpt2-shakespeare

This model is a fine-tuned version of [gpt2](https://huggingface.co/gpt2) on a [dataset](https://github.com/sadia-sust/dataset-finetune-gpt2) of Shakespeare's books. It achieves the following results on the evaluation set:
- Loss: 2.5738

## Model description

The GPT-2 model is fine-tuned on a text corpus of Shakespeare's works.

## Intended uses & limitations

The intended use of this model is to write novels in Shakespeare's style. It is limited in its ability to write in other authors' styles.

## Dataset Description

A text corpus was developed for fine-tuning the GPT-2 model. The books were downloaded from [Project Gutenberg](http://www.gutenberg.org/) as plain-text files; a large corpus was needed for the model to learn to write in Shakespeare's style. The following books were used to build the corpus:

- Macbeth, word count: 38197
- THE TRAGEDY OF TITUS ANDRONICUS, word count: 40413
- King Richard II, word count: 48423
- Shakespeare's Tragedy of Romeo and Juliet, word count: 144935
- A MIDSUMMER NIGHT’S DREAM, word count: 36597
- ALL’S WELL THAT ENDS WELL, word count: 49363
- THE TRAGEDY OF HAMLET, PRINCE OF DENMARK, word count: 57471
- THE TRAGEDY OF JULIUS CAESAR, word count: 37391
- THE TRAGEDY OF KING LEAR, word count: 54101
- THE LIFE AND DEATH OF KING RICHARD III, word count: 55985
- Romeo and Juliet, word count: 51417
- Measure for Measure, word count: 62703
- Much Ado about Nothing, word count: 45577
- Othello, the Moor of Venice, word count: 53967
- THE WINTER’S TALE, word count: 52911
- The Comedy of Errors, word count: 43179
- The Merchant of Venice, word count: 45903
- The Taming of the Shrew, word count: 44777
- The Tempest, word count: 32323
- TWELFTH NIGHT: OR, WHAT YOU WILL, word count: 42907
- The Sonnets, word count: 39849

The corpus contains 1078389 word tokens in total.

## Dataset Preprocessing

- Header text was removed manually.
- Extra spaces and newlines were removed programmatically using the `sent_tokenize()` function from the NLTK Python library.

## Training and evaluation data

The training set contains 880447 word tokens and the test set contains 197913 word tokens.

## Training procedure

The model was trained with the `Trainer` API from the Hugging Face Transformers library; an illustrative sketch of a comparable setup is included at the end of this card.

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 32
- eval_batch_size: 64
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 350
- num_epochs: 3

### Training results

| Training Loss | Epoch | Step | Validation Loss |
|:-------------:|:-----:|:----:|:---------------:|
| No log        | 0.63  | 250  | 2.7133          |
| 2.8492        | 1.25  | 500  | 2.6239          |
| 2.8492        | 1.88  | 750  | 2.5851          |
| 2.3842        | 2.51  | 1000 | 2.5738          |

## Sample Code Using Transformers Pipeline

```python
from transformers import pipeline

# Load the fine-tuned model from a local checkpoint directory and pair it with the gpt2 tokenizer
story = pipeline('text-generation', model='./gpt2-shakespeare', tokenizer='gpt2', max_length=300)
story("how art thou")
```

### Framework versions

- Transformers 4.26.1
- Pytorch 1.13.1+cu116
- Datasets 2.10.0
- Tokenizers 0.13.2
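
## Fine-tuning Sketch

The exact training script is not included in this repository. The following is a minimal sketch of how a comparable fine-tuning run could be set up with the `Trainer` API and the hyperparameters listed above. The file names `train.txt` and `test.txt`, the 512-token truncation length, and the step-based evaluation schedule are illustrative assumptions, not the exact settings used for this model.

```python
from datasets import load_dataset
from transformers import (
    DataCollatorForLanguageModeling,
    GPT2LMHeadModel,
    GPT2TokenizerFast,
    Trainer,
    TrainingArguments,
)

# Base model and tokenizer; gpt2 has no pad token, so reuse the EOS token for padding
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Assumed file names: one plain-text file per split of the Shakespeare corpus
raw = load_dataset("text", data_files={"train": "train.txt", "test": "test.txt"})

def tokenize(batch):
    # 512-token truncation is an illustrative choice, not a documented setting
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

# Causal language modeling: the collator builds labels from the input ids
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# Hyperparameters mirror the values listed in "Training hyperparameters"
args = TrainingArguments(
    output_dir="gpt2-shakespeare",
    learning_rate=5e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    seed=42,
    lr_scheduler_type="linear",
    warmup_steps=350,
    num_train_epochs=3,
    evaluation_strategy="steps",
    eval_steps=250,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    data_collator=collator,
)

trainer.train()
trainer.save_model("gpt2-shakespeare")
```

The saved checkpoint directory can then be loaded with the pipeline snippet shown in the "Sample Code Using Transformers Pipeline" section above.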