List of texts included in the corpus:
Beautiful Stories of Shakespeare, As You Like It, Hamlet, Julius Caesar, King Lear II, King Richard II, Macbeth, A Midsummer Night's Dream, Othello, Shakespeare Roman Play, Shakespearean Text, Sonnets, Taming of the Shrew, The Tempest, The Tragedy of Romeo and Juliet
How many tokens are in each text and the total number of tokens in the corpus:
Beautiful Stories of Shakespeare 62,537; As You Like It 38,772; Hamlet 30,512; Julius Caesar 30,915; King Lear II 25,031; King Richard II 26,029; Macbeth 197,328; A Midsummer Night's Dream 248,172; Othello 197,328; Shakespeare Roman Play 24,582; Shakespearean Text 32,453; Sonnets 35,675; Taming of the Shrew 36,794; The Tempest 30,700; The Tragedy of Romeo and Juliet 37,623; Total: 1,054,451
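Per-text counts like those above can be computed by walking the corpus directory. The card does not state whether the counts are GPT-2 tokens or whitespace tokens, so this sketch approximates tokens by whitespace splitting; the "corpus/" directory name is also an assumption.

```python
# Sketch: per-text and total token counts for a corpus directory.
# Assumptions: files live in a local "corpus/" directory as .txt, and
# "tokens" are approximated by whitespace splitting (the card does not
# say which tokenizer produced the counts above).
from pathlib import Path

def count_tokens(text: str) -> int:
    """Approximate the token count by splitting on whitespace."""
    return len(text.split())

def corpus_counts(directory: str) -> dict:
    """Return {text name: token count} plus a 'Total' entry."""
    counts = {}
    for path in sorted(Path(directory).glob("*.txt")):
        counts[path.stem] = count_tokens(path.read_text(encoding="utf-8"))
    counts["Total"] = sum(counts.values())
    return counts
```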
How, when, and why the corpus was collected: The corpus was collected from Project Gutenberg (https://www.gutenberg.org/), a library of over 60,000 free ebooks. I collected 15 books of Shakespeare's work and combined them into one text file. The corpus was created on 18 Feb 2023 to build a dataset of over a million tokens so that a model could be fine-tuned on Shakespeare's work and generate text in his style.
How the text was pre-processed or tokenized: The text was preprocessed by removing all empty spaces. It was then combined into a single line, broken down into paragraphs, and merged into one text file that was used for training and validating the model.
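The preprocessing steps above can be sketched as follows. This is a minimal sketch, assuming that "removing all empty spaces" means dropping blank lines and collapsing whitespace, and that paragraphs were rebuilt at a fixed word length; neither detail is specified in the card.

```python
# Sketch of the described preprocessing: drop blank lines, collapse the
# text into a single line, then rebuild paragraphs for the combined file.
# The 100-words-per-paragraph cutoff is assumed, not stated in the card.
def preprocess(raw: str, paragraph_len: int = 100) -> str:
    # Keep only non-empty lines, with surrounding whitespace stripped.
    lines = [line.strip() for line in raw.splitlines() if line.strip()]
    # Combine everything into a single line.
    single_line = " ".join(lines)
    # Break the single line back into fixed-size paragraphs.
    words = single_line.split()
    paragraphs = [
        " ".join(words[i:i + paragraph_len])
        for i in range(0, len(words), paragraph_len)
    ]
    return "\n\n".join(paragraphs)
```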
Values of hyperparameters used during fine-tuning: max_length=768, tokenizer=GPT2, batch size=2, top_p=0.95, output max_length=200
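The top_p=0.95 setting controls nucleus sampling at generation time: only the smallest set of tokens whose cumulative probability reaches top_p is kept, and the next token is drawn from that set. A library-independent sketch of the filtering step (the toy token probabilities are illustrative only):

```python
import random

def top_p_filter(probs: dict, top_p: float = 0.95) -> dict:
    """Keep the smallest set of tokens whose cumulative probability
    reaches top_p (nucleus sampling), then renormalize."""
    kept, cumulative = {}, 0.0
    for token, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {token: p / total for token, p in kept.items()}

def sample_next_token(probs: dict, top_p: float = 0.95) -> str:
    """Draw one token from the nucleus-filtered distribution."""
    filtered = top_p_filter(probs, top_p)
    tokens, weights = zip(*filtered.items())
    return random.choices(tokens, weights=weights)[0]
```

With top_p=0.95, low-probability tokens outside the nucleus are never sampled, which keeps the generated Shakespearean text coherent while still allowing variety.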
Model Description: This model is fine-tuned on a corpus of Shakespeare's work to generate text in Shakespearean language.
Intended uses & limitations: This model can be used to generate text in the Shakespearean language.
How to use: This model can be downloaded from the Hugging Face Hub and run on Google Colab.
Training Data: This model was trained on a corpus of over a million tokens of Shakespeare's work, collected from 15 books by Shakespeare from Gutenberg.org.
Training Procedure: This model was fine-tuned on Google Colab using a GPU; training took about 15-20 minutes. To select a GPU, click Runtime, then Change runtime type, select GPU, and save. Then run the code cells in Colab.
Variables and metrics: The prompt given to the model to start a sentence was "The", and max_length was set to 300.
Evaluation results: The model's text generation results are satisfactory: it was able to generate reasonable text in Shakespearean language.