lighteternal committed
Commit 19cc467
1 Parent(s): 096d4f2

First model version

README.md ADDED
@@ -0,0 +1,79 @@
---
language:
- el
tags:
- pytorch
- causal-lm
widget:
- text: "Το αγαπημένο μου μέρος είναι"
license: apache-2.0
---

# Greek (el) GPT2 model

<img src="https://huggingface.co/lighteternal/gpt2-finetuned-greek-small/raw/main/GPT2el.png" width="600"/>

### By the Hellenic Army Academy (SSE) and the Technical University of Crete (TUC)

* language: el
* license: apache-2.0
* dataset: ~23.4 GB of Greek corpora
* model: GPT2 (12-layer, 768-hidden, 12-heads, 117M parameters; the OpenAI GPT-2 English model, finetuned for the Greek language)
* pre-processing: tokenization + BPE segmentation
* metrics: perplexity

### Model description

An autoregressive text-generation model built with Hugging Face Transformers and fastai, based on the English GPT-2.
It was finetuned with gradual layer unfreezing, a more efficient and sustainable alternative to training from scratch, especially for low-resource languages.
Based on the work of Thomas Dehaene (ML6) on a Dutch GPT-2: https://colab.research.google.com/drive/1Y31tjMkB8TqKKFlZ5OJ9fcMp3p8suvs4?usp=sharing

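For readers unfamiliar with the technique, the sketch below shows what gradual layer unfreezing means for GPT-2 in plain PyTorch/Transformers. It is illustrative only: the stage sizes and learning rates are assumptions, not the authors' fastai-based recipe.

```python
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")

def unfreeze_top_blocks(model, n_blocks):
    # Freeze every parameter, then unfreeze only the last n_blocks
    # transformer blocks plus the final layer norm.
    for p in model.parameters():
        p.requires_grad = False
    for block in model.transformer.h[-n_blocks:]:
        for p in block.parameters():
            p.requires_grad = True
    for p in model.transformer.ln_f.parameters():
        p.requires_grad = True

# Hypothetical schedule: start by training the top blocks, then
# progressively unfreeze more of the network at lower learning rates.
for n_blocks, lr in [(2, 5e-5), (6, 3e-5), (12, 1e-5)]:
    unfreeze_top_blocks(model, n_blocks)
    optimizer = torch.optim.AdamW(
        (p for p in model.parameters() if p.requires_grad), lr=lr)
    # ... run one finetuning stage here with your training loop / Trainer ...
```
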
### How to use

```python
from transformers import pipeline

model = "lighteternal/gpt2-finetuned-greek"

# device=0 runs on the first GPU; use device=-1 (or omit the argument) for CPU.
generator = pipeline('text-generation', model=model, tokenizer=model, device=0)

text = "Μια φορά κι έναν καιρό"  # "Once upon a time"

# Note: max_length is counted in tokens, so the word-based estimate below is only a rough cap.
outputs = generator(
    text,
    max_length=len(text.split(" ")) + 15,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    temperature=0.95,
    repetition_penalty=1.2,
    add_special_tokens=False,
    num_return_sequences=5,
)
print("\n".join(x["generated_text"] for x in outputs))
```

## Training data

We used a ~23.4 GB sample from a consolidated Greek corpus (CC100, Wikimatrix, Tatoeba, Books, SETIMES and GlobalVoices) containing long sequences.
This is an improved version of our GPT-2 small model (https://huggingface.co/lighteternal/gpt2-finetuned-greek-small).

## Metrics

| Metric          | Value |
| --------------- | ----- |
| Train Loss      | 3.67  |
| Validation Loss | 3.83  |
| Perplexity      | 39.12 |

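For context, perplexity is the exponential of the token-level cross-entropy loss, so the table can be sanity-checked directly from a loss value. A minimal check against the train loss above (the exact figure depends on the unrounded loss and on which split perplexity was measured):

```python
import math

# exp(3.67) ≈ 39.3, in line with the reported perplexity of 39.12.
print(math.exp(3.67))
```
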
### BibTeX entry and citation info

Based on the work of Thomas Dehaene (ML6): https://blog.ml6.eu/dutch-gpt2-autoregressive-language-modelling-on-a-budget-cff3942dd020

config.json ADDED
@@ -0,0 +1,37 @@
{
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "resid_pdrop": 0.1,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50,
      "top_k": 50,
      "repetition_penalty": 60.0,
      "add_special_tokens": false,
      "temperature": 0.95,
      "top_p": 0.95
    }
  },
  "vocab_size": 50257
}
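The `task_specific_params` block above supplies default generation settings for the `text-generation` pipeline; arguments passed at call time (e.g. `repetition_penalty=1.2` in the README example) take precedence over these stored defaults. A minimal way to inspect them:

```python
from transformers import AutoConfig

# Inspect the generation defaults shipped in this config.json.
config = AutoConfig.from_pretrained("lighteternal/gpt2-finetuned-greek")
print(config.task_specific_params["text-generation"])
```
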
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:bbeff8406c7bee780c84533d118071eb99b472b977bf64c24f90ae3218dd09ff
size 510405982
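The entry above is a Git LFS pointer, not the weights themselves; the actual ~510 MB checkpoint is downloaded on demand. A minimal sketch, assuming you want to fetch the raw file with `huggingface_hub` (loading via `from_pretrained` does this implicitly):

```python
from huggingface_hub import hf_hub_download

# Downloads the real weights file that the LFS pointer refers to.
path = hf_hub_download(repo_id="lighteternal/gpt2-finetuned-greek",
                       filename="pytorch_model.bin")
print(path)
```
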
special_tokens_map.json ADDED
@@ -0,0 +1 @@
{"bos_token": "<|endoftext|>", "eos_token": "<|endoftext|>", "unk_token": "<|endoftext|>", "pad_token": "<|endoftext|>"}
tokenizer_config.json ADDED
@@ -0,0 +1 @@
{"pad_token": "<|endoftext|>", "special_tokens_map_file": null, "full_tokenizer_file": null}
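As declared in special_tokens_map.json, every special-token role is mapped to GPT-2's single `<|endoftext|>` token. A minimal check, assuming the repo's tokenizer files load with `AutoTokenizer`:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("lighteternal/gpt2-finetuned-greek")
# bos, eos, unk and pad all resolve to the same <|endoftext|> token.
print(tokenizer.bos_token, tokenizer.eos_token, tokenizer.unk_token, tokenizer.pad_token)
```
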
vocab.json ADDED
The diff for this file is too large to render. See raw diff