---
language: pt

widget:
- text: "Quem era Jim Henson? Jim Henson era um"
- text: "Em um achado chocante, o cientista descobriu um"
- text: "Barack Hussein Obama II, nascido em 4 de agosto de 1961, é"
- text: "Corrida por vacina contra Covid-19 já tem"
license: mit
datasets:
- wikipedia
---

# GPorTuguese-2: a Language Model for Portuguese text generation (and more NLP tasks...)

## Introduction

GPorTuguese-2 (Portuguese GPT-2 small) is a state-of-the-art language model for Portuguese based on the GPT-2 small model.

It was trained on Portuguese Wikipedia using **Transfer Learning and Fine-tuning techniques** in just over a day, on a single NVIDIA V100 32GB GPU and with a little more than 1GB of training data.

It is a proof of concept that it is possible to obtain a state-of-the-art language model in any language with low resources.

It was fine-tuned from the [English pre-trained GPT-2 small](https://huggingface.co/gpt2) using the Hugging Face libraries (Transformers and Tokenizers) wrapped into the [fastai v2](https://dev.fast.ai/) Deep Learning framework. All the fastai v2 fine-tuning techniques were used.

It is now available on Hugging Face. For further information or requests, please see "[Faster than training from scratch — Fine-tuning the English GPT-2 in any language with Hugging Face and fastai v2 (practical case with Portuguese)](https://medium.com/@pierre_guillou/faster-than-training-from-scratch-fine-tuning-the-english-gpt-2-in-any-language-with-hugging-f2ec05c98787)".
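
The linked article walks through the full fastai v2 recipe. As a rough orientation only, here is a minimal sketch of the same transfer-learning idea using nothing but the Transformers `Trainer` API; it is not the author's pipeline, it keeps the English tokenizer for brevity (the actual work also adapts the Byte-level BPE tokenizer and embeddings to Portuguese), and `pt_wiki.txt` is a hypothetical placeholder for a plain-text Portuguese Wikipedia dump.

```python
# Minimal sketch (not the author's fastai v2 pipeline): fine-tune the English GPT-2 small
# on a Portuguese text file with the plain Transformers Trainer API.
# "pt_wiki.txt" is a hypothetical placeholder path.
from transformers import (AutoTokenizer, AutoModelWithLMHead, TextDataset,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")        # start from the English checkpoint
model = AutoModelWithLMHead.from_pretrained("gpt2")

train_dataset = TextDataset(tokenizer=tokenizer, file_path="pt_wiki.txt", block_size=128)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)  # causal LM, no masking

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gpt2-small-portuguese-sketch",
                           num_train_epochs=1,
                           per_device_train_batch_size=4),
    data_collator=data_collator,
    train_dataset=train_dataset,
)
trainer.train()
```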

## Model

| Model                   | #params | Model file (pt/tf) | Arch.       | Training/Validation data (text)          |
|-------------------------|---------|--------------------|-------------|------------------------------------------|
| `gpt2-small-portuguese` | 124M    | 487M / 475M        | GPT-2 small | Portuguese Wikipedia (1.28 GB / 0.32 GB) |
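
As a quick sanity check, the 124M parameter count in the table can be verified once the model is loaded (the loading code is detailed in the usage sections below):

```python
from transformers import AutoModelWithLMHead

model = AutoModelWithLMHead.from_pretrained("pierreguillou/gpt2-small-portuguese")
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.0f}M parameters")  # ~124M, the GPT-2 small architecture
```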

## Evaluation results

In a little more than a day (using a single NVIDIA V100 32GB GPU; with Distributed Data Parallel (DDP) training on just 2 GPUs, this time could have been divided by three, down to about 10 hours), we reached a loss of 3.17, an **accuracy of 37.99%** and a **perplexity of 23.76** (see the validation results table below).

| after ... epochs | loss | accuracy (%) | perplexity | time per epoch | cumulative time |
|------------------|------|--------------|------------|----------------|-----------------|
| 0                | 9.95 | 9.90         | 20950.94   | 00:00:00       | 00:00:00        |
| 1                | 3.64 | 32.52        | 38.12      | 5:48:31        | 5:48:31         |
| 2                | 3.30 | 36.29        | 27.16      | 5:38:18        | 11:26:49        |
| 3                | 3.21 | 37.46        | 24.71      | 6:20:51        | 17:47:40        |
| 4                | 3.19 | 37.74        | 24.21      | 6:06:29        | 23:54:09        |
| 5                | 3.17 | 37.99        | 23.76      | 6:16:22        | 30:10:31        |
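
The perplexity column is just the exponential of the validation loss (cross-entropy in nats), so the two figures can be checked against each other:

```python
import math

# perplexity = exp(cross-entropy loss): exp(3.17) ≈ 23.8, matching the reported 23.76
print(math.exp(3.17))
```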

## GPT-2

*Note: information copied/pasted from [Model: gpt2 >> GPT-2](https://huggingface.co/gpt2#gpt-2)*

Pretrained model on English language using a causal language modeling (CLM) objective. It was introduced in this [paper](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) and first released at this [page](https://openai.com/blog/better-language-models/) (February 14, 2019).

Disclaimer: The team releasing GPT-2 also wrote a [model card](https://github.com/openai/gpt-2/blob/master/model_card.md) for their model. Content from this model card has been written by the Hugging Face team to complete the information they provided and give specific examples of bias.

## Model description

*Note: information copied/pasted from [Model: gpt2 >> Model description](https://huggingface.co/gpt2#model-description)*

GPT-2 is a transformers model pretrained on a very large corpus of English data in a self-supervised fashion. This means it was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of publicly available data) with an automatic process to generate inputs and labels from those texts. More precisely, it was trained to guess the next word in sentences.

More precisely, inputs are sequences of continuous text of a certain length and the targets are the same sequence, shifted one token (word or piece of word) to the right. The model uses internally a mask-mechanism to make sure the predictions for the token `i` only uses the inputs from `1` to `i` but not the future tokens.

This way, the model learns an inner representation of the English language that can then be used to extract features useful for downstream tasks. The model is best at what it was pretrained for however, which is generating texts from a prompt.
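
As a toy illustration of this shifted-target objective (an addition for clarity, not part of the original gpt2 card), the causal LM loss can be written out explicitly by dropping the last prediction and the first label:

```python
import torch
import torch.nn.functional as F

# toy batch: 1 sequence of 4 token ids over a vocabulary of 5 tokens
logits = torch.randn(1, 4, 5)          # what a causal LM outputs at each position
labels = torch.tensor([[1, 2, 3, 4]])  # labels are simply the input ids

# position i predicts token i+1: drop the last logit and the first label before the loss
shift_logits = logits[:, :-1, :]
shift_labels = labels[:, 1:]
loss = F.cross_entropy(shift_logits.reshape(-1, 5), shift_labels.reshape(-1))
print(loss)
```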

## How to use GPorTuguese-2 with HuggingFace (PyTorch)

The following code uses PyTorch. For TensorFlow, see the corresponding section below.

### Load GPorTuguese-2 and its sub-word tokenizer (Byte-level BPE)

```python
from transformers import AutoTokenizer, AutoModelWithLMHead  # AutoModelForCausalLM in recent transformers versions
import torch

tokenizer = AutoTokenizer.from_pretrained("pierreguillou/gpt2-small-portuguese")
model = AutoModelWithLMHead.from_pretrained("pierreguillou/gpt2-small-portuguese")

# set the maximum sequence length to GPT-2 small's context size of 1024 tokens
tokenizer.model_max_length = 1024

model.eval()  # disable dropout (or leave in train mode to finetune)
```

### Generate one word

```python
# input sequence
text = "Quem era Jim Henson? Jim Henson era um"
inputs = tokenizer(text, return_tensors="pt")

# model output
outputs = model(**inputs, labels=inputs["input_ids"])
loss, logits = outputs[:2]
predicted_index = torch.argmax(logits[0, -1, :]).item()
predicted_text = tokenizer.decode([predicted_index])

# results
print('input text:', text)
print('predicted text:', predicted_text)

# input text: Quem era Jim Henson? Jim Henson era um
# predicted text: homem
```
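
Reusing `logits` and `tokenizer` from the block above, you can also look past the single argmax and inspect, say, the top 5 candidate next tokens (a small illustrative addition):

```python
# top-5 candidate next tokens and their (unnormalized) scores
top5 = torch.topk(logits[0, -1, :], k=5)
for score, token_id in zip(top5.values.tolist(), top5.indices.tolist()):
    print(tokenizer.decode([token_id]), round(score, 2))
```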

### Generate one full sequence

```python
# input sequence
text = "Quem era Jim Henson? Jim Henson era um"
inputs = tokenizer(text, return_tensors="pt")

# model output using Top-k sampling text generation method
sample_outputs = model.generate(inputs.input_ids,
                                pad_token_id=50256,
                                do_sample=True,
                                max_length=50,  # maximum total length (in tokens) of the generated sequence
                                top_k=40,
                                num_return_sequences=1)

# generated sequence
for i, sample_output in enumerate(sample_outputs):
    print(">> Generated text {}\n\n{}".format(i+1, tokenizer.decode(sample_output.tolist())))

# >> Generated text
# Quem era Jim Henson? Jim Henson era um executivo de televisão e diretor de um grande estúdio de cinema mudo chamado Selig,
# depois que o diretor de cinema mudo Georges Seuray dirigiu vários filmes para a Columbia e o estúdio.
```
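
The same sampling setup can also be run through the high-level `pipeline` API, reusing the model and tokenizer loaded above; a minimal sketch:

```python
from transformers import pipeline

# text-generation pipeline wrapping the already loaded model and tokenizer
gpt2_pt = pipeline("text-generation", model=model, tokenizer=tokenizer)
result = gpt2_pt("Quem era Jim Henson? Jim Henson era um",
                 max_length=50, do_sample=True, top_k=40)
print(result[0]["generated_text"])
```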

## How to use GPorTuguese-2 with HuggingFace (TensorFlow)

The following code uses TensorFlow. For PyTorch, see the corresponding section above.

### Load GPorTuguese-2 and its sub-word tokenizer (Byte-level BPE)

```python
from transformers import AutoTokenizer, TFAutoModelWithLMHead
import tensorflow as tf

tokenizer = AutoTokenizer.from_pretrained("pierreguillou/gpt2-small-portuguese")
model = TFAutoModelWithLMHead.from_pretrained("pierreguillou/gpt2-small-portuguese")

# set the maximum sequence length to GPT-2 small's context size of 1024 tokens
tokenizer.model_max_length = 1024

# unlike PyTorch, there is no model.eval() here: Keras models run with dropout
# disabled at inference time by default (training=False)
```

### Generate one full sequence

```python
# input sequence
text = "Quem era Jim Henson? Jim Henson era um"
inputs = tokenizer.encode(text, return_tensors="tf")

# model output using Top-k sampling text generation method
outputs = model.generate(inputs, eos_token_id=50256, pad_token_id=50256,
                         do_sample=True,
                         max_length=40,
                         top_k=40)
print(tokenizer.decode(outputs[0]))

# >> Generated text
# Quem era Jim Henson? Jim Henson era um amigo familiar da família. Ele foi contratado pelo seu pai
# para trabalhar como aprendiz no escritório de um escritório de impressão, e então começou a ganhar dinheiro
```

## Limitations and bias

The training data used for this model comes from Portuguese Wikipedia. We know it contains a lot of unfiltered content from the internet, which is far from neutral. As the openAI team themselves point out in their model card:

> Because large-scale language models like GPT-2 do not distinguish fact from fiction, we don’t support use-cases that require the generated text to be true. Additionally, language models like GPT-2 reflect the biases inherent to the systems they were trained on, so we do not recommend that they be deployed into systems that interact with humans unless the deployers first carry out a study of biases relevant to the intended use-case. We found no statistically significant difference in gender, race, and religious bias probes between 774M and 1.5B, implying all versions of GPT-2 should be approached with similar levels of caution around use cases that are sensitive to biases around human attributes.

## Author

Portuguese GPT-2 small was trained and evaluated by [Pierre GUILLOU](https://www.linkedin.com/in/pierreguillou/) thanks to the computing power of the GPU (NVIDIA V100 32GB) of the [AI Lab](https://www.linkedin.com/company/ailab-unb/) (University of Brasilia), to which I am attached as an Associate Researcher in NLP, and to the participation of its directors, Professors Fabricio Ataides Braz and Nilton Correia da Silva, in the definition of the NLP strategy.

## Citation

If you use our work, please cite:

```bibtex
@inproceedings{pierre2020gpt2smallportuguese,
  title={GPorTuguese-2 (Portuguese GPT-2 small): a Language Model for Portuguese text generation (and more NLP tasks...)},
  author={Pierre Guillou},
  year={2020}
}
```