---
language:
- cs
metrics:
- perplexity
pipeline_tag: text-generation
license: mit
datasets:
- BUT-FIT/BUT-LCC
---
# Czech GPT
This is our GPT-2 XL model, trained as part of the research within the [SemANT project](https://www.fit.vut.cz/research/project/1629/.en).

# <span style="color:red">BUT LM Model Roster</span>
- [BUT-FIT/CSTinyLlama-1.2B](https://huggingface.co/BUT-FIT/CSTinyLlama-1.2B)
- [BUT-FIT/Czech-GPT-2-XL-133k](https://huggingface.co/BUT-FIT/Czech-GPT-2-XL-133k)
- [BUT-FIT/csmpt7b](https://huggingface.co/BUT-FIT/csmpt7b)

## Factsheet
- The model is trained on our `15,621,685,248 token / 78.48 GB / 10,900,000,000 word / 18,800,000 paragraph` corpus of Czech obtained by web crawling.
- The original size of our corpus before the deduplication and LM-filtering steps was `266.44 GB`.
- Our tokenizer has a 64k vocabulary, and we use GPT-2-like BPE encoding for tokenization.
- The model is trained in GPT-2 style: the first token is an actual text token (not BOS), so the probability of the first token cannot be computed.
- Due to the way our training code works, the model was never trained to generate [EOS].
- The model was trained for 133,000 update steps (~139B training tokens) before the experiment ended.
- The model was adapted from the original GPT-2 XL by:
   - replacing the tokenizer,
   - replacing the corresponding embeddings, and
   - copying over the 1,000 EN representations corresponding to the 1,000 most frequent tokens into the new embeddings, based on a bilingual dictionary (a sketch of this step follows this list).
- The training loss decreased steadily, and the model had clearly not converged yet. We compare the loss to the small 124M model version.
  <img src="XL_vs_SMALL_train.png" width="600"/>
- The validation loss also decreased steadily. Because of a bug in validation at the early and late steps, we release only the validation curve from step 46,000 to step 100,000. Again, we compare the loss to the small 124M model version.
  <img src="XL_vs_SMALL_test.png" width="600"/>
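The snippet below is a hypothetical sketch of the embedding-transplant step described above: copying the English representations of frequent tokens into a fresh Czech embedding table. The tokenizer path and the tiny `cs_to_en` dictionary are illustrative placeholders; the actual bilingual dictionary and adaptation code are not released here.

```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModelForCausalLM

# Original English GPT-2 XL and its tokenizer
en_tokenizer = AutoTokenizer.from_pretrained("gpt2-xl")
model = AutoModelForCausalLM.from_pretrained("gpt2-xl")
en_emb = model.get_input_embeddings().weight.detach().clone()

# New 64k Czech tokenizer (placeholder path)
cs_tokenizer = AutoTokenizer.from_pretrained("path/to/czech-64k-tokenizer")

# Fresh embedding table for the Czech vocabulary
new_emb = nn.Embedding(len(cs_tokenizer), en_emb.shape[1])

# Bilingual mapping for the most frequent Czech tokens (toy stand-in for the real dictionary)
cs_to_en = {"voda": "water", "pes": "dog"}

with torch.no_grad():
    for cs_token, en_word in cs_to_en.items():
        cs_id = cs_tokenizer.convert_tokens_to_ids(cs_token)
        en_ids = en_tokenizer(en_word, add_special_tokens=False)["input_ids"]
        if cs_id is not None and en_ids:
            # copy the (averaged) English representation into the new Czech row
            new_emb.weight[cs_id] = en_emb[en_ids].mean(dim=0)

# Swap the embeddings in and keep the tied output head in sync
model.resize_token_embeddings(len(cs_tokenizer))
model.get_input_embeddings().weight.data.copy_(new_emb.weight)
model.tie_weights()
```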

## Training parameters
Parameters not mentioned here are the same as for GPT-2.

| **Name**                   | **Value**     | **Note**                                                                                     |
|----------------------------|---------------|----------------------------------------------------------------------------------------------|
| dataset_type               | Concat        | Sequences at the model's input were concatenated up to `$max_seq_len` and divided by the EOS token (see the packing sketch below the table). |
| tokenizer_size             | 64k           |                                                                                              |
| max_seq_len                | 1024          |                                                                                              |
| batch_size                 | 1024          |                                                                                              |
| learning_rate              | 1.0e-4        |                                                                                              |
| optimizer                  | LionW         |                                                                                              |
| optimizer_betas            | 0.9/0.95      |                                                                                              |
| optimizer_weight_decay     | 0             |                                                                                              |
| optimizer_eps              | 1.0e-08       |                                                                                              |
| gradient_clipping_max_norm | 1.0           |                                                                                              |
| attn_impl                  | flash2        |                                                                                              |
| dropout                    | 0.1           | for residuals, attention, embeddings                                                         |
| fsdp                       | SHARD_GRAD_OP | (optimized for A100 40GB GPUs)                                                               |
| precision                  | bf16          |                                                                                              |
| scheduler                  | linear        |                                                                                              |
| scheduler_warmup           | 10,000 steps  |                                                                                              |
| scheduler_steps            | 200,000       |                                                                                              |
| scheduler_alpha            | 0.1           | So the LR at the final step is 0.1 × the initial LR                                          |
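
As a rough illustration of the `Concat` dataset type from the table (not the actual training pipeline), documents can be tokenized, joined with EOS separators, and cut into fixed 1024-token blocks:

```python
def pack_concat(documents, tokenizer, max_seq_len=1024):
    """Concatenate tokenized documents, separated by EOS, and yield fixed-length blocks."""
    buffer = []
    for doc in documents:
        buffer.extend(tokenizer(doc, add_special_tokens=False)["input_ids"])
        buffer.append(tokenizer.eos_token_id)  # EOS divides consecutive documents
        while len(buffer) >= max_seq_len:
            yield buffer[:max_seq_len]
            buffer = buffer[max_seq_len:]

# e.g. blocks = list(pack_concat(["První dokument.", "Druhý dokument."], tokenizer))
```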

## Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

t = AutoTokenizer.from_pretrained("BUT-FIT/Czech-GPT-2-XL-133k")
m = AutoModelForCausalLM.from_pretrained("BUT-FIT/Czech-GPT-2-XL-133k").eval()

# Try the model inference
prompt = "Nejznámějším českým spisovatelem "
input_ids = t.encode(prompt, return_tensors="pt")
with torch.no_grad():
    generated_text = m.generate(input_ids=input_ids,
                                do_sample=True,
                                top_p=0.95,
                                repetition_penalty=1.0,
                                temperature=0.8,
                                max_new_tokens=64,
                                num_return_sequences=1)
    print(t.decode(generated_text[0], skip_special_tokens=True))
```
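
Since training starts directly with a text token (no BOS, see the Factsheet), the probability of the very first token is undefined. Below is a minimal sketch of how one might compute perplexity with this model while skipping that first position; it reuses `m` and `t` from the snippet above and is not part of the released code.

```python
import torch

def perplexity(model, tokenizer, text):
    """Perplexity over tokens 2..N; position 1 has no prediction because there is no BOS."""
    ids = tokenizer(text, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        logits = model(ids).logits
    # logits at position i predict token i+1, so drop the last position and the first target
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = ids[:, 1:]
    nll = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return torch.exp(nll.mean()).item()

print(perplexity(m, t, "Brno je po Praze druhé největší město v České republice."))
```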

## Evaluation
We observed an improvement in 10-shot results over the course of training for sentiment analysis and HellaSwag-like commonsense reasoning.
For some tasks there was no such improvement, e.g., grammar error classification (does the sentence contain a grammatical error?).
We will release the precise results once we advance with the work on our Czech evaluation kit.
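
For context, the sketch below shows one common way to obtain such few-shot classification scores with a plain causal LM: score each candidate label as a continuation of the prompt and pick the most likely one. The prompt format, labels, and examples are illustrative placeholders (and it reuses `m` and `t` from the Usage snippet), not our evaluation kit.

```python
import torch

def choice_logprob(model, tokenizer, context, choice):
    """Approximate log-likelihood of `choice` as a continuation of `context`."""
    full = tokenizer(context + choice, return_tensors="pt")["input_ids"]
    n_choice = len(tokenizer(choice, add_special_tokens=False)["input_ids"])
    with torch.no_grad():
        logits = model(full).logits
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = full[:, 1:]
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_lp[0, -n_choice:].sum().item()  # score only the choice tokens

# Toy 2-shot sentiment prompt (a real run would use 10 in-context examples)
shots = [("Skvělý film, doporučuji.", " pozitivní"), ("Nuda, ztráta času.", " negativní")]
prompt = "".join(f"Recenze: {x}\nSentiment:{y}\n" for x, y in shots)
prompt += "Recenze: Herecké výkony byly vynikající.\nSentiment:"
prediction = max([" pozitivní", " negativní"], key=lambda lab: choice_logprob(m, t, prompt, lab))
print(prediction.strip())
```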


## Disclaimer
This is an intermediate result of our work in progress. The model is probabilistic, and the authors are not responsible for its outputs. Use at your own risk.
For further questions, turn to `martin.fajcik@vut.cz`.

## Acknowledgement
This work was supported by the NAKI III programme of the Ministry of Culture of the Czech Republic, project semANT
("Sémantický průzkumník textového kulturního dědictví", grant no. `DH23P03OVV060`), and
by the Ministry of Education, Youth and Sports of the Czech Republic through e-INFRA CZ (ID: `90254`).