piotr-rybak committed
Commit: f8c0dd1
Parent(s): 796f174
Update README.md

README.md CHANGED
@@ -2,22 +2,34 @@
 language: pl
 tags:
 - herbert
-license: cc-by-
+license: cc-by-4.0
 ---

 # HerBERT
-**[HerBERT](https://en.wikipedia.org/wiki/Zbigniew_Herbert)** is a BERT-based Language Model trained on Polish
-using MLM and SSO
+**[HerBERT](https://en.wikipedia.org/wiki/Zbigniew_Herbert)** is a BERT-based Language Model trained on Polish corpora
+using Masked Language Modelling (MLM) and Sentence Structural Objective (SSO) with dynamic masking of whole words. For more details, please refer to: [HerBERT: Efficiently Pretrained Transformer-based Language Model for Polish](https://www.aclweb.org/anthology/2021.bsnlp-1.1/).
+
 Model training and experiments were conducted with [transformers](https://github.com/huggingface/transformers) in version 2.9.

+## Corpus
+HerBERT was trained on six different corpora available for Polish language:
+
+| Corpus | Tokens | Documents |
+| :------ | ------: | ------: |
+| [CCNet Middle](https://github.com/facebookresearch/cc_net) | 3243M | 7.9M |
+| [CCNet Head](https://github.com/facebookresearch/cc_net) | 2641M | 7.0M |
+| [National Corpus of Polish](http://nkjp.pl/index.php?page=14&lang=0) | 1357M | 3.9M |
+| [Open Subtitles](http://opus.nlpl.eu/OpenSubtitles-v2018.php) | 1056M | 1.1M |
+| [Wikipedia](https://dumps.wikimedia.org/) | 260M | 1.4M |
+| [Wolne Lektury](https://wolnelektury.pl/) | 41M | 5.5k |
+
 ## Tokenizer
-The training dataset was tokenized into subwords using
+The training dataset was tokenized into subwords using a character level byte-pair encoding (``CharBPETokenizer``) with
 a vocabulary size of 50k tokens. The tokenizer itself was trained with a [tokenizers](https://github.com/huggingface/tokenizers) library.
-We kindly encourage you to use the **Fast** version of tokenizer, namely ``HerbertTokenizerFast``.
-
-## HerBERT usage

+We kindly encourage you to use the ``Fast`` version of the tokenizer, namely ``HerbertTokenizerFast``.

+## Usage
 Example code:
 ```python
 from transformers import AutoTokenizer, AutoModel
@@ -40,12 +52,29 @@ output = model(
 )
 ```

-
 ## License
-CC BY
+CC BY 4.0

+## Citation
+If you use this model, please cite the following paper:
+```
+@inproceedings{mroczkowski-etal-2021-herbert,
+    title = "{H}er{BERT}: Efficiently Pretrained Transformer-based Language Model for {P}olish",
+    author = "Mroczkowski, Robert and
+      Rybak, Piotr and
+      Wr{\'o}blewska, Alina and
+      Gawlik, Ireneusz",
+    booktitle = "Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing",
+    month = apr,
+    year = "2021",
+    address = "Kiyv, Ukraine",
+    publisher = "Association for Computational Linguistics",
+    url = "https://www.aclweb.org/anthology/2021.bsnlp-1.1",
+    pages = "1--10",
+}
+```

 ## Authors
-
+The model was trained by **Machine Learning Research Team at Allegro** and [**Linguistic Engineering Group at Institute of Computer Science, Polish Academy of Sciences**](http://zil.ipipan.waw.pl/).

-You can contact us at: <a href="mailto:klejbenchmark@allegro.pl">klejbenchmark@allegro.pl</a>
+You can contact us at: <a href="mailto:klejbenchmark@allegro.pl">klejbenchmark@allegro.pl</a>