piotr-rybak committed
Commit: f8c0dd1
Parent(s): 796f174
Update README.md

README.md CHANGED
@@ -2,22 +2,34 @@
 language: pl
 tags:
 - herbert
-license: cc-by-
+license: cc-by-4.0
 ---

 # HerBERT
-**[HerBERT](https://en.wikipedia.org/wiki/Zbigniew_Herbert)** is a BERT-based Language Model trained on Polish
-using MLM and SSO
+**[HerBERT](https://en.wikipedia.org/wiki/Zbigniew_Herbert)** is a BERT-based Language Model trained on Polish corpora
+using Masked Language Modelling (MLM) and Sentence Structural Objective (SSO) with dynamic masking of whole words. For more details, please refer to: [HerBERT: Efficiently Pretrained Transformer-based Language Model for Polish](https://www.aclweb.org/anthology/2021.bsnlp-1.1/).
+
 Model training and experiments were conducted with [transformers](https://github.com/huggingface/transformers) in version 2.9.

+## Corpus
+HerBERT was trained on six different corpora available for Polish language:
+
+| Corpus | Tokens | Documents |
+| :------ | ------: | ------: |
+| [CCNet Middle](https://github.com/facebookresearch/cc_net) | 3243M | 7.9M |
+| [CCNet Head](https://github.com/facebookresearch/cc_net) | 2641M | 7.0M |
+| [National Corpus of Polish](http://nkjp.pl/index.php?page=14&lang=0) | 1357M | 3.9M |
+| [Open Subtitles](http://opus.nlpl.eu/OpenSubtitles-v2018.php) | 1056M | 1.1M |
+| [Wikipedia](https://dumps.wikimedia.org/) | 260M | 1.4M |
+| [Wolne Lektury](https://wolnelektury.pl/) | 41M | 5.5k |
+
 ## Tokenizer
-The training dataset was tokenized into subwords using
+The training dataset was tokenized into subwords using a character level byte-pair encoding (``CharBPETokenizer``) with
 a vocabulary size of 50k tokens. The tokenizer itself was trained with a [tokenizers](https://github.com/huggingface/tokenizers) library.
-We kindly encourage you to use the **Fast** version of tokenizer, namely ``HerbertTokenizerFast``.
-
-## HerBERT usage

+We kindly encourage you to use the ``Fast`` version of the tokenizer, namely ``HerbertTokenizerFast``.

+## Usage
 Example code:
 ```python
 from transformers import AutoTokenizer, AutoModel
@@ -40,12 +52,29 @@ output = model(
 )
 ```

-
 ## License
-CC BY
+CC BY 4.0

+## Citation
+If you use this model, please cite the following paper:
+```
+@inproceedings{mroczkowski-etal-2021-herbert,
+    title = "{H}er{BERT}: Efficiently Pretrained Transformer-based Language Model for {P}olish",
+    author = "Mroczkowski, Robert and
+      Rybak, Piotr and
+      Wr{\'o}blewska, Alina and
+      Gawlik, Ireneusz",
+    booktitle = "Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing",
+    month = apr,
+    year = "2021",
+    address = "Kiyv, Ukraine",
+    publisher = "Association for Computational Linguistics",
+    url = "https://www.aclweb.org/anthology/2021.bsnlp-1.1",
+    pages = "1--10",
+}
+```

 ## Authors
-
+The model was trained by **Machine Learning Research Team at Allegro** and [**Linguistic Engineering Group at Institute of Computer Science, Polish Academy of Sciences**](http://zil.ipipan.waw.pl/).

-You can contact us at: <a href="mailto:klejbenchmark@allegro.pl">klejbenchmark@allegro.pl</a>
+You can contact us at: <a href="mailto:klejbenchmark@allegro.pl">klejbenchmark@allegro.pl</a>