language: tr
license: mit
datasets:
- allenai/c4
🇹🇷 Turkish ELECTRA model
We present community-driven BERT, DistilBERT, ELECTRA and ConvBERT models for Turkish.
Some of the datasets used for pretraining and evaluation were contributed by the awesome Turkish NLP community, which also chose the name of the BERT model: BERTurk.
The logo was provided by Merve Noyan.
Stats
We've also trained an ELECTRA (cased) model on the recently released Turkish part of the multilingual C4 (mC4) corpus from the AI2 team.
After filtering out documents with broken encoding, the training corpus is 242GB in size and contains 31,240,963,926 tokens.
We used the original 32k vocab (instead of creating a new one).
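As a quick sanity check on the corpus figures above, the average number of bytes per token can be derived from the two reported numbers. This is only a sketch: it assumes "GB" means 10**9 bytes (the text does not say whether decimal or binary units were used), and the helper name `bytes_per_token` is hypothetical.

```python
# Figures quoted in the Stats section.
CORPUS_SIZE_GB = 242
TOKEN_COUNT = 31_240_963_926

# Assumption: "GB" means 10**9 bytes; the source does not state
# whether decimal or binary units were used.
BYTES_PER_GB = 10**9


def bytes_per_token(size_gb, tokens, bytes_per_gb=BYTES_PER_GB):
    """Return the average number of corpus bytes per token."""
    return size_gb * bytes_per_gb / tokens


ratio = bytes_per_token(CORPUS_SIZE_GB, TOKEN_COUNT)
print(f"{ratio:.2f} bytes per token")
```

Passing `bytes_per_gb=2**30` instead gives the figure under the binary-unit reading.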
mC4 ELECTRA
In addition to the ELECTRA base model, we also trained an ELECTRA model on the Turkish part of the mC4 corpus. We used a sequence length of 512 for the entire training run and trained the model for 1M steps on a v3-32 TPU.
Model usage
All trained models are available on the DBMDZ Hugging Face model hub page and can be loaded by their model name.
Example usage with 🤗/Transformers:
from transformers import AutoModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("dbmdz/electra-base-turkish-mc4-cased-discriminator")
model = AutoModel.from_pretrained("dbmdz/electra-base-turkish-mc4-cased-discriminator")
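Since this checkpoint is an ELECTRA discriminator, it can also be loaded with the Transformers class `ElectraForPreTraining`, which returns one logit per input token, where a positive logit means the token is judged to have been replaced. The following is a minimal sketch rather than an official recipe: it assumes the published checkpoint includes the discriminator head weights, and the helper names `flag_replaced` and `detect` are hypothetical.

```python
MODEL_ID = "dbmdz/electra-base-turkish-mc4-cased-discriminator"


def flag_replaced(tokens, logits, threshold=0.0):
    """Pair each token with True when its logit exceeds the threshold.

    A logit above 0 means the discriminator judges the token as replaced.
    """
    return [(tok, score > threshold) for tok, score in zip(tokens, logits)]


def detect(sentence, model_id=MODEL_ID):
    """Run the discriminator on one sentence and flag each token."""
    # Imported lazily so flag_replaced stays usable without torch installed.
    import torch
    from transformers import AutoTokenizer, ElectraForPreTraining

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = ElectraForPreTraining.from_pretrained(model_id)
    model.eval()

    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        # logits has shape (batch_size, sequence_length); take the only row.
        logits = model(**inputs).logits[0]

    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return flag_replaced(tokens, logits.tolist())
```

Calling `detect("Ankara Türkiye'nin başkentidir.")` downloads the checkpoint from the hub on first use and returns a list of `(token, is_replaced)` pairs, including the special tokens added by the tokenizer.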
Citation
You can use the following BibTeX entry for citation:
@software{stefan_schweter_2020_3770924,
author = {Stefan Schweter},
title = {BERTurk - BERT models for Turkish},
month = apr,
year = 2020,
publisher = {Zenodo},
version = {1.0.0},
doi = {10.5281/zenodo.3770924},
url = {https://doi.org/10.5281/zenodo.3770924}
}
Acknowledgments
Thanks to Kemal Oflazer for providing us with additional large corpora for Turkish. Many thanks to Reyyan Yeniterzi for providing us with the Turkish NER dataset for evaluation.
We would like to thank Merve Noyan for the awesome logo!
Research supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC). Thanks for providing access to the TFRC ❤️