---
language: en
---
|
# SciBERT
|
|
|
This is the pretrained model presented in [SciBERT: A Pretrained Language Model for Scientific Text](https://www.aclweb.org/anthology/D19-1371/), which is a BERT model trained on scientific text.
|
|
|
The training corpus consists of 1.14M papers (3.1B tokens) taken from [Semantic Scholar](https://www.semanticscholar.org). We use the full text of the papers in training, not just the abstracts.
|
|
|
SciBERT has its own wordpiece vocabulary (scivocab) that is built to best match the training corpus. We trained cased and uncased versions.
|
|
|
Available models include (a short usage sketch follows the list):

* `scibert_scivocab_cased`

* `scibert_scivocab_uncased`
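
As a rough illustration of how these checkpoints can be used, the sketch below loads the uncased model with the Hugging Face `transformers` library and inspects how the scivocab wordpiece vocabulary segments a scientific sentence. It assumes the checkpoints are available on the Hugging Face Hub under the `allenai` namespace (e.g. `allenai/scibert_scivocab_uncased`); adjust the model ID, or point `from_pretrained` at a local copy, if your setup differs.

```python
# Minimal usage sketch (assumes the checkpoint is published as
# "allenai/scibert_scivocab_uncased" on the Hugging Face Hub).
from transformers import AutoModel, AutoTokenizer

model_id = "allenai/scibert_scivocab_uncased"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

text = "The transcription factor binds to the promoter region of the gene."

# Inspect how the scivocab wordpiece vocabulary splits scientific text.
print(tokenizer.tokenize(text))

# Encode the sentence and obtain contextual token embeddings.
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)
```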
|
|
|
|
|
The original repo can be found [here](https://github.com/allenai/scibert).
|
|
|
If using these models, please cite the following paper:
|
```
@inproceedings{beltagy-etal-2019-scibert,
    title = "SciBERT: A Pretrained Language Model for Scientific Text",
    author = "Beltagy, Iz and Lo, Kyle and Cohan, Arman",
    booktitle = "EMNLP",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/D19-1371"
}
```
|
|