ScholarBERT_100_64bit Model

This is the ScholarBERT_100_64bit variant of the ScholarBERT model family. It differs from the ScholarBERT_100 model in that its tokenizer was trained with int64 counts rather than the default int32, so the counts of very frequent tokens (e.g., "the") do not overflow.
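
To illustrate why the counter width matters, the following is a minimal sketch rather than the actual tokenizer training code. It assumes, purely for illustration, that a single very frequent token makes up 1% of a 221B-token corpus; the share and the helper name `wrap_int32` are hypothetical.

```python
import ctypes

INT32_MAX = 2**31 - 1   # 2,147,483,647
INT64_MAX = 2**63 - 1

CORPUS_TOKENS = 221_000_000_000   # corpus size stated in this card
ASSUMED_SHARE_PERCENT = 1         # hypothetical share of one frequent token


def wrap_int32(n: int) -> int:
    """Return n as a signed 32-bit integer would store it (wraps on overflow)."""
    return ctypes.c_int32(n).value


# Hypothetical occurrence count of the frequent token.
count = CORPUS_TOKENS * ASSUMED_SHARE_PERCENT // 100

print("count:", count)
print("fits in int32:", count <= INT32_MAX)
print("stored as int32:", wrap_int32(count))
print("fits in int64:", count <= INT64_MAX)
```

Under this assumption the count exceeds the int32 maximum and wraps to a negative value, while an int64 counter holds it with ample headroom.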

The model is pretrained on a large collection of scientific research articles (221B tokens).

This is a cased (case-sensitive) model. By default, the tokenizer does not convert inputs to lowercase.

The model is based on the same architecture as BERT-large and has a total of 340M parameters.

Model Architecture

Hyperparameter	Value
Layers	24
Hidden Size	1024
Attention Heads	16
Total Parameters	340M
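
The parameter total can be roughly reproduced from the hyperparameters above. The sketch below counts the weights of a standard BERT encoder with a pooler. The feed-forward size (4096), maximum sequence length (512), number of segment types (2), and the vocabulary size are assumptions taken from the original BERT-large configuration, not values stated in this card; ScholarBERT's actual vocabulary size may differ, which would shift the total.

```python
def bert_param_count(
    layers: int = 24,
    hidden: int = 1024,
    vocab_size: int = 30522,   # assumption: original BERT-large vocabulary size
    intermediate: int = 4096,  # assumption: 4 * hidden, as in BERT-large
    max_positions: int = 512,  # assumption
    type_vocab: int = 2,       # assumption
) -> int:
    """Count the parameters of a BERT encoder with a pooler head."""
    layer_norm = 2 * hidden  # scale and bias

    # Token, position, and segment embeddings, followed by a LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm

    # Self-attention: query, key, value, and output projections with biases.
    attention = 4 * (hidden * hidden + hidden) + layer_norm

    # Feed-forward block: two dense layers with biases, then a LayerNorm.
    feed_forward = (
        hidden * intermediate + intermediate
        + intermediate * hidden + hidden
        + layer_norm
    )

    pooler = hidden * hidden + hidden
    return embeddings + layers * (attention + feed_forward) + pooler


total = bert_param_count()
print(f"{total:,} parameters (~{total / 1e6:.0f}M)")
```

With these assumed values the count comes to about 335M, which is the figure usually rounded to 340M for BERT-large. The attention head count does not appear in the formula because splitting the hidden dimension into 16 heads does not change the number of weights.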

Training Dataset

The vocabulary and the model are both trained on 100% of the PRD scientific literature dataset.

The PRD dataset is provided by Public.Resource.Org, Inc. (“Public Resource”), a nonprofit organization based in California. The dataset was constructed from a corpus of journal article files, from which we successfully extracted the text of 75,496,055 articles from 178,928 journals. The articles span Arts & Humanities, Life Sciences & Biomedicine, Physical Sciences, Social Sciences, and Technology. The distribution of articles is shown below.

BibTeX entry and citation info

If using this model, please cite this paper:

@inproceedings{hong2023diminishing,
  title={The diminishing returns of masked language models to science},
  author={Hong, Zhi and Ajith, Aswathy and Pauloski, James and Duede, Eamon and Chard, Kyle and Foster, Ian},
  booktitle={Findings of the Association for Computational Linguistics: ACL 2023},
  pages={1270--1283},
  year={2023}
}