fermi-bert-512: Pretrained BERT for Nuclear Power

fermi-bert-512 is a BERT model optimized for the nuclear energy domain. It is pretrained on a combination of Wikipedia (2023), Books3, and a subset of the U.S. Nuclear Regulatory Commission’s ADAMS database, and it is designed to handle the technical jargon and regulatory language of the nuclear industry. The model was trained on the Frontier supercomputer at Oak Ridge National Laboratory using 64 AMD MI250X GPUs over a 10-hour period, and it provides a foundation for fine-tuning in nuclear-related applications.

Training

fermi-bert-512 is a BERT model pretrained on Wikipedia (2023), Books3, and ADAMS with a maximum sequence length of 512 tokens.

We make several modifications to the standard BERT training procedure:

We use a custom nuclear-optimized WordPiece tokenizer to better represent the unique jargon and technical terminology specific to the nuclear industry.
We train on a subset of the U.S. Nuclear Regulatory Commission’s Agency-wide Documents Access and Management System (ADAMS).
We train on Books3 rather than BookCorpus.
We use a larger batch size and other improved hyperparameters, as described in RoBERTa.
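
To illustrate why a domain-specific WordPiece vocabulary can help, the sketch below implements the standard greedy longest-match-first WordPiece split for a single word and compares two toy vocabularies. Both vocabularies and the example words are hypothetical and chosen only for illustration; they are not taken from the actual fermi-bert-512 tokenizer.

```python
from typing import List, Set


def wordpiece(word: str, vocab: Set[str], unk: str = "[UNK]") -> List[str]:
    """Greedy longest-match-first WordPiece split of one lowercase word."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece matches: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical vocabularies: a generic one, and the same one extended
# with two nuclear-industry terms.
generic_vocab = {"zi", "##rca", "##loy", "sc", "##ram"}
domain_vocab = generic_vocab | {"zircaloy", "scram"}

for term in ("zircaloy", "scram"):
    print(term, wordpiece(term, generic_vocab), wordpiece(term, domain_vocab))
```

With the generic vocabulary each term is split into several sub-word pieces, while the extended vocabulary keeps it as a single token, so more of the 512-token window is left for context.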

Evaluation

We evaluate the quality of fermi-bert-512 on the standard GLUE benchmark (script). It performs comparably to other BERT models on GLUE, with the added advantage of better performance on documents in the nuclear energy space, as demonstrated by our downstream fine-tuning.

Model	Batch size	Steps	Seq len	Avg	CoLA	SST-2	MRPC	STS-B	QQP	MNLI	QNLI	RTE
bert-base-uncased	256	1M	512	0.81	0.56	0.82	0.86	0.88	0.91	0.84	0.91	0.67
roberta-base	8k	500k	512	0.84	0.56	0.94	0.88	0.90	0.92	0.88	0.92	0.74
fermi-bert-512	4k	100k	512	0.83	0.60	0.93	0.88	0.89	0.91	0.87	0.91	0.68
fermi-bert-1024	4k	100k	1024	0.83	0.60	0.93	0.86	0.89	0.91	0.86	0.92	0.69
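
The Avg column appears to be the unweighted mean of the eight per-task scores. The short check below recomputes it from the table under that assumption; the variable and function names are illustrative only.

```python
# Per-task scores copied from the table, in the order:
# CoLA, SST-2, MRPC, STS-B, QQP, MNLI, QNLI, RTE
rows = {
    "bert-base-uncased": (0.81, [0.56, 0.82, 0.86, 0.88, 0.91, 0.84, 0.91, 0.67]),
    "roberta-base": (0.84, [0.56, 0.94, 0.88, 0.90, 0.92, 0.88, 0.92, 0.74]),
    "fermi-bert-512": (0.83, [0.60, 0.93, 0.88, 0.89, 0.91, 0.87, 0.91, 0.68]),
    "fermi-bert-1024": (0.83, [0.60, 0.93, 0.86, 0.89, 0.91, 0.86, 0.92, 0.69]),
}


def mean_score(scores):
    """Unweighted mean over the task scores."""
    return sum(scores) / len(scores)


for name, (reported_avg, scores) in rows.items():
    print(f"{name}: recomputed {mean_score(scores):.4f}, reported {reported_avg}")
```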

Pretraining Data

We train on a mix of 40% Wikipedia, 30% Books3, and 30% ADAMS. We tokenize the documents and pack them into sequences of 512 tokens. If a document is shorter than 512 tokens, we append further documents until the sequence reaches 512 tokens. If a document is longer than 512 tokens, we split it into multiple sequences. For 10% of the Wikipedia documents, we do not concatenate short documents. See M2-BERT for the rationale behind including short documents.
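
The packing procedure described above can be sketched roughly as follows. This is a minimal illustration, not the actual preprocessing code: the function name, the optional separator token, the handling of the final partial sequence, and the way short documents are sampled are all assumptions.

```python
import random
from typing import Iterable, Iterator, List, Optional


def pack_sequences(
    docs: Iterable[List[int]],
    max_len: int = 512,
    sep_id: Optional[int] = None,
    keep_short_prob: float = 0.0,
    seed: int = 0,
) -> Iterator[List[int]]:
    """Pack tokenized documents into sequences of at most max_len tokens."""
    rng = random.Random(seed)
    buffer: List[int] = []
    for doc in docs:
        doc = list(doc)
        # With probability keep_short_prob, emit a short document on its
        # own instead of concatenating it (it would be padded downstream).
        if len(doc) < max_len and rng.random() < keep_short_prob:
            yield doc
            continue
        # Assumption: optionally separate concatenated documents.
        if buffer and sep_id is not None:
            buffer.append(sep_id)
        buffer.extend(doc)
        # Long documents (or a full buffer) are split into max_len chunks.
        while len(buffer) >= max_len:
            yield buffer[:max_len]
            buffer = buffer[max_len:]
    if buffer:
        yield buffer  # assumption: keep the final partial sequence


# Toy example with max_len=4 so the behaviour is easy to inspect.
toy_docs = [[1, 1, 1], [2] * 10, [3, 3]]
packed = list(pack_sequences(toy_docs, max_len=4))
unpacked_short = list(pack_sequences(toy_docs, max_len=4, keep_short_prob=1.0))
print(packed)
print(unpacked_short)
```

Setting `keep_short_prob=0.1` for the Wikipedia portion would approximate the 10% of documents that are left unconcatenated.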

Usage

from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

# `fermi-bert` uses a nuclear-specific tokenizer, so load it from the same repository
tokenizer = AutoTokenizer.from_pretrained('atomic-canyon/fermi-bert-512')
model = AutoModelForMaskedLM.from_pretrained('atomic-canyon/fermi-bert-512')

# To use this model directly for masked language modeling
fill_mask = pipeline('fill-mask', model=model, tokenizer=tokenizer, device="cpu")

# Prints a list of candidate tokens for [MASK], each with a score
print(fill_mask("I [MASK] to the store yesterday."))
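
The fill-mask pipeline returns a list of dictionaries with `score`, `token`, `token_str`, and `sequence` keys. The helper below is one possible way to reduce that list to the top candidates; the function name is hypothetical, and the sample data is a made-up placeholder shaped like pipeline output, not actual predictions from fermi-bert-512.

```python
from typing import Dict, List, Tuple


def top_predictions(
    results: List[Dict], k: int = 3, min_score: float = 0.0
) -> List[Tuple[str, float]]:
    """Return up to k (token, score) pairs, highest score first."""
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    kept = [(r["token_str"].strip(), r["score"]) for r in ranked
            if r["score"] >= min_score]
    return kept[:k]


# Placeholder data only: scores and token ids are invented for illustration.
sample_output = [
    {"score": 0.10, "token": 101, "token_str": "drove",
     "sequence": "i drove to the store yesterday."},
    {"score": 0.70, "token": 102, "token_str": "went",
     "sequence": "i went to the store yesterday."},
    {"score": 0.05, "token": 103, "token_str": "walked",
     "sequence": "i walked to the store yesterday."},
]

print(top_predictions(sample_output, k=2))
```

In practice, the list passed to `top_predictions` would be the return value of the fill-mask pipeline call for a single masked sentence.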

Acknowledgement

This research used resources of the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725.
