globuslabs committed · Commit f3806b3 · Parent: ba39e9e
Update README
README.md
CHANGED
@@ -10,7 +10,7 @@ license: apache-2.0
 
 This is the **ScholarBERT_10_WB** variant of the ScholarBERT model family.
 
-The model is pretrained on a large collection of scientific research articles (**
+The model is pretrained on a large collection of scientific research articles (**22.1B tokens**).
 Additionally, the pretraining data also includes the Wikipedia+BookCorpus, which are used to pretrain the [BERT-base](https://huggingface.co/bert-base-cased) and [BERT-large](https://huggingface.co/bert-large-cased) models.
 
 This is a **cased** (case-sensitive) model. The tokenizer will not convert all inputs to lower-case by default.
@@ -30,7 +30,7 @@ The model is based on the same architecture as [BERT-large.
 
 # Training Dataset
 
-The vocab and the model are pertrained on **
+The vocab and the model are pretrained on **10% of the PRD** scientific literature dataset and Wikipedia+BookCorpus.
 
 The PRD dataset is provided by Public.Resource.Org, Inc. (“Public Resource”),
 a nonprofit organization based in California. This dataset was constructed from a corpus