globuslabs committed · Commit f3806b3 · Parent: ba39e9e
Update README
README.md
CHANGED
@@ -10,7 +10,7 @@ license: apache-2.0
 
 This is the **ScholarBERT_10_WB** variant of the ScholarBERT model family.
 
-The model is pretrained on a large collection of scientific research articles (**
+The model is pretrained on a large collection of scientific research articles (**22.1B tokens**).
 Additionally, the pretraining data also includes the Wikipedia+BookCorpus, which are used to pretrain the [BERT-base](https://huggingface.co/bert-base-cased) and [BERT-large](https://huggingface.co/bert-large-cased) models.
 
 This is a **cased** (case-sensitive) model. The tokenizer will not convert all inputs to lower-case by default.
@@ -30,7 +30,7 @@ The model is based on the same architecture as [BERT-large.
 
 # Training Dataset
 
-The vocab and the model are pertrained on **
+The vocab and the model are pretrained on **10% of the PRD** scientific literature dataset and Wikipedia+BookCorpus.
 
 The PRD dataset is provided by Public.Resource.Org, Inc. (“Public Resource”),
 a nonprofit organization based in California. This dataset was constructed from a corpus