---
license: apache-2.0
datasets:
- allenai/scirepeval
language:
- en
---

# SPECTER 2.0 (Base) |
|
|
|
<!-- Provide a quick summary of what the model is/does. --> |
|
|
|
SPECTER 2.0 is the successor to [SPECTER](https://huggingface.co/allenai/specter) and generates task-specific embeddings for scientific tasks when paired with [adapters](https://huggingface.co/models?search=allenai/specter-2_).

Given the title and abstract of a scientific paper, or a short textual query, the model generates embeddings that can be used in downstream applications.
|
|
|
**Note:** For the best performance on a downstream task type, load the associated adapter with the base model, as shown under Direct Use below.
|
|
|
# Model Details |
|
|
|
## Model Description |
|
|
|
SPECTER 2.0 has been trained on over 6M triplets of scientific paper citations, which are available [here](https://huggingface.co/datasets/allenai/scirepeval/viewer/cite_prediction_new/evaluation). |
|
It is then trained on all the [SciRepEval](https://huggingface.co/datasets/allenai/scirepeval) training tasks, with task-format-specific adapters.
|
|
|
Task Formats trained on: |
|
- Classification |
|
- Regression |
|
- Proximity |
|
- Adhoc Search |
|
|
|
|
|
It builds on the work done in [SciRepEval: A Multi-Format Benchmark for Scientific Document Representations](https://api.semanticscholar.org/CorpusID:254018137) and we evaluate the trained model on this benchmark as well. |
|
|
|
|
|
|
|
- **Developed by:** Amanpreet Singh, Mike D'Arcy, Arman Cohan, Doug Downey, Sergey Feldman |
|
- **Shared by:** Allen AI
|
- **Model type:** bert-base-uncased + adapters |
|
- **License:** Apache 2.0 |
|
- **Finetuned from model:** [allenai/scibert](https://huggingface.co/allenai/scibert_scivocab_uncased). |
|
|
|
## Model Sources |
|
|
|
<!-- Provide the basic links for the model. --> |
|
|
|
- **Repository:** [https://github.com/allenai/SPECTER2_0](https://github.com/allenai/SPECTER2_0) |
|
- **Paper:** [https://api.semanticscholar.org/CorpusID:254018137](https://api.semanticscholar.org/CorpusID:254018137) |
|
- **Demo:** [Usage](https://github.com/allenai/SPECTER2_0/blob/main/README.md) |
|
|
|
# Uses |
|
|
|
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. --> |
|
|
|
## Direct Use |
|
|
|
|Model|Name and HF link|Description|
|--|--|--|
|Retrieval*|[allenai/specter2_proximity](https://huggingface.co/allenai/specter2_proximity)|Encode papers as queries and candidates, e.g., Link Prediction, Nearest Neighbor Search|
|Adhoc Query|[allenai/specter2_adhoc_query](https://huggingface.co/allenai/specter2_adhoc_query)|Encode short raw text queries for search tasks. (Candidate papers can be encoded with proximity)|
|Classification|[allenai/specter2_classification](https://huggingface.co/allenai/specter2_classification)|Encode papers to feed into linear classifiers as features|
|Regression|[allenai/specter2_regression](https://huggingface.co/allenai/specter2_regression)|Encode papers to feed into linear regressors as features|
|
|
|
*Retrieval model should suffice for downstream task types not mentioned above |
|
|
|
```python
from transformers import AutoTokenizer, AutoModel

# NOTE: the load_adapter(..., source="hf", load_as=..., set_active=...) call below
# is assumed to come from the adapter-transformers package being installed.

# load the tokenizer
tokenizer = AutoTokenizer.from_pretrained('allenai/specter2')

# load the base model
model = AutoModel.from_pretrained('allenai/specter2')

# load the adapter for the required task, name it via load_as, and activate it
model.load_adapter("allenai/specter2_proximity", source="hf", load_as="proximity", set_active=True)
# other possibilities: allenai/specter2_<classification|regression|adhoc_query>

papers = [{'title': 'BERT', 'abstract': 'We introduce a new language representation model called BERT'},
          {'title': 'Attention is all you need', 'abstract': 'The dominant sequence transduction models are based on complex recurrent or convolutional neural networks'}]

# concatenate title and abstract, separated by the tokenizer's [SEP] token
text_batch = [d['title'] + tokenizer.sep_token + (d.get('abstract') or '') for d in papers]
# preprocess the input
inputs = tokenizer(text_batch, padding=True, truncation=True,
                   return_tensors="pt", return_token_type_ids=False, max_length=512)
output = model(**inputs)
# take the first token ([CLS]) of each sequence as its embedding
embeddings = output.last_hidden_state[:, 0, :]
```
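
Once embeddings are available, a proximity task such as nearest neighbor search reduces to ranking candidates by similarity to a query. The following is a minimal sketch, not part of the SPECTER 2.0 codebase: the helper names `cosine_similarity` and `rank_candidates` are hypothetical, and cosine similarity is an assumed metric (this card does not prescribe one). Toy 3-dimensional vectors stand in for real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_candidates(query_vec, candidate_vecs):
    """Return candidate indices ordered from most to least similar to the query."""
    scores = [cosine_similarity(query_vec, c) for c in candidate_vecs]
    return sorted(range(len(candidate_vecs)), key=lambda i: scores[i], reverse=True)


# Toy vectors; with real model output, rows of
# `embeddings.detach().tolist()` could be passed instead.
query = [1.0, 0.0, 0.0]
candidates = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [-1.0, 0.0, 0.0]]
ranking = rank_candidates(query, candidates)
print(ranking)  # [1, 0, 2]
```

For large candidate sets, a vectorized or approximate nearest neighbor index would likely be preferable to this pure-Python loop.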
|
|
|
## Downstream Use |
|
|
|
<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app --> |
|
|
|
For evaluation and downstream usage, please refer to [https://github.com/allenai/scirepeval/blob/main/evaluation/INFERENCE.md](https://github.com/allenai/scirepeval/blob/main/evaluation/INFERENCE.md). |
|
|
|
# Training Details |
|
|
|
## Training Data |
|
|
|
<!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. --> |
|
|
|
The base model is trained on citation links between papers, and the adapters are trained on 8 large-scale tasks across the four formats.

All the data is part of the SciRepEval benchmark and is available [here](https://huggingface.co/datasets/allenai/scirepeval).

The citation links are triplets of the form
|
|
|
```json
{"query": {"title": ..., "abstract": ...}, "pos": {"title": ..., "abstract": ...}, "neg": {"title": ..., "abstract": ...}}
```
|
|
|
consisting of a query paper, a positive citation, and a negative paper that is either from the same or a different field of study as the query, or a citation of a citation.
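
As an illustration of how such a triplet maps onto model inputs, the sketch below turns each paper into the same `title + [SEP] + abstract` string used under Direct Use. The helper names `paper_to_text` and `triplet_to_texts` and the example titles are hypothetical, and `"[SEP]"` is assumed as the separator; in real code, `tokenizer.sep_token` should be used instead.

```python
SEP_TOKEN = "[SEP]"  # assumed separator; prefer tokenizer.sep_token in practice


def paper_to_text(paper, sep_token=SEP_TOKEN):
    """Concatenate title and abstract; a missing abstract becomes an empty string."""
    return paper["title"] + sep_token + (paper.get("abstract") or "")


def triplet_to_texts(triplet, sep_token=SEP_TOKEN):
    """Return the (query, positive, negative) input strings for one triplet."""
    return tuple(paper_to_text(triplet[key], sep_token) for key in ("query", "pos", "neg"))


# Placeholder triplet, not taken from the training data.
example = {
    "query": {"title": "Query title", "abstract": "Query abstract"},
    "pos": {"title": "Pos title", "abstract": "Pos abstract"},
    "neg": {"title": "Neg title", "abstract": None},
}
texts = triplet_to_texts(example)
print(texts)
```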
|
|
|
## Training Procedure |
|
|
|
Please refer to the [SPECTER paper](https://api.semanticscholar.org/CorpusID:215768677). |
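
For orientation, the SPECTER paper trains with a triplet margin loss over L2 distances between embeddings. The sketch below shows that loss on plain Python lists; it assumes SPECTER 2.0 keeps the same objective with a margin of 1, and the function name `triplet_margin_loss` is illustrative rather than taken from the training code.

```python
import math


def triplet_margin_loss(query, pos, neg, margin=1.0):
    """max(d(query, pos) - d(query, neg) + margin, 0) with Euclidean distance d."""
    d_pos = math.dist(query, pos)
    d_neg = math.dist(query, neg)
    return max(d_pos - d_neg + margin, 0.0)


# Easy negative: already farther from the query than the positive by more than the margin.
easy = triplet_margin_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])
# Hard negative: almost as close to the query as the positive, so the loss is non-zero.
hard = triplet_margin_loss([0.0, 0.0], [0.0, 1.0], [0.0, 1.5])
print(easy, hard)  # 0.0 0.5
```

The loss is zero once the negative is at least `margin` farther from the query than the positive, which is why harder negatives (same field of study, citation of a citation) contribute more to training.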
|
|
|
|
|
### Training Hyperparameters |
|
|
|
|
|
The model is trained in two stages using [SciRepEval](https://github.com/allenai/scirepeval/blob/main/training/TRAINING.md):

- Base Model: First, a base model is trained on the above citation triplets.

``` batch size = 1024, max input length = 512, learning rate = 2e-5, epochs = 2, warmup steps = 10%, fp16```

- Adapters: Thereafter, task-format-specific adapters are trained on the SciRepEval training tasks, with 600K of the above triplets sampled and added to the training data.

``` batch size = 256, max input length = 512, learning rate = 1e-4, epochs = 6, warmup = 1000 steps, fp16```
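
To make the base-model schedule concrete, the arithmetic below estimates the number of optimizer steps and warmup steps. It is a rough sketch under stated assumptions: exactly 6,000,000 triplets (the card says "over 6M"), one triplet per training example, a global batch size of 1024, and "10%" read as 10% of total steps. The function name `schedule_steps` is hypothetical.

```python
import math


def schedule_steps(num_examples, batch_size, epochs, warmup_percent):
    """Return (steps_per_epoch, total_steps, warmup_steps) for a simple schedule."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * epochs
    # Integer arithmetic avoids floating-point rounding in the percentage.
    warmup_steps = total_steps * warmup_percent // 100
    return steps_per_epoch, total_steps, warmup_steps


# Assumed dataset size of exactly 6M triplets; the real count is somewhat higher.
base = schedule_steps(num_examples=6_000_000, batch_size=1024, epochs=2, warmup_percent=10)
print(base)  # (5860, 11720, 1172)
```

The adapter stage instead uses a fixed warmup of 1000 steps, so no percentage calculation is needed there.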
|
|
|
|
|
# Evaluation |
|
|
|
We evaluate the model on [SciRepEval](https://github.com/allenai/scirepeval), a large-scale evaluation benchmark for scientific embedding tasks which includes SciDocs as a subset.

We also evaluate and establish a new SoTA on [MDCR](https://github.com/zoranmedic/mdcr), a large-scale citation recommendation benchmark.
|
|
|
|Model|SciRepEval In-Train|SciRepEval Out-of-Train|SciRepEval Avg|MDCR (MAP, Recall@5)|
|--|--|--|--|--|
|[BM-25](https://api.semanticscholar.org/CorpusID:252199740)|n/a|n/a|n/a|(33.7, 28.5)|
|[SPECTER](https://huggingface.co/allenai/specter)|54.7|57.4|68.0|(30.6, 25.5)|
|[SciNCL](https://huggingface.co/malteos/scincl)|55.6|57.8|69.0|(32.6, 27.3)|
|[SciRepEval-Adapters](https://huggingface.co/models?search=scirepeval)|61.9|59.0|70.9|(35.3, 29.6)|
|[SPECTER 2.0-Adapters](https://huggingface.co/models?search=allenai/specter-2)|**62.3**|**59.2**|**71.2**|**(38.4, 33.0)**|
|
|
|
Please cite the following works if you end up using SPECTER 2.0: |
|
|
|
[SPECTER paper](https://api.semanticscholar.org/CorpusID:215768677): |
|
|
|
```bibtex
@inproceedings{specter2020cohan,
  title={{SPECTER: Document-level Representation Learning using Citation-informed Transformers}},
  author={Arman Cohan and Sergey Feldman and Iz Beltagy and Doug Downey and Daniel S. Weld},
  booktitle={ACL},
  year={2020}
}
```
|
[SciRepEval paper](https://api.semanticscholar.org/CorpusID:254018137):

```bibtex
@article{Singh2022SciRepEvalAM,
  title={SciRepEval: A Multi-Format Benchmark for Scientific Document Representations},
  author={Amanpreet Singh and Mike D'Arcy and Arman Cohan and Doug Downey and Sergey Feldman},
  journal={ArXiv},
  year={2022},
  volume={abs/2211.13308}
}
```
|