---
license: apache-2.0
language: en
datasets:
- wikipedia
- bookcorpus
tags:
- bert
- exbert
- linkbert
- feature-extraction
- fill-mask
- question-answering
- text-classification
- token-classification
---
## LinkBERT-base

LinkBERT-base is a model pretrained on English Wikipedia articles along with their hyperlink information. It was introduced in the paper [LinkBERT: Pretraining Language Models with Document Links (ACL 2022)](https://arxiv.org/abs/2203.15827). The code and data are available in [this repository](https://github.com/michiyasunaga/LinkBERT).
## Model description

LinkBERT is a transformer encoder (BERT-like) model pretrained on a large corpus of documents. It improves on BERT by additionally capturing **document links**, such as hyperlinks and citation links, to incorporate knowledge that spans multiple documents. Specifically, it was pretrained by placing linked documents in the same language model context, in addition to single documents.
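Concretely, this means a pretraining example can pair a document with one it links to inside a single BERT-style input. A hypothetical sketch of assembling such an input (the function and segment layout here are illustrative, not taken from the LinkBERT codebase):

```python
def build_linked_example(anchor_doc, linked_doc, cls="[CLS]", sep="[SEP]"):
    # Place the anchor document and a document it links to into one
    # BERT-style context: [CLS] anchor [SEP] linked [SEP].
    # Masked tokens can then attend across both documents.
    return f"{cls} {anchor_doc} {sep} {linked_doc} {sep}"

example = build_linked_example(
    "Tidal forces on Earth are caused by the Moon's gravity.",
    "Gravity is described by the theory of general relativity.",
)
print(example)
```

With single-document pretraining, the second segment would instead come from the same document; feeding a linked document lets the model learn dependencies that cross document boundaries.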
LinkBERT can be used as a drop-in replacement for BERT. It achieves better performance for general language understanding tasks (e.g. text classification), and is also particularly effective for **knowledge-intensive** tasks (e.g. question answering) and **cross-document** tasks (e.g. reading comprehension, document retrieval).
## Intended uses & limitations

The model is intended to be fine-tuned on a downstream task, such as question answering, sequence classification, or token classification.
You can also use the raw model for feature extraction (i.e. obtaining embeddings for input text).
### How to use
To use the model to get the features of a given text in PyTorch:
```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('michiyasunaga/LinkBERT-base')
model = AutoModel.from_pretrained('michiyasunaga/LinkBERT-base')
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
outputs = model(**inputs)
last_hidden_states = outputs.last_hidden_state
```
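
The `last_hidden_state` tensor above has shape `(batch, sequence_length, hidden_size)`. One common way to turn it into a single fixed-size embedding per input is attention-mask-aware mean pooling. A minimal sketch with dummy tensors (the pooling logic is generic BERT-style post-processing, not specific to LinkBERT):

```python
import torch

def mean_pool(last_hidden_state, attention_mask):
    # Zero out padding positions, then average over real tokens only.
    mask = attention_mask.unsqueeze(-1).float()      # (batch, seq, 1)
    summed = (last_hidden_state * mask).sum(dim=1)   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1e-9)         # (batch, 1)
    return summed / counts

# Dummy example: batch of 2, sequence length 4, hidden size 8.
hidden = torch.randn(2, 4, 8)
mask = torch.tensor([[1, 1, 1, 0], [1, 1, 0, 0]])
emb = mean_pool(hidden, mask)
print(emb.shape)  # torch.Size([2, 8])
```

In practice you would pass `outputs.last_hidden_state` and `inputs["attention_mask"]` from the snippet above; masking before averaging keeps padding tokens from diluting the embedding.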
For fine-tuning, you can use [this repository](https://github.com/michiyasunaga/LinkBERT) or follow any other BERT fine-tuning codebases.
## Evaluation results
When fine-tuned on downstream tasks, LinkBERT achieves the following results.
**General benchmarks ([MRQA](https://github.com/mrqa/MRQA-Shared-Task-2019) and [GLUE](https://gluebenchmark.com/)):**
|                    | HotpotQA | TriviaQA | SearchQA | NaturalQ | NewsQA   | SQuAD    | GLUE      |
| ------------------ | -------- | -------- | -------- | -------- | -------- | -------- | --------- |
|                    | F1       | F1       | F1       | F1       | F1       | F1       | Avg score |
| BERT-base          | 76.0     | 70.3     | 74.2     | 76.5     | 65.7     | 88.7     | 79.2      |
| **LinkBERT-base**  | **78.2** | **73.9** | **76.8** | **78.3** | **69.3** | **90.1** | **79.6**  |
| BERT-large         | 78.1     | 73.7     | 78.3     | 79.0     | 70.9     | 91.1     | 80.7      |
| **LinkBERT-large** | **80.8** | **78.2** | **80.5** | **81.0** | **72.6** | **92.7** | **81.1**  |
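
The QA numbers above are token-level F1 scores. As a reminder of what that metric measures, here is a simplified sketch of F1 between a predicted and a gold answer span (it lowercases and splits on whitespace, omitting the punctuation and article normalization used in the official SQuAD/MRQA scripts):

```python
from collections import Counter

def token_f1(prediction, gold):
    # Token overlap between prediction and gold answer.
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("the Eiffel Tower", "Eiffel Tower"))  # 0.8
```

Partial-credit overlap is why F1, rather than exact match, is the headline metric for extractive QA benchmarks.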
## Citation
If you find LinkBERT useful in your project, please cite the following:
```bibtex
@InProceedings{yasunaga2022linkbert,
  author    = {Michihiro Yasunaga and Jure Leskovec and Percy Liang},
  title     = {LinkBERT: Pretraining Language Models with Document Links},
  year      = {2022},
  booktitle = {Association for Computational Linguistics (ACL)},
}
```