---
license: apache-2.0
---
|
|
|
# KeyBART |
|
KeyBART, as described in "Learning Rich Representation of Keyphrases from Text" (https://arxiv.org/pdf/2112.08547.pdf), pre-trains a BART-based architecture to produce a concatenated sequence of keyphrases in the CatSeqD format.
|
|
|
We provide examples of downstream evaluation setups below, and also show how KeyBART can be used for text-to-text generation in a zero-shot setting.
|
|
|
## Downstream Evaluation |
|
|
|
### Keyphrase Generation |
|
```
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("bloomberg/KeyBART")
model = AutoModelForSeq2SeqLM.from_pretrained("bloomberg/KeyBART")

# midas/kp20k may require selecting a dataset configuration,
# e.g. load_dataset("midas/kp20k", "generation")
dataset = load_dataset("midas/kp20k")
```
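
Below is a minimal, hedged generation sketch. The split and field names (`test`, `document`) and the decoding parameters are illustrative assumptions about the `midas/kp20k` layout, not the exact evaluation setup from the paper.

```
# Illustrative sketch: generate a keyphrase sequence for one KP20k document.
# Field names and decoding settings are assumptions, not the paper's exact setup.
sample = dataset["test"][0]
doc = sample["document"]
text = " ".join(doc) if isinstance(doc, list) else doc  # handle token-list or raw-string fields

inputs = tokenizer(text, max_length=512, truncation=True, return_tensors="pt")
outputs = model.generate(**inputs, max_length=64, num_beams=5, early_stopping=True)

# KeyBART emits keyphrases as one concatenated sequence (CatSeqD format).
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```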
|
|
|
Reported Results: |
|
|
|
#### Present Keyphrase Generation |
|
|               | Inspec |       | NUS   |       | Krapivin |       | SemEval |       | KP20k |       |
|---------------|--------|-------|-------|-------|----------|-------|---------|-------|-------|-------|
| Model         | F1@5   | F1@M  | F1@5  | F1@M  | F1@5     | F1@M  | F1@5    | F1@M  | F1@5  | F1@M  |
| catSeq        | 22.5   | 26.2  | 32.3  | 39.7  | 26.9     | 35.4  | 24.2    | 28.3  | 29.1  | 36.7  |
| catSeqTG      | 22.9   | 27.0  | 32.5  | 39.3  | 28.2     | 36.6  | 24.6    | 29.0  | 29.2  | 36.6  |
| catSeqTG-2RF1 | 25.3   | 30.1  | 37.5  | 43.3  | 30.0     | 36.9  | 28.7    | 32.9  | 32.1  | 38.6  |
| GANMR         | 25.8   | 29.9  | 34.8  | 41.7  | 28.8     | 36.9  | N/A     | N/A   | 30.3  | 37.8  |
| ExHiRD-h      | 25.3   | 29.1  | N/A   | N/A   | 28.6     | 34.7  | 28.4    | 33.5  | 31.1  | 37.4  |
| Transformer (Ye et al., 2021) | 28.15 | 32.56 | 37.07 | 41.91 | 31.58 | 36.55 | 28.71 | 32.52 | 33.21 | 37.71 |
| BART*         | 23.59  | 28.46 | 35.00 | 42.65 | 26.91    | 35.37 | 26.72   | 31.91 | 29.25 | 37.51 |
| KeyBART-DOC*  | 24.42  | 29.57 | 31.37 | 39.24 | 24.21    | 32.60 | 24.69   | 30.50 | 28.82 | 37.59 |
| KeyBART*      | 24.49  | 29.69 | 34.77 | 43.57 | 29.24    | 38.62 | 27.47   | 33.54 | 30.71 | 39.76 |
| KeyBART* (Zero-shot) | 30.72 | 36.89 | 18.86 | 21.67 | 18.35 | 20.46 | 20.25 | 25.82 | 12.57 | 15.41 |
|
|
|
#### Absent Keyphrase Generation |
|
|               | Inspec |      | NUS  |      | Krapivin |      | SemEval |      | KP20k |      |
|---------------|--------|------|------|------|----------|------|---------|------|-------|------|
| Model         | F1@5   | F1@M | F1@5 | F1@M | F1@5     | F1@M | F1@5    | F1@M | F1@5  | F1@M |
| catSeq        | 0.4    | 0.8  | 1.6  | 2.8  | 1.8      | 3.6  | 1.6     | 2.8  | 1.5   | 3.2  |
| catSeqTG      | 0.5    | 1.1  | 1.1  | 1.8  | 1.8      | 3.4  | 1.1     | 1.8  | 1.5   | 3.2  |
| catSeqTG-2RF1 | 1.2    | 2.1  | 1.9  | 3.1  | 3.0      | 5.3  | 2.1     | 3.0  | 2.7   | 5.0  |
| GANMR         | 1.3    | 1.9  | 2.6  | 3.8  | 4.2      | 5.7  | N/A     | N/A  | 3.2   | 4.5  |
| ExHiRD-h      | 1.1    | 2.2  | N/A  | N/A  | 2.2      | 4.3  | 1.7     | 2.5  | 1.6   | 3.2  |
| Transformer (Ye et al., 2021) | 1.02 | 1.94 | 2.82 | 4.82 | 3.21 | 6.04 | 2.05 | 2.33 | 2.31 | 4.61 |
| BART*         | 1.08   | 1.96 | 1.80 | 2.75 | 2.59     | 4.91 | 1.34    | 1.75 | 1.77  | 3.56 |
| KeyBART-DOC*  | 0.99   | 2.03 | 1.39 | 2.74 | 2.40     | 4.58 | 1.07    | 1.39 | 1.69  | 3.38 |
| KeyBART*      | 0.95   | 1.81 | 1.23 | 1.90 | 3.09     | 6.08 | 1.96    | 2.65 | 2.03  | 4.26 |
| KeyBART* (Zero-shot) | 1.83 | 2.92 | 1.46 | 2.19 | 1.29 | 2.09 | 1.12 | 1.45 | 0.70 | 1.14 |
|
|
|
|
|
### Abstractive Summarization |
|
```
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("bloomberg/KeyBART")
model = AutoModelForSeq2SeqLM.from_pretrained("bloomberg/KeyBART")

# cnn_dailymail requires a dataset configuration name, e.g. "3.0.0"
dataset = load_dataset("cnn_dailymail", "3.0.0")
```
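
The reported numbers below are for models fine-tuned on CNN/DailyMail; the raw `bloomberg/KeyBART` checkpoint generates keyphrases, not summaries. As a hedged sketch, inference with such a fine-tuned checkpoint could look like this (the `article` field name and the decoding settings are illustrative assumptions):

```
# Illustrative sketch: summarize one CNN/DailyMail article with a
# checkpoint that has been fine-tuned for summarization.
article = dataset["test"][0]["article"]

inputs = tokenizer(article, max_length=1024, truncation=True, return_tensors="pt")
summary_ids = model.generate(
    **inputs, num_beams=4, min_length=56, max_length=142, length_penalty=2.0
)

print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```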
|
|
|
Reported Results: |
|
|
|
| Model        | R1    | R2    | RL    |
|--------------|-------|-------|-------|
| BART (Lewis et al., 2019) | 44.16 | 21.28 | 40.90 |
| BART*        | 42.93 | 20.12 | 39.72 |
| KeyBART-DOC* | 42.92 | 20.07 | 39.69 |
| KeyBART*     | 43.10 | 20.26 | 39.90 |
|
|
|
## Zero-shot settings |
|
```
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("bloomberg/KeyBART")
model = AutoModelForSeq2SeqLM.from_pretrained("bloomberg/KeyBART")
```
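
A hedged zero-shot sketch follows; the input text and decoding settings are illustrative assumptions. The decoded output is a ";"-separated keyphrase sequence, as in the sample result further below.

```
# Illustrative sketch: zero-shot keyphrase generation from a short abstract.
text = (
    "In this work, we explore how to learn task specific language models "
    "aimed towards learning rich representation of keyphrases from text documents."
)

inputs = tokenizer(text, max_length=512, truncation=True, return_tensors="pt")
outputs = model.generate(**inputs, max_length=64, num_beams=5, early_stopping=True)

# Keyphrases are returned as a single ";"-separated sequence (CatSeqD format).
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```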
|
|
|
Alternatively, use the Hosted Inference API console provided at https://huggingface.co/bloomberg/KeyBART.
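
To call the hosted model programmatically rather than through the console, the standard Hugging Face Inference API pattern can be used (the endpoint URL follows the usual `api-inference.huggingface.co/models/<model-id>` scheme; the token placeholder below is an assumption you must replace):

```
import requests

# Standard Hugging Face Inference API call; replace the placeholder with your own token.
API_URL = "https://api-inference.huggingface.co/models/bloomberg/KeyBART"
headers = {"Authorization": "Bearer <YOUR_HF_API_TOKEN>"}

payload = {"inputs": "In this work, we explore how to learn task specific language models ..."}
response = requests.post(API_URL, headers=headers, json=payload)
print(response.json())
```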
|
|
|
Sample zero-shot result:
|
|
|
```
Input: In this work, we explore how to learn task specific language models aimed towards learning rich representation of keyphrases from text documents.
We experiment with different masking strategies for pre-training transformer language models (LMs) in discriminative as well as generative settings.
In the discriminative setting, we introduce a new pre-training objective - Keyphrase Boundary Infilling with Replacement (KBIR),
showing large gains in performance (upto 9.26 points in F1) over SOTA, when LM pre-trained using KBIR is fine-tuned for the task of keyphrase extraction.
In the generative setting, we introduce a new pre-training setup for BART - KeyBART, that reproduces the keyphrases related to the input text in the CatSeq
format, instead of the denoised original input. This also led to gains in performance (upto 4.33 points in F1@M) over SOTA for keyphrase generation.
Additionally, we also fine-tune the pre-trained language models on named entity recognition (NER), question answering (QA), relation extraction (RE),
abstractive summarization and achieve comparable performance with that of the SOTA, showing that learning rich representation of keyphrases is indeed beneficial
for many other fundamental NLP tasks.

Output: language model;keyphrase generation;new pre-training objective;pre-training setup;
```
|
|
|
|
|
Please direct all questions to mkulkarni24@bloomberg.net |