---
license: cc-by-4.0
---

# KeyBART

KeyBART, as described in "Learning Rich Representation of Keyphrases from Text" (https://arxiv.org/pdf/2112.08547.pdf), pre-trains a BART-based architecture to produce a concatenated sequence of keyphrases in the CatSeqD format.

We provide some examples of downstream evaluation setups, and also show how the model can be used for text-to-text generation in a zero-shot setting.

## Downstream Evaluation

### Keyphrase Generation
```
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("bloomberg/KeyBART")
model = AutoModelForSeq2SeqLM.from_pretrained("bloomberg/KeyBART")

from datasets import load_dataset

dataset = load_dataset("midas/kp20k")
```
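
The snippet above only loads the pre-trained checkpoint and the KP20k benchmark. As a rough sketch (not the evaluation pipeline behind the numbers below) of how documents can be paired with CatSeq-style targets, the gold keyphrases can be joined with `;`. The field names used here (`document`, `extractive_keyphrases`, `abstractive_keyphrases`) are assumptions about the midas/kp20k schema, so check the dataset card before relying on them.

```
# Sketch only: build (source text, ";"-joined keyphrase) pairs for fine-tuning
# or evaluation. The field names below are assumed, not guaranteed by this card.
def to_catseq(example):
    source = " ".join(example["document"])  # assumes the document is stored as a list of tokens
    keyphrases = example["extractive_keyphrases"] + example["abstractive_keyphrases"]
    return {"source": source, "target": ";".join(keyphrases)}

catseq_dataset = dataset.map(to_catseq)
```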

Reported Results:

#### Present Keyphrase Generation

| Model | Inspec F1@5 | Inspec F1@M | NUS F1@5 | NUS F1@M | Krapivin F1@5 | Krapivin F1@M | SemEval F1@5 | SemEval F1@M | KP20k F1@5 | KP20k F1@M |
|---|---|---|---|---|---|---|---|---|---|---|
| catSeq | 22.5 | 26.2 | 32.3 | 39.7 | 26.9 | 35.4 | 24.2 | 28.3 | 29.1 | 36.7 |
| catSeqTG | 22.9 | 27.0 | 32.5 | 39.3 | 28.2 | 36.6 | 24.6 | 29.0 | 29.2 | 36.6 |
| catSeqTG-2RF1 | 25.3 | 30.1 | 37.5 | 43.3 | 30.0 | 36.9 | 28.7 | 32.9 | 32.1 | 38.6 |
| GANMR | 25.8 | 29.9 | 34.8 | 41.7 | 28.8 | 36.9 | N/A | N/A | 30.3 | 37.8 |
| ExHiRD-h | 25.3 | 29.1 | N/A | N/A | 28.6 | 34.7 | 28.4 | 33.5 | 31.1 | 37.4 |
| Transformer (Ye et al., 2021) | 28.15 | 32.56 | 37.07 | 41.91 | 31.58 | 36.55 | 28.71 | 32.52 | 33.21 | 37.71 |
| BART* | 23.59 | 28.46 | 35.00 | 42.65 | 26.91 | 35.37 | 26.72 | 31.91 | 29.25 | 37.51 |
| KeyBART-DOC* | 24.42 | 29.57 | 31.37 | 39.24 | 24.21 | 32.60 | 24.69 | 30.50 | 28.82 | 37.59 |
| KeyBART* | 24.49 | 29.69 | 34.77 | 43.57 | 29.24 | 38.62 | 27.47 | 33.54 | 30.71 | 39.76 |
| KeyBART* (Zero-shot) | 30.72 | 36.89 | 18.86 | 21.67 | 18.35 | 20.46 | 20.25 | 25.82 | 12.57 | 15.41 |

#### Absent Keyphrase Generation

| Model | Inspec F1@5 | Inspec F1@M | NUS F1@5 | NUS F1@M | Krapivin F1@5 | Krapivin F1@M | SemEval F1@5 | SemEval F1@M | KP20k F1@5 | KP20k F1@M |
|---|---|---|---|---|---|---|---|---|---|---|
| catSeq | 0.4 | 0.8 | 1.6 | 2.8 | 1.8 | 3.6 | 1.6 | 2.8 | 1.5 | 3.2 |
| catSeqTG | 0.5 | 1.1 | 1.1 | 1.8 | 1.8 | 3.4 | 1.1 | 1.8 | 1.5 | 3.2 |
| catSeqTG-2RF1 | 1.2 | 2.1 | 1.9 | 3.1 | 3.0 | 5.3 | 2.1 | 3.0 | 2.7 | 5.0 |
| GANMR | 1.3 | 1.9 | 2.6 | 3.8 | 4.2 | 5.7 | N/A | N/A | 3.2 | 4.5 |
| ExHiRD-h | 1.1 | 2.2 | N/A | N/A | 2.2 | 4.3 | 1.7 | 2.5 | 1.6 | 3.2 |
| Transformer (Ye et al., 2021) | 1.02 | 1.94 | 2.82 | 4.82 | 3.21 | 6.04 | 2.05 | 2.33 | 2.31 | 4.61 |
| BART* | 1.08 | 1.96 | 1.80 | 2.75 | 2.59 | 4.91 | 1.34 | 1.75 | 1.77 | 3.56 |
| KeyBART-DOC* | 0.99 | 2.03 | 1.39 | 2.74 | 2.40 | 4.58 | 1.07 | 1.39 | 1.69 | 3.38 |
| KeyBART* | 0.95 | 1.81 | 1.23 | 1.90 | 3.09 | 6.08 | 1.96 | 2.65 | 2.03 | 4.26 |
| KeyBART* (Zero-shot) | 1.83 | 2.92 | 1.46 | 2.19 | 1.29 | 2.09 | 1.12 | 1.45 | 0.70 | 1.14 |

### Abstractive Summarization
```
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("bloomberg/KeyBART")
model = AutoModelForSeq2SeqLM.from_pretrained("bloomberg/KeyBART")

from datasets import load_dataset

dataset = load_dataset("cnn_dailymail", "3.0.0")  # a config name is required; "3.0.0" is the standard non-anonymized version
```
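
The ROUGE scores below come from fine-tuning the pre-trained checkpoint on CNN/DailyMail rather than from using it out of the box. A minimal fine-tuning skeleton is sketched here; the preprocessing lengths and training hyperparameters are illustrative placeholders, not the configuration behind the reported results.

```
# Illustrative fine-tuning skeleton (not the paper's training recipe):
# max lengths, batch size, and epochs are placeholder values.
from transformers import DataCollatorForSeq2Seq, Seq2SeqTrainer, Seq2SeqTrainingArguments

def preprocess(batch):
    model_inputs = tokenizer(batch["article"], max_length=512, truncation=True)
    labels = tokenizer(batch["highlights"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, batched=True, remove_columns=dataset["train"].column_names)

training_args = Seq2SeqTrainingArguments(
    output_dir="keybart-cnn_dailymail",
    per_device_train_batch_size=4,
    num_train_epochs=3,
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```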

Reported Results:

| Model | ROUGE-1 | ROUGE-2 | ROUGE-L |
|---|---|---|---|
| BART (Lewis et al., 2019) | 44.16 | 21.28 | 40.90 |
| BART* | 42.93 | 20.12 | 39.72 |
| KeyBART-DOC* | 42.92 | 20.07 | 39.69 |
| KeyBART* | 43.10 | 20.26 | 39.90 |

## Zero-shot settings
```
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("bloomberg/KeyBART")
model = AutoModelForSeq2SeqLM.from_pretrained("bloomberg/KeyBART")
```
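
To generate keyphrases locally with the loaded checkpoint, the model can be called directly. The sketch below feeds in the sample input shown further down; the beam size and maximum length are illustrative choices, not settings prescribed by this card.

```
# Sketch: zero-shot keyphrase generation for a single document.
# num_beams and max_length are illustrative, not tuned values.
text = "In this work, we explore how to learn task specific language models ..."  # e.g. the sample input below

inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
outputs = model.generate(**inputs, num_beams=5, max_length=64, early_stopping=True)

# KeyBART emits a single ";"-separated keyphrase sequence (see the sample output below).
decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
print([kp.strip() for kp in decoded.split(";") if kp.strip()])
```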

Alternatively, use the Hosted Inference API console provided at https://huggingface.co/bloomberg/KeyBART

Sample zero-shot result:

```
Input: In this work, we explore how to learn task specific language models aimed towards learning rich representation of keyphrases from text documents.
We experiment with different masking strategies for pre-training transformer language models (LMs) in discriminative as well as generative settings.
In the discriminative setting, we introduce a new pre-training objective - Keyphrase Boundary Infilling with Replacement (KBIR),
showing large gains in performance (upto 9.26 points in F1) over SOTA, when LM pre-trained using KBIR is fine-tuned for the task of keyphrase extraction.
In the generative setting, we introduce a new pre-training setup for BART - KeyBART, that reproduces the keyphrases related to the input text in the CatSeq
format, instead of the denoised original input. This also led to gains in performance (upto 4.33 points in F1@M) over SOTA for keyphrase generation.
Additionally, we also fine-tune the pre-trained language models on named entity recognition (NER), question answering (QA), relation extraction (RE),
abstractive summarization and achieve comparable performance with that of the SOTA, showing that learning rich representation of keyphrases is indeed beneficial
for many other fundamental NLP tasks.

Output: language model;keyphrase generation;new pre-training objective;pre-training setup;
```

Please direct all questions to mkulkarni24@bloomberg.net