# Model Card for qcpg-sentences

# Model Details
# Quality Controlled Paraphrase Generation (ACL 2022)

> Paraphrase generation has been widely used in various downstream tasks. Most tasks benefit mainly from high quality paraphrases, namely those that are semantically similar to, yet linguistically diverse from, the original sentence. Generating high-quality paraphrases is challenging as it becomes increasingly hard to preserve meaning as linguistic diversity increases. Recent works achieve nice results by controlling specific aspects of the paraphrase, such as its syntactic tree. However, they do not allow to directly control the quality of the generated paraphrase, and suffer from low flexibility and scalability.

<img src="/assets/images/ilus.jpg" width="40%">

> Here we propose `QCPG`, a quality-guided controlled paraphrase generation model, that allows directly controlling the quality dimensions. Furthermore, we suggest a method that given a sentence, identifies points in the quality control space that are expected to yield optimal generated paraphrases. We show that our method is able to generate paraphrases which maintain the original meaning while achieving higher diversity than the uncontrolled baseline.

## Training, Evaluation and Inference

The code for training, evaluation and inference for both `QCPG` and `QP` is located in the dedicated directories. Scripts necessary for reproducing the experiments can be found in the `QCPG/scripts` and `QP/scripts` directories.

Make sure to run `QCPG/scripts/prepare_data.sh` and set the missing dataset directories accordingly before training!

<img src="/assets/images/arch.png" width="90%">

## Trained Models

```
!!! Notice !!! Our results show that on average QCPG follows the quality conditions and is capable of generating higher-quality greedy-sampled paraphrases than a fine-tuned model. This does not mean it will output perfect paraphrases every time! In practice, for best performance, we highly recommend: (1) finding the right quality control values, (2) using more sophisticated sampling methods, and (3) applying post-generation monitoring and filtering (see the sketch at the end of the Usage section below).
```

[`qcpg-questions`](https://huggingface.co/ibm/qcpg-questions) (Trained on `data/wikians`)

[`qcpg-sentences`](https://huggingface.co/ibm/qcpg-sentences) (Trained on `data/parabk2`)

[`qcpg-captions`](https://huggingface.co/ibm/qcpg-captions) (Trained on `data/mscoco`)
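
All three checkpoints are standard `transformers` text2text models, so as a quick sanity check each can be loaded directly with the `pipeline` API; note that the wrapper in the Usage section below is what builds the quality-control tokens the model was trained with:

```python
from transformers import pipeline

# Load a released checkpoint directly. Without the control tokens described
# in the Usage section, the model is not given the quality conditioning it
# was trained with.
pipe = pipeline('text2text-generation', model='ibm/qcpg-sentences')
```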

## Usage

The best way to use the model is with the following code:
```python
from transformers import pipeline


class QualityControlPipeline:

    def __init__(self, type):
        assert type in ['captions', 'questions', 'sentences']
        self.pipe = pipeline('text2text-generation', model=f'ibm/qcpg-{type}')
        # Valid range, in percentage points, of each control dimension.
        self.ranges = {
            'captions': {'lex': [0, 90], 'syn': [0, 80], 'sem': [0, 95]},
            'sentences': {'lex': [0, 100], 'syn': [0, 80], 'sem': [0, 95]},
            'questions': {'lex': [0, 90], 'syn': [0, 75], 'sem': [0, 95]}
        }[type]

    def __call__(self, text, lexical, syntactic, semantic, **kwargs):
        assert all(0 <= val <= 1 for val in [lexical, syntactic, semantic]), \
            f'control values must be between 0 and 1, got {lexical}, {syntactic}, {semantic}'
        names = ['semantic_sim', 'lexical_div', 'syntactic_div']
        # Scale each control value to percentage points, rounded to the nearest 5...
        control = [int(5 * round(val * 100 / 5)) for val in [semantic, lexical, syntactic]]
        # ...then clamp it to the valid range of its dimension.
        control = {name: max(min(val, self.ranges[name[:3]][1]), self.ranges[name[:3]][0])
                   for name, val in zip(names, control)}
        # Render the values as control tokens, e.g. COND_SEMANTIC_SIM_80.
        control = [f'COND_{name.upper()}_{control[name]}' for name in names]
        assert all(cond in self.pipe.tokenizer.additional_special_tokens for cond in control)
        # Prepend the control tokens to the input text, or to each text in a batch.
        text = ' '.join(control) + ' ' + text if isinstance(text, str) \
            else [' '.join(control) + ' ' + t for t in text]
        return self.pipe(text, **kwargs)
```
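
For intuition, this is the input the wrapper builds for the example call shown below (`lexical=0.3, syntactic=0.5, semantic=0.8`): each value is scaled to percentage points, rounded to the nearest 5, clamped to the ranges above, and prepended as a control token:

```
COND_SEMANTIC_SIM_80 COND_LEXICAL_DIV_30 COND_SYNTACTIC_DIV_50 Is this going to work or what are we doing here?
```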

Loading:
```python
model = QualityControlPipeline('sentences')
```

Generation with quality controls:
```python
model('Is this going to work or what are we doing here?', lexical=0.3, syntactic=0.5, semantic=0.8)
```
Output: `[{'generated_text': "Will it work or what is it we're doing?"}]`
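
As the notice in the Trained Models section recommends, the single greedy output is not always the best the model can do. Below is a minimal sketch of recommendations (2) and (3), assuming the wrapper above: the extra keyword arguments are forwarded to the underlying `transformers` pipeline, so standard generation arguments such as `do_sample`, `top_p`, and `num_return_sequences` work. The filter at the end is only a placeholder; in practice you would score candidates with a semantic-similarity model.

```python
source = 'Is this going to work or what are we doing here?'

# (2) Sample several candidates instead of taking the single greedy output.
# These keyword arguments are forwarded to the underlying generate() call.
candidates = model(source, lexical=0.3, syntactic=0.5, semantic=0.8,
                   do_sample=True, top_p=0.9, num_return_sequences=5)

# (3) Post-generation filtering. A trivial placeholder: drop empty outputs
# and exact copies of the source; replace this with a semantic-similarity
# check and a quality threshold for real use.
paraphrases = [c['generated_text'] for c in candidates
               if c['generated_text'].strip() and c['generated_text'] != source]
print(paraphrases)
```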

## Citation

```
@inproceedings{bandel-etal-2022-quality,
    title = "Quality Controlled Paraphrase Generation",
    author = "Bandel, Elron and
      Aharonov, Ranit and
      Shmueli-Scheuer, Michal and
      Shnayderman, Ilya and
      Slonim, Noam and
      Ein-Dor, Liat",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.acl-long.45",
    pages = "596--609",
    abstract = "Paraphrase generation has been widely used in various downstream tasks. Most tasks benefit mainly from high quality paraphrases, namely those that are semantically similar to, yet linguistically diverse from, the original sentence. Generating high-quality paraphrases is challenging as it becomes increasingly hard to preserve meaning as linguistic diversity increases. Recent works achieve nice results by controlling specific aspects of the paraphrase, such as its syntactic tree. However, they do not allow to directly control the quality of the generated paraphrase, and suffer from low flexibility and scalability. Here we propose QCPG, a quality-guided controlled paraphrase generation model, that allows directly controlling the quality dimensions. Furthermore, we suggest a method that given a sentence, identifies points in the quality control space that are expected to yield optimal generated paraphrases. We show that our method is able to generate paraphrases which maintain the original meaning while achieving higher diversity than the uncontrolled baseline. The models, the code, and the data can be found in https://github.com/IBM/quality-controlled-paraphrase-generation.",
}
```

## Model Description

The model creators note in the [associated paper](https://arxiv.org/pdf/2203.10940.pdf):