ZoneTwelve committed on
Commit
3ef9b51
·
1 Parent(s): bf8d6f5

Update README.md and WARNING the user I'm not the Author.

Files changed (1)
  1. README.md +115 -3
README.md CHANGED
@@ -1,3 +1,115 @@
1
- ---
2
- license: apache-2.0
3
- ---
1
+ ---
2
+ license: apache-2.0
3
+ ---
4
+
5
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/63ea0de943d976de6e4e54fb/-zXQ3G2iKCCAq6x8gPGm7.png" width="300" class="left"><img src="https://cdn-uploads.huggingface.co/production/uploads/63ea0de943d976de6e4e54fb/r1vY_i4DmL5shXAm_CMs9.png" width="400" class="center">
6
+
7
+ This is the Repo for the paper: [BARTScore: Evaluating Generated Text as Text Generation](https://arxiv.org/abs/2106.11520)
8
+
9
+ ## Updates
10
+ - 2021.09.29 Paper gets accepted to NeurIPS 2021 :tada:
11
+ - 2021.08.18 Release code
12
+ - 2021.06.28 Release online evaluation [Demo](http://bartscore.sh/)
13
+ - 2021.06.25 Release online Explainable Leaderboard for [Meta-evaluation](http://explainaboard.nlpedia.ai/leaderboard/task-meval/index.php)
14
+ - 2021.06.22 Code will be released soon
15
+
16
+ ## Background
17
+ There is a recent trend that leverages neural models for automated evaluation in different ways, as shown in Fig.1.
18
+
19
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/63ea0de943d976de6e4e54fb/jfRv5wmLud1uYivH4ZG6c.png" width=650 class="left">
20
+
21
+ (a) **Evaluation as matching task.** Unsupervised matching metrics aim to measure the semantic equivalence between the reference and hypothesis by using token-level matching functions in distributed representation space (e.g. BERT) or discrete string space (e.g. ROUGE).
22
+
23
+ (b) **Evaluation as regression task.** Regression-based metrics (e.g. BLEURT) introduce a parameterized regression layer, which would be learned in a supervised fashion to accurately predict human judgments.
24
+
25
+ (c) **Evaluation as ranking task.** Ranking-based metrics (e.g. COMET) aim to learn a scoring function that assigns a higher score to better hypotheses than to worse ones.
26
+
27
+ (d) **Evaluation as generation task.** In this work, we formulate evaluating generated text as a text generation task from pre-trained language models.
28
+
29
+ ## Our Work
30
+ Basic requirements for all the libraries are in `requirements.txt`.
31
+
32
+ ### Direct use
33
+ Our trained BARTScore (on ParaBank2) can be downloaded [here](https://drive.google.com/file/d/1_7JfF7KOInb7ZrxKHIigTMR4ChVET01m/view?usp=sharing). Example usage is shown below.
34
+
35
+ ```python
36
+ # To use the CNNDM version BARTScore
37
+ >>> from bart_score import BARTScorer
38
+ >>> bart_scorer = BARTScorer(device='cuda:0', checkpoint='facebook/bart-large-cnn')
39
+ >>> bart_scorer.score(['This is interesting.'], ['This is fun.'], batch_size=4) # generation scores from the first list of texts to the second list of texts.
40
+ [out]
41
+ [-2.510652780532837]
42
+
43
+ # To use our trained ParaBank version BARTScore
44
+ >>> from bart_score import BARTScorer
45
+ >>> bart_scorer = BARTScorer(device='cuda:0', checkpoint='facebook/bart-large-cnn')
46
+ >>> bart_scorer.load(path='bart.pth')
47
+ >>> bart_scorer.score(['This is interesting.'], ['This is fun.'], batch_size=4)
48
+ [out]
49
+ [-2.336203098297119]
50
+ ```
51
+
52
+ We also provide multi-reference support. Please make sure you have the same number of references for each test sample. The usage is shown below.
53
+ ```python
54
+ >>> from bart_score import BARTScorer
55
+ >>> bart_scorer = BARTScorer(device='cuda:0', checkpoint='facebook/bart-large-cnn')
56
+ >>> srcs = ["I'm super happy today.", "This is a good idea."]
57
+ >>> tgts = [["I feel good today.", "I feel sad today."], ["Not bad.", "Sounds like a good idea."]] # List[List of references for each test sample]
58
+ >>> bart_scorer.multi_ref_score(srcs, tgts, agg="max", batch_size=4) # agg means aggregation, can be mean or max
59
+ [out]
60
+ [-2.5008113384246826, -1.626236081123352]
61
+ ```
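The aggregation step described above can be sketched in plain Python. This is only an illustration of what "mean" and "max" aggregation over per-reference scores mean; `aggregate_scores` and the numbers are hypothetical and are not taken from `bart_score.py`.

```python
from statistics import mean


def aggregate_scores(per_reference_scores, agg="max"):
    """Collapse each sample's per-reference scores into a single score.

    Hypothetical helper for illustration; not part of the BARTScore code.
    """
    if agg == "max":
        reduce_fn = max
    elif agg == "mean":
        reduce_fn = mean
    else:
        raise ValueError(f"agg must be 'mean' or 'max', got {agg!r}")

    # Every test sample is expected to have the same number of references.
    if len({len(scores) for scores in per_reference_scores}) > 1:
        raise ValueError("all samples must have the same number of references")

    return [reduce_fn(scores) for scores in per_reference_scores]


# Made-up scores: two samples, two references each.
scores = [[-2.5, -3.1], [-4.0, -1.6]]
best = aggregate_scores(scores, agg="max")   # keeps the best-matching reference
avg = aggregate_scores(scores, agg="mean")   # averages over all references
print(best, avg)
```

With `agg="max"` each sample keeps the score of its closest reference, while `agg="mean"` rewards hypotheses that agree with all references on average.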
62
+
63
+
64
+ ### Reproduce
65
+ To reproduce the results for each task, please see the `README.md` in each folder: `D2T` (data-to-text), `SUM` (summarization), `WMT` (machine translation). Once the scored pickle file is in the right path (in each dataset folder), you can use it to conduct analysis.
66
+
67
+ For analysis, we provide `SUMStat`, `D2TStat` and `WMTStat` in `analysis.py` that can conveniently run analysis. An example of using `SUMStat` is shown below. For detailed usage, refer to `analysis.ipynb`.
68
+
69
+ ```python
70
+ >>> from analysis import SUMStat
71
+ >>> stat = SUMStat('SUM/REALSumm/final_p.pkl')
72
+ >>> stat.evaluate_summary('litepyramid_recall')
73
+
74
+ [out]
75
+ Human metric: litepyramid_recall
76
+ metric                                             spearman    kendalltau
77
+ -------------------------------------------------  ----------  ------------
78
+ rouge1_r                                           0.497526    0.407974
79
+ bart_score_cnn_hypo_ref_de_id est                  0.49539     0.392728
80
+ bart_score_cnn_hypo_ref_de_Videlicet               0.491011    0.388237
81
+ ...
82
+ ```
83
+
84
+ ### Train your custom BARTScore
85
+ If you want to train your custom BARTScore with paired data, we provide the scripts and detailed instructions in the `train` folder. Once you have your trained model (for example, in a `my_bartscore` folder), you can use your custom BARTScore as shown below.
86
+
87
+ ```python
88
+ >>> from bart_score import BARTScorer
89
+ >>> bart_scorer = BARTScorer(device='cuda:0', checkpoint='my_bartscore')
90
+ >>> bart_scorer.score(['This is interesting.'], ['This is fun.'])
91
+ ```
92
+
93
+
94
+ ### Notes on use
95
+ Since we use the average log-likelihood of the target tokens, the calculated scores will be smaller than 0 (each probability is between 0 and 1, so its log is negative). The higher the log-likelihood, the higher the probability.
96
+
97
+ To give an example, if SummaryA gets a score of -1 while SummaryB gets a score of -100, this means that the model thinks SummaryA is better than SummaryB.
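A minimal sketch of the note above, assuming the score is simply the mean of the per-token log-probabilities. `avg_log_likelihood` and the probabilities are hypothetical and only illustrate why scores are negative and how two candidates are ranked; the real scores come from `BARTScorer.score`.

```python
import math


def avg_log_likelihood(token_probs):
    """Mean log-probability over target tokens (hypothetical helper).

    Assumes every probability lies in (0, 1].
    """
    if not token_probs:
        raise ValueError("token_probs must not be empty")
    return sum(math.log(p) for p in token_probs) / len(token_probs)


# Made-up per-token probabilities for two candidate summaries.
summary_a = [0.9, 0.8, 0.7]
summary_b = [0.2, 0.1, 0.05]

score_a = avg_log_likelihood(summary_a)
score_b = avg_log_likelihood(summary_b)

# Both scores are negative; the one closer to 0 is the preferred summary.
better = "SummaryA" if score_a > score_b else "SummaryB"
print(score_a, score_b, better)
```

Because only the ordering matters, scores are best compared between candidates scored by the same model rather than read as absolute quality values.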
98
+ ## Bib
99
+ Please cite our work if you find it useful.
100
+ ```
101
+ @inproceedings{NEURIPS2021_e4d2b6e6,
102
+ author = {Yuan, Weizhe and Neubig, Graham and Liu, Pengfei},
103
+ booktitle = {Advances in Neural Information Processing Systems},
104
+ editor = {M. Ranzato and A. Beygelzimer and Y. Dauphin and P.S. Liang and J. Wortman Vaughan},
105
+ pages = {27263--27277},
106
+ publisher = {Curran Associates, Inc.},
107
+ title = {BARTScore: Evaluating Generated Text as Text Generation},
108
+ url = {https://proceedings.neurips.cc/paper/2021/file/e4d2b6e6fdeca3e60e0f1a62fee3d9dd-Paper.pdf},
109
+ volume = {34},
110
+ year = {2021}
111
+ }
112
+ ```
113
+
114
+ WARNING: This isn't the original owner's repository
115
+ [The original repository](https://github.com/neulab/BARTScore)