lvwerra HF staff committed on
Commit
981697b
·
1 Parent(s): aa5e3a7

Update Space (evaluate main: 828c6327)

Files changed (4)
  1. README.md +94 -4
  2. app.py +6 -0
  3. bleurt.py +125 -0
  4. requirements.txt +4 -0
README.md CHANGED
@@ -1,12 +1,102 @@
  ---
- title: Bleurt
- emoji: 🐨
- colorFrom: green
  colorTo: red
  sdk: gradio
  sdk_version: 3.0.2
  app_file: app.py
  pinned: false
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces#reference
  ---
+ title: BLEURT
+ emoji: 🤗
+ colorFrom: blue
  colorTo: red
  sdk: gradio
  sdk_version: 3.0.2
  app_file: app.py
  pinned: false
+ tags:
+ - evaluate
+ - metric
  ---

+ # Metric Card for BLEURT
+
+
+ ## Metric Description
+ BLEURT is a learned evaluation metric for Natural Language Generation. It is built using multiple phases of transfer learning: it starts from a pretrained BERT model ([Devlin et al. 2018](https://arxiv.org/abs/1810.04805)), adds another pre-training phase on synthetic data, and is finally trained on WMT human annotations.
+
+ It is possible to run BLEURT out-of-the-box or fine-tune it for your specific application (the latter is expected to perform better).
+ See the project's [README](https://github.com/google-research/bleurt#readme) for more information.
+
+ ## Intended Uses
+ BLEURT is intended to be used for evaluating text produced by language models.
+
+ ## How to Use
+
+ This metric takes as input lists of predicted sentences and reference sentences:
+
+ ```python
+ >>> predictions = ["hello there", "general kenobi"]
+ >>> references = ["hello there", "general kenobi"]
+ >>> from evaluate import load
+ >>> bleurt = load("bleurt", type="metric")
+ >>> results = bleurt.compute(predictions=predictions, references=references)
+ ```
+
+ ### Inputs
+ - **predictions** (`list` of `str`s): List of generated sentences to score.
+ - **references** (`list` of `str`s): List of references to compare to.
+ - **checkpoint** (`str`): BLEURT checkpoint, selected through the configuration name passed when loading the metric, e.g. `evaluate.load('bleurt', 'bleurt-large-512')`. Defaults to `bleurt-base-128` if not specified. The available checkpoints are: `"bleurt-tiny-128"`, `"bleurt-tiny-512"`, `"bleurt-base-128"`, `"bleurt-base-512"`, `"bleurt-large-128"`, `"bleurt-large-512"`, `"BLEURT-20-D3"`, `"BLEURT-20-D6"`, `"BLEURT-20-D12"` and `"BLEURT-20"`.
+
+ ### Output Values
+ - **scores** : a `list` of scores, one per prediction.
+
+ Output Example:
+ ```python
+ {'scores': [1.0295498371124268, 1.0445425510406494]}
+
+ ```
+
+ BLEURT's output is a number that typically falls between 0 and approximately 1, although it is not strictly bounded (the example above slightly exceeds 1). This value indicates how similar the generated text is to the reference text, with values closer to 1 representing more similar texts.
+
+ #### Values from Popular Papers
+
+ The [original BLEURT paper](https://arxiv.org/pdf/2004.04696.pdf) reported that the metric correlates better with human judgment than similar metrics such as BLEU and BERTscore.
+
+ BLEURT is used to compare models across different tasks (e.g. [table-to-text generation](https://paperswithcode.com/sota/table-to-text-generation-on-dart?metric=BLEURT)).
+
+ ### Examples
+
+ Example with the default model:
+ ```python
+ >>> predictions = ["hello there", "general kenobi"]
+ >>> references = ["hello there", "general kenobi"]
+ >>> from evaluate import load
+ >>> bleurt = load("bleurt", type="metric")
+ >>> results = bleurt.compute(predictions=predictions, references=references)
+ >>> print(results)
+ {'scores': [1.0295498371124268, 1.0445425510406494]}
+ ```
+
+ Example with the `"bleurt-base-128"` model checkpoint:
+ ```python
+ >>> predictions = ["hello there", "general kenobi"]
+ >>> references = ["hello there", "general kenobi"]
+ >>> from evaluate import load
+ >>> bleurt = load("bleurt", "bleurt-base-128", type="metric")
+ >>> results = bleurt.compute(predictions=predictions, references=references)
+ >>> print(results)
+ {'scores': [1.0295498371124268, 1.0445425510406494]}
+ ```
+
+ ## Limitations and Bias
+ The [original BLEURT paper](https://arxiv.org/pdf/2004.04696.pdf) showed that BLEURT correlates well with human judgment, but this depends on the model and language pair selected.
+
+ Furthermore, currently BLEURT only supports English-language scoring, given that it leverages models trained on English corpora. It may also reflect, to a certain extent, biases and correlations that were present in the model training data.
+
+ Finally, calculating the BLEURT metric involves downloading the BLEURT model that is used to compute the score, which can take a significant amount of time depending on the model chosen. Starting with a small checkpoint such as `bleurt-tiny-128` and testing out larger models if necessary can be a useful approach if memory or internet speed is an issue.
+
+
+ ## Citation
+ ```bibtex
+ @inproceedings{bleurt,
+   title={BLEURT: Learning Robust Metrics for Text Generation},
+   author={Thibault Sellam and Dipanjan Das and Ankur P. Parikh},
+   booktitle={ACL},
+   year={2020},
+   url={https://arxiv.org/abs/2004.04696}
+ }
+ ```
+
+ ## Further References
+ - The original [BLEURT GitHub repo](https://github.com/google-research/bleurt/)
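
The metric card above states that `compute` returns one score per prediction. A minimal sketch of turning that list into a corpus-level summary follows. `summarize_scores` is a hypothetical helper (not part of `evaluate` or BLEURT), and the score values are copied from the output example in the card rather than computed here, so no model download is needed.

```python
from statistics import mean


def summarize_scores(predictions, scores):
    """Hypothetical helper: pair each prediction with its score and add a mean."""
    if len(predictions) != len(scores):
        raise ValueError("expected exactly one score per prediction")
    return {
        # Corpus-level average of the per-sentence scores.
        "mean": mean(scores),
        # Lowest-scoring (score, prediction) pair, useful for error analysis.
        "worst": min(zip(scores, predictions)),
        # Per-sentence view in the original order.
        "per_sentence": list(zip(predictions, scores)),
    }


# Values copied from the output example in the metric card (assumed, not recomputed).
predictions = ["hello there", "general kenobi"]
results = {"scores": [1.0295498371124268, 1.0445425510406494]}

summary = summarize_scores(predictions, results["scores"])
print(round(summary["mean"], 4))
print(summary["worst"])
```

Averaging is only one possible aggregation; whether a mean is appropriate depends on how the scores are used downstream.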
app.py ADDED
@@ -0,0 +1,6 @@
+ import evaluate
+ from evaluate.utils import launch_gradio_widget
+
+
+ module = evaluate.load("bleurt")
+ launch_gradio_widget(module)
bleurt.py ADDED
@@ -0,0 +1,125 @@
+ # Copyright 2020 The HuggingFace Evaluate Authors.
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+ """ BLEURT metric. """
+
+ import os
+
+ import datasets
+ from bleurt import score  # From: git+https://github.com/google-research/bleurt.git
+
+ import evaluate
+
+
+ logger = evaluate.logging.get_logger(__name__)
+
+
+ _CITATION = """\
+ @inproceedings{bleurt,
+   title={BLEURT: Learning Robust Metrics for Text Generation},
+   author={Thibault Sellam and Dipanjan Das and Ankur P. Parikh},
+   booktitle={ACL},
+   year={2020},
+   url={https://arxiv.org/abs/2004.04696}
+ }
+ """
+
+ _DESCRIPTION = """\
+ BLEURT is a learned evaluation metric for Natural Language Generation. It is built using multiple phases of transfer learning starting from a pretrained BERT model (Devlin et al. 2018)
+ and then employing another pre-training phase using synthetic data. Finally it is trained on WMT human annotations. You may run BLEURT out-of-the-box or fine-tune
+ it for your specific application (the latter is expected to perform better).
+
+ See the project's README at https://github.com/google-research/bleurt#readme for more information.
+ """
+
+ _KWARGS_DESCRIPTION = """
+ BLEURT score.
+
+ Args:
+     `predictions` (list of str): prediction/candidate sentences
+     `references` (list of str): reference sentences
+     `checkpoint` (str): BLEURT checkpoint, given as the configuration name. Will default to bleurt-base-128 if not specified.
+
+ Returns:
+     'scores': List of scores.
+ Examples:
+
+     >>> predictions = ["hello there", "general kenobi"]
+     >>> references = ["hello there", "general kenobi"]
+     >>> bleurt = evaluate.load("bleurt")
+     >>> results = bleurt.compute(predictions=predictions, references=references)
+     >>> print([round(v, 2) for v in results["scores"]])
+     [1.03, 1.04]
+ """
+
+ CHECKPOINT_URLS = {
+     "bleurt-tiny-128": "https://storage.googleapis.com/bleurt-oss/bleurt-tiny-128.zip",
+     "bleurt-tiny-512": "https://storage.googleapis.com/bleurt-oss/bleurt-tiny-512.zip",
+     "bleurt-base-128": "https://storage.googleapis.com/bleurt-oss/bleurt-base-128.zip",
+     "bleurt-base-512": "https://storage.googleapis.com/bleurt-oss/bleurt-base-512.zip",
+     "bleurt-large-128": "https://storage.googleapis.com/bleurt-oss/bleurt-large-128.zip",
+     "bleurt-large-512": "https://storage.googleapis.com/bleurt-oss/bleurt-large-512.zip",
+     "BLEURT-20-D3": "https://storage.googleapis.com/bleurt-oss-21/BLEURT-20-D3.zip",
+     "BLEURT-20-D6": "https://storage.googleapis.com/bleurt-oss-21/BLEURT-20-D6.zip",
+     "BLEURT-20-D12": "https://storage.googleapis.com/bleurt-oss-21/BLEURT-20-D12.zip",
+     "BLEURT-20": "https://storage.googleapis.com/bleurt-oss-21/BLEURT-20.zip",
+ }
+
+
+ @evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
+ class BLEURT(evaluate.EvaluationModule):
+     def _info(self):
+
+         return evaluate.EvaluationModuleInfo(
+             description=_DESCRIPTION,
+             citation=_CITATION,
+             homepage="https://github.com/google-research/bleurt",
+             inputs_description=_KWARGS_DESCRIPTION,
+             features=datasets.Features(
+                 {
+                     "predictions": datasets.Value("string", id="sequence"),
+                     "references": datasets.Value("string", id="sequence"),
+                 }
+             ),
+             codebase_urls=["https://github.com/google-research/bleurt"],
+             reference_urls=["https://github.com/google-research/bleurt", "https://arxiv.org/abs/2004.04696"],
+         )
+
+     def _download_and_prepare(self, dl_manager):
+
+         # check that config name specifies a valid BLEURT model
+         if self.config_name == "default":
+             logger.warning(
+                 "Using default BLEURT-Base checkpoint for sequence maximum length 128. "
+                 "You can use a bigger model for better results with e.g.: evaluate.load('bleurt', 'bleurt-large-512')."
+             )
+             self.config_name = "bleurt-base-128"
+
+         if self.config_name.lower() in CHECKPOINT_URLS:
+             checkpoint_name = self.config_name.lower()
+
+         elif self.config_name.upper() in CHECKPOINT_URLS:
+             checkpoint_name = self.config_name.upper()
+
+         else:
+             raise KeyError(
+                 f"{self.config_name} model not found. You should supply the name of a model checkpoint for bleurt in {CHECKPOINT_URLS.keys()}"
+             )
+
+         # download the model checkpoint specified by self.config_name and set up the scorer
+         model_path = dl_manager.download_and_extract(CHECKPOINT_URLS[checkpoint_name])
+         self.scorer = score.BleurtScorer(os.path.join(model_path, checkpoint_name))
+
+     def _compute(self, predictions, references):
+         scores = self.scorer.score(references=references, candidates=predictions)
+         return {"scores": scores}
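
The name lookup in `_download_and_prepare` can be exercised without downloading a model. The sketch below re-implements it as a standalone function under stated assumptions: `resolve_checkpoint` and `DEFAULT_CHECKPOINT` are hypothetical names that do not appear in the commit, and the table is a three-entry subset of `CHECKPOINT_URLS` copied from the file above.

```python
# Subset of CHECKPOINT_URLS from bleurt.py, copied so the sketch is self-contained.
CHECKPOINT_URLS = {
    "bleurt-tiny-128": "https://storage.googleapis.com/bleurt-oss/bleurt-tiny-128.zip",
    "bleurt-base-128": "https://storage.googleapis.com/bleurt-oss/bleurt-base-128.zip",
    "BLEURT-20": "https://storage.googleapis.com/bleurt-oss-21/BLEURT-20.zip",
}

# Hypothetical constant; the commit hard-codes this value inside the method.
DEFAULT_CHECKPOINT = "bleurt-base-128"


def resolve_checkpoint(config_name):
    """Hypothetical helper mirroring the lookup order in _download_and_prepare."""
    if config_name == "default":
        config_name = DEFAULT_CHECKPOINT
    # Older checkpoints use lowercase keys, the BLEURT-20 family uses uppercase keys,
    # so try the lowercased name first and the uppercased name second.
    for candidate in (config_name.lower(), config_name.upper()):
        if candidate in CHECKPOINT_URLS:
            return candidate
    raise KeyError(f"{config_name} model not found. Choose one of {sorted(CHECKPOINT_URLS)}")


print(resolve_checkpoint("default"))          # bleurt-base-128
print(resolve_checkpoint("Bleurt-Tiny-128"))  # bleurt-tiny-128
print(resolve_checkpoint("bleurt-20"))        # BLEURT-20
```

Because the lookup normalizes case in both directions, mixed-case names resolve to either family of checkpoints, while unknown names fail before any download is attempted.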
requirements.txt ADDED
@@ -0,0 +1,4 @@
+ # TODO: fix github to release
+ git+https://github.com/huggingface/evaluate.git@b6e6ed7f3e6844b297bff1b43a1b4be0709b9671
+ datasets~=2.0
+ git+https://github.com/google-research/bleurt.git