Spaces: xu1998hz/sescore (Build error)

Michael Saxon committed
Commit eba4079 · Parent(s): 2a708e1

updated with the nice app interface and information

Files changed:
- app.py +69 −1
- description.md +59 −0
app.py
CHANGED
@@ -1,5 +1,73 @@

Previously, app.py simply loaded the module and called the stock `launch_gradio_widget` helper from `evaluate.utils`; this commit replaces that helper with a local copy that adds a SEScore header and renders `description.md` as the page footer. New contents:

import evaluate
import logging
import sys
from pathlib import Path

from evaluate.utils import (
    infer_gradio_input_types,
    json_to_string_type,
    parse_gradio_data,
    parse_readme,
    parse_test_cases,
)

logger = logging.getLogger(__name__)


def launch_gradio_widget(metric):
    """Launches `metric` widget with Gradio."""
    try:
        import gradio as gr
    except ImportError as error:
        logger.error("To create a metric widget with Gradio make sure gradio is installed.")
        raise error

    local_path = Path(sys.path[0])
    # If there are several input types, use the first as default.
    if isinstance(metric.features, list):
        (feature_names, feature_types) = zip(*metric.features[0].items())
    else:
        (feature_names, feature_types) = zip(*metric.features.items())
    gradio_input_types = infer_gradio_input_types(feature_types)

    def compute(data):
        return metric.compute(**parse_gradio_data(data, gradio_input_types))

    header_html = '''<div style="max-width:800px; margin:auto; float:center; margin-top:0; margin-bottom:0; padding:0;">
<img src="https://huggingface.co/spaces/xu1998hz/sescore/resolve/main/img/logo_sescore.png" style="margin:0; padding:0; margin-top:-10px; margin-bottom:-50px;">
</div>
<h2 style='margin-top: 5pt; padding-top:10pt;'>About <i>SEScore</i></h2>

<p><b>SEScore</b> is a reference-based text-generation evaluation metric that requires no pre-human-annotated error data,
described in our paper <a href="https://arxiv.org/abs/2210.05035"><b>"Not All Errors are Equal: Learning Text Generation Metrics using
Stratified Error Synthesis"</b></a> from EMNLP 2022.</p>

<p>Its effectiveness over prior methods like BLEU and COMET has been demonstrated on a diverse set of language generation tasks, including
translation, captioning, and web text generation. <a href="https://twitter.com/LChoshen/status/1580136005654700033">Readers have even described SEScore as "one unsupervised evaluation to rule them all"</a>
and we are very excited to share it with you!</p>

<h2 style='margin-top: 10pt; padding-top:0;'>Try it yourself!</h2>
<p>Provide sample (gold) reference text and (model output) predicted text below and see how SEScore rates them! It is most performant
in a relative ranking setting, so in general <b>it will rank better predictions higher than worse ones.</b> Providing useful
absolute numbers based on SEScore is an ongoing direction of investigation.</p>
'''.replace('\n', ' ')

    tail_markdown = parse_readme(local_path / "description.md")

    iface = gr.Interface(
        fn=compute,
        inputs=gr.inputs.Dataframe(
            headers=feature_names,
            col_count=len(feature_names),
            row_count=2,
            datatype=json_to_string_type(gradio_input_types),
        ),
        outputs=gr.outputs.Textbox(label=metric.name),
        description=header_html,
        article=tail_markdown,
        # TODO: load test cases and use them to populate examples
        # examples=[parse_test_cases(test_cases, feature_names, gradio_input_types)]
    )

    iface.launch()


module = evaluate.load("xu1998hz/sescore")
launch_gradio_widget(module)
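The widget builds its input table from the metric's feature schema: `zip(*metric.features.items())` transposes a name→type mapping into two parallel tuples, one of column names (the Dataframe headers) and one of types (used to infer Gradio input types). A minimal sketch of that idiom; the `features` dict here is a hypothetical stand-in for `metric.features`:

```python
# Hypothetical stand-in for `metric.features`: maps input column names
# to their feature types, as in launch_gradio_widget above.
features = {"predictions": "string", "references": "string"}

# zip(*dict.items()) transposes (name, type) pairs into two parallel tuples.
feature_names, feature_types = zip(*features.items())

print(feature_names)  # ('predictions', 'references')
print(feature_types)  # ('string', 'string')
```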
description.md
ADDED
@@ -0,0 +1,59 @@

## Installation and usage

```bash
pip install -r requirements.txt
```

Minimal example (evaluating English text generation):

```python
import evaluate

sescore = evaluate.load("xu1998hz/sescore")
score = sescore.compute(
    references=['sescore is a simple but effective next-generation text evaluation metric'],
    predictions=['sescore is simple effective text evaluation metric for next generation']
)
```

*SEScore* compares a list of references (gold translations or generated-output examples) with a same-length list of candidate generated samples. Currently the output range is learned, so scores are most useful in relative ranking scenarios rather than for absolute comparisons. We are producing a series of rescaling options to make absolute comparisons based on SEScore more meaningful.
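Since the scores are only calibrated for relative ranking, a typical pattern is to score several candidates against the same reference and sort them. A sketch of that pattern; `score_fn` is a hypothetical stand-in for a real call to `sescore.compute`, and the toy word-overlap scorer below exists only to make the sketch runnable:

```python
# Rank candidate generations by a SEScore-style quality score (higher = better).
def rank_candidates(reference, candidates, score_fn):
    # Sort best-first by the candidate's score against the shared reference.
    return sorted(candidates, key=lambda c: score_fn(reference, c), reverse=True)

def toy_score(reference, candidate):
    # Illustrative stand-in for the metric: fraction of reference words
    # that also appear in the candidate. NOT how SEScore actually scores.
    ref_words = set(reference.split())
    return len(ref_words & set(candidate.split())) / len(ref_words)

ranked = rank_candidates(
    "sescore is a simple but effective text evaluation metric",
    ["sescore is a simple effective text evaluation metric",
     "completely unrelated sentence"],
    toy_score,
)
print(ranked[0])  # the closer candidate ranks first
```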
### Available pre-trained models

Currently, the following language/model pairs are available:

| Language | Pretraining data | Pretrained model |
|----------|------------------|------------------|
| English  | MT               | [xu1998hz/sescore_english_mt](https://huggingface.co/xu1998hz/sescore_english_mt) |
| German   | MT               | [xu1998hz/sescore_german_mt](https://huggingface.co/xu1998hz/sescore_german_mt) |
| English  | webNLG17         | [xu1998hz/sescore_english_webnlg17](https://huggingface.co/xu1998hz/sescore_english_webnlg17) |
| English  | COCO captions    | [xu1998hz/sescore_english_coco](https://huggingface.co/xu1998hz/sescore_english_coco) |

Please contact the repo maintainer, Wenda Xu, to add your models!

## Limitations

*SEScore* is trained on synthetic, in-domain data. Although this data is generated to simulate user-relevant errors such as deletion and spurious insertion, it may be limited in its ability to simulate humanlike errors. Model applicability is also domain-specific (e.g., the COCO-caption-trained model will perform better on captioning than the MT-trained one).

We are in the process of producing and benchmarking general language-level *SEScore* variants.

## Citation

If you find our work useful, please cite the following:

```bibtex
@inproceedings{xu-etal-2022-not,
  title = {Not All Errors are Equal: Learning Text Generation Metrics using Stratified Error Synthesis},
  author = {Xu, Wenda and Tuan, Yi-lin and Lu, Yujie and Saxon, Michael and Li, Lei and Wang, William Yang},
  booktitle = {Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing},
  month = dec,
  year = {2022},
  url = {https://arxiv.org/abs/2210.05035}
}
```

## Acknowledgements

The work of the [COMET](https://github.com/Unbabel/COMET) maintainers at [Unbabel](https://duckduckgo.com/?t=ffab&q=unbabel&ia=web) has been instrumental in producing SEScore.