Michael Saxon committed on
Commit eba4079
1 Parent(s): 2a708e1

updated with the nice app interface and information

Files changed (2)
  1. app.py +69 -1
  2. description.md +59 -0
app.py CHANGED
@@ -1,5 +1,73 @@
  import evaluate
- from evaluate.utils import launch_gradio_widget
+ import sys
+ import logging
+ from pathlib import Path
+ from evaluate.utils import infer_gradio_input_types, json_to_string_type, parse_readme, parse_gradio_data, parse_test_cases
+
+ # module-level logger used by the gradio import check below
+ logger = logging.getLogger(__name__)
+
+
+ def launch_gradio_widget(metric):
+     """Launches `metric` widget with Gradio."""
+
+     try:
+         import gradio as gr
+     except ImportError as error:
+         logger.error("To create a metric widget with Gradio make sure gradio is installed.")
+         raise error
+
+     local_path = Path(sys.path[0])
+     # if there are several input types, use first as default.
+     if isinstance(metric.features, list):
+         (feature_names, feature_types) = zip(*metric.features[0].items())
+     else:
+         (feature_names, feature_types) = zip(*metric.features.items())
+     gradio_input_types = infer_gradio_input_types(feature_types)
+
+     def compute(data):
+         return metric.compute(**parse_gradio_data(data, gradio_input_types))
+
+     header_html = '''<div style="max-width:800px; margin:auto; float:center; margin-top:0; margin-bottom:0; padding:0;">
+ <img src="https://huggingface.co/spaces/xu1998hz/sescore/resolve/main/img/logo_sescore.png" style="margin:0; padding:0; margin-top:-10px; margin-bottom:-50px;">
+ </div>
+ <h2 style='margin-top: 5pt; padding-top:10pt;'>About <i>SEScore</i></h2>
+
+ <p><b>SEScore</b> is a reference-based text-generation evaluation metric that requires no pre-existing human-annotated error data,
+ described in our paper <a href="https://arxiv.org/abs/2210.05035"><b>"Not All Errors are Equal: Learning Text Generation Metrics using
+ Stratified Error Synthesis"</b></a> from EMNLP 2022.</p>
+
+ <p>Its effectiveness over prior methods like BLEU and COMET has been demonstrated on a diverse set of language generation tasks, including
+ translation, captioning, and web text generation. <a href="https://twitter.com/LChoshen/status/1580136005654700033">Readers have even described SEScore as "one unsupervised evaluation to rule them all"</a>
+ and we are very excited to share it with you!</p>
+
+ <h2 style='margin-top: 10pt; padding-top:0;'>Try it yourself!</h2>
+ <p>Provide sample (gold) reference text and (model output) predicted text below and see how SEScore rates them! It is most performant
+ in a relative ranking setting, so in general <b>it will rank better predictions higher than worse ones.</b> Providing useful
+ absolute numbers based on SEScore is an ongoing direction of investigation.</p>
+ '''.replace('\n',' ')
+
+     tail_markdown = parse_readme(local_path / "description.md")
+
+     iface = gr.Interface(
+         fn=compute,
+         inputs=gr.inputs.Dataframe(
+             headers=feature_names,
+             col_count=len(feature_names),
+             row_count=2,
+             datatype=json_to_string_type(gradio_input_types),
+         ),
+         outputs=gr.outputs.Textbox(label=metric.name),
+         description=header_html,
+         #title=f"SEScore Metric Usage Example",
+         article=tail_markdown,
+         # TODO: load test cases and use them to populate examples
+         # examples=[parse_test_cases(test_cases, feature_names, gradio_input_types)]
+     )
+
+     print(dir(iface))
+
+     iface.launch()
+
+
  
  module = evaluate.load("xu1998hz/sescore")
  launch_gradio_widget(module)
description.md ADDED
@@ -0,0 +1,59 @@
+ ## Installation and usage
+
+ ```bash
+ pip install -r requirements.txt
+ ```
+
+ Minimal example (evaluating English text generation):
+ ```python
+ import evaluate
+ sescore = evaluate.load("xu1998hz/sescore")
+ score = sescore.compute(
+     references=['sescore is a simple but effective next-generation text evaluation metric'],
+     predictions=['sescore is simple effective text evaluation metric for next generation']
+ )
+ ```
+
+ *SEScore* compares a list of references (gold translation/generated output examples) against a same-length list of candidate generated samples. Currently, the output range is learned, and scores are most useful in relative ranking scenarios rather than as absolute comparisons. We are producing a series of rescaling options to make absolute SEScore-based scoring more useful.
+
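In practice, relative ranking amounts to scoring two candidates against the same reference and comparing the results. Below is a minimal sketch of that usage, reusing only the `compute` call shown above; the field name under which the score is returned is not specified here, so the full result dicts are printed for inspection.

```python
import evaluate

# Load SEScore exactly as in the minimal example above.
sescore = evaluate.load("xu1998hz/sescore")

reference = ["sescore is a simple but effective next-generation text evaluation metric"]
closer = ["sescore is a simple and effective next-generation text evaluation metric"]
farther = ["text evaluation metric"]

# Score each candidate against the same reference; in a relative-ranking
# setting the closer candidate should receive the higher score of the two.
result_closer = sescore.compute(references=reference, predictions=closer)
result_farther = sescore.compute(references=reference, predictions=farther)

print("closer candidate:", result_closer)
print("farther candidate:", result_farther)
```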
+ ### Available pre-trained models
+
+ Currently, the following language/model pairs are available:
+
+ | Language | Pretraining data | Pretrained model link |
+ |----------|------------------|-----------------------|
+ | English  | MT               | [xu1998hz/sescore_english_mt](https://huggingface.co/xu1998hz/sescore_english_mt) |
+ | German   | MT               | [xu1998hz/sescore_german_mt](https://huggingface.co/xu1998hz/sescore_german_mt) |
+ | English  | webNLG17         | [xu1998hz/sescore_english_webnlg17](https://huggingface.co/xu1998hz/sescore_english_webnlg17) |
+ | English  | COCO captions    | [xu1998hz/sescore_english_coco](https://huggingface.co/xu1998hz/sescore_english_coco) |
+
+ Please contact repo maintainer Wenda Xu to add your models!
+
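The checkpoints listed above are ordinary model repos on the Hugging Face Hub. As an illustration only, here is a hedged sketch of fetching one of them locally with `huggingface_hub`; how the metric itself selects a checkpoint is not shown in this commit, so treat this purely as a way to retrieve the weights.

```python
from huggingface_hub import snapshot_download

# Download the German MT checkpoint from the table above into the local HF cache.
# Note: wiring a specific checkpoint into evaluate.load("xu1998hz/sescore")
# is not documented in this commit; this only fetches the files.
local_dir = snapshot_download(repo_id="xu1998hz/sescore_german_mt")
print("checkpoint files downloaded to:", local_dir)
```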
+ ## Limitations
+
+ *SEScore* is trained in-domain on synthetic data.
+ Although this data is generated to simulate user-relevant errors like deletions and spurious insertions, it may be limited in its ability to simulate humanlike errors.
+ Model applicability is domain-specific (e.g., a COCO-caption-trained model will perform better for captioning than an MT-trained one).
+
+ We are in the process of producing and benchmarking general language-level *SEScore* variants.
+
+ ## Citation
+
+ If you find our work useful, please cite the following:
+
+ ```bibtex
+ @inproceedings{xu-etal-2022-not,
+     title={Not All Errors are Equal: Learning Text Generation Metrics using Stratified Error Synthesis},
+     author={Xu, Wenda and Tuan, Yi-lin and Lu, Yujie and Saxon, Michael and Li, Lei and Wang, William Yang},
+     booktitle={Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing},
+     month={dec},
+     year={2022},
+     url={https://arxiv.org/abs/2210.05035}
+ }
+ ```
+
+ ## Acknowledgements
+
+ The work of the [COMET](https://github.com/Unbabel/COMET) maintainers at [Unbabel](https://duckduckgo.com/?t=ffab&q=unbabel&ia=web) has been instrumental in producing SEScore.