pminervini
committed on
Merge branch 'main' of https://huggingface.co/spaces/pminervini/hallucinations-leaderboard into main
src/backend/tasks/selfcheckgpt/README.md
CHANGED
@@ -1,3 +1,66 @@
+# Task-name
+
+### Paper
+
+Title: `SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models`
+
+Abstract: `Generative Large Language Models (LLMs) such as GPT-3 are capable of generating highly fluent responses to a wide variety of user prompts. However, LLMs are known to hallucinate facts and make non-factual statements which can undermine trust in their output. Existing fact-checking approaches either require access to the output probability distribution (which may not be available for systems such as ChatGPT) or external databases that are interfaced via separate, often complex, modules. In this work, we propose "SelfCheckGPT", a simple sampling-based approach that can be used to fact-check the responses of black-box models in a zero-resource fashion, i.e. without an external database. SelfCheckGPT leverages the simple idea that if an LLM has knowledge of a given concept, sampled responses are likely to be similar and contain consistent facts. However, for hallucinated facts, stochastically sampled responses are likely to diverge and contradict one another. We investigate this approach by using GPT-3 to generate passages about individuals from the WikiBio dataset, and manually annotate the factuality of the generated passages. We demonstrate that SelfCheckGPT can: i) detect non-factual and factual sentences; and ii) rank passages in terms of factuality. We compare our approach to several baselines and show that our approach has considerably higher AUC-PR scores in sentence-level hallucination detection and higher correlation scores in passage-level factuality assessment compared to grey-box methods.`
+
+`task.py` in this folder uses the original
+
+Homepage: [selfcheckgpt](https://github.com/potsawee/selfcheckgpt)
+
+
+### Citation
+
+```
+@article{manakul2023selfcheckgpt,
+title={Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models},
+author={Manakul, Potsawee and Liusie, Adian and Gales, Mark JF},
+journal={arXiv preprint arXiv:2303.08896},
+year={2023}
+}
+```
+
+#### Tasks
+
+* `selfcheckgpt`: This task uses generative models to generate Wikipedia passages based on given starting topics/words. The generated passages are then measured by [selfcheckgpt](https://github.com/potsawee/selfcheckgpt). The default metric is `SelfCheckNgram`, which does not need a GPU. Other metrics are `SelfCheckBERTScore`, `SelfCheckMQAG` and `SelfCheckNLI`, which are model-based scores. You can change the metric by changing the environment variables.
+
+The results `avg-selfcheckgpt` and `max-selfcheckgpt` are the average and maximum sentence-level `selfcheckgpt` scores for the generated passage (with temperature=0.0). The lower the score, the less likely the passage is to contain hallucinations.
+```
+export SELFCHECKGPTTYPE=SelfCheckBERTScore # or SelfCheckMQAG, SelfCheckNLI
+```
+
+Since model-based metrics are slow when running on CPU, you can switch the running device to GPU with:
+```
+export SELFCHECKGPTDEVICE=cuda
+```
+#### Dependencies for successful running
+```
+pip install spacy
+pip install selfcheckgpt
+python -m spacy download en_core_web_sm
+```
+### Checklist
+
+For adding novel benchmarks/datasets to the library:
+* [ ] Is the task an existing benchmark in the literature?
+* [x] Have you referenced the original paper that introduced the task?
+* [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
+
+
+If other tasks on this dataset are already supported:
+* [x] Is the "Main" variant of this task clearly denoted?
+* [x] Have you provided a short sentence in a README on what each new variant adds / evaluates?
+* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
+
+
+
+
+
+
+
+
 # SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models
 
 In order to run selfcheckgpt evaluation, these dependencies should be installed.
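The environment-variable switches described in the README can be sketched roughly as follows. This is a minimal illustration, not the code in `task.py`: the helper name `resolve_selfcheckgpt_config`, the `cpu` fallback device, and the validation step are assumptions made for the example. Only the variable names `SELFCHECKGPTTYPE` and `SELFCHECKGPTDEVICE` and the four metric names come from the README, and the default metric is left as a parameter that here follows the README's stated default.

```python
import os

# Metric names as listed in the README.
KNOWN_METRICS = ("SelfCheckNgram", "SelfCheckBERTScore", "SelfCheckMQAG", "SelfCheckNLI")


def resolve_selfcheckgpt_config(env=None, default_type="SelfCheckNgram", default_device="cpu"):
    """Hypothetical helper: pick the metric and device from environment variables.

    `env` is any mapping; it defaults to os.environ. The "cpu" fallback is an
    assumption for this sketch, not a value taken from task.py.
    """
    env = os.environ if env is None else env
    metric = env.get("SELFCHECKGPTTYPE", default_type)
    device = env.get("SELFCHECKGPTDEVICE", default_device)
    if metric not in KNOWN_METRICS:
        # Fail early on a misspelled metric name instead of later in scoring.
        raise ValueError(f"Unknown SELFCHECKGPTTYPE {metric!r}; expected one of {KNOWN_METRICS}")
    return metric, device


if __name__ == "__main__":
    # Equivalent to running with both variables exported in the shell.
    print(resolve_selfcheckgpt_config({"SELFCHECKGPTTYPE": "SelfCheckNLI", "SELFCHECKGPTDEVICE": "cuda"}))
```

Passing a plain dictionary instead of `os.environ` keeps the sketch easy to try without touching the shell environment.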
src/backend/tasks/selfcheckgpt/task.py
CHANGED
@@ -17,14 +17,14 @@ class SelfCheckGpt(Task):
     VERSION = 0.0
     DATASET_PATH = "potsawee/wiki_bio_gpt3_hallucination"
     DATASET_NAME = None
-
+    OUTPUT_TYPE = 'generate_until'
     def __init__(self, data_dir=None, cache_dir=None, download_mode=None, config=None):
         super().__init__(data_dir=data_dir, cache_dir=cache_dir, download_mode=download_mode, config=config)
         self.generation_kwargs = {"temperature": 0.0, "do_sample": False}
         self.generation_kwargs_sampling_number = 5 # the number of sampling for self-consistence
         self.generation_kwargs_sampling = {"temperature": 1.0, "do_sample": False}
 
-        self.selfcheckgpt_type = os.environ.get('SELFCHECKGPTTYPE', '
+        self.selfcheckgpt_type = os.environ.get('SELFCHECKGPTTYPE', 'SelfCheckNLI')
         self.selfcheckgpt_device = os.environ.get('SELFCHECKGPTDEVICE', DEVICE)
         self.selfcheckgpt_nlp = spacy.load("en_core_web_sm")
 
@@ -92,12 +92,19 @@ class SelfCheckGpt(Task):
         elif self.selfcheckgpt_type == 'SelfCheckBERTScore':
             selfcheckgpt_scores = self.selfcheckgpt.predict(sentences=sentences, sampled_passages=other_responses)
         elif self.selfcheckgpt_type == 'SelfCheckMQAG':
-            selfcheckgpt_scores = self.selfcheckgpt.predict(
+            selfcheckgpt_scores = self.selfcheckgpt.predict(
+                sentences = sentences,
+                passage = response_temperature_0,
+                sampled_passages = other_responses,
+                num_questions_per_sent = 5, # number of questions to be drawn
+                scoring_method = 'bayes_with_alpha', # options = 'counting', 'bayes', 'bayes_with_alpha'
+                beta1 = 0.8, beta2 = 0.8, # additional params depending on scoring_method
+            )
         elif self.selfcheckgpt_type == 'SelfCheckNLI':
-            selfcheckgpt_scores = self.selfcheckgpt.predict(
-
-
-
+            selfcheckgpt_scores = self.selfcheckgpt.predict(
+                sentences = sentences,
+                sampled_passages = other_responses,
+            )
 
         selfcheckgpt_scores_avg = sum(selfcheckgpt_scores) / len(selfcheckgpt_scores) if len(selfcheckgpt_scores) > 0 else 0
         selfcheckgpt_scores_max = max(selfcheckgpt_scores)
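The final two lines of the hunk guard the average against an empty score list, while `max` on an empty sequence would raise `ValueError`. A minimal sketch of an aggregation that guards both values is shown below. The helper name `aggregate_scores` and the choice of `0.0` as the fallback maximum are assumptions for illustration (the fallback mirrors the existing average); the result keys follow the names used in the README.

```python
from typing import Dict, Iterable


def aggregate_scores(scores: Iterable[float]) -> Dict[str, float]:
    """Hypothetical helper: reduce sentence-level scores to passage-level results.

    Returns 0.0 for both values when no sentence scores are available; the
    0.0 fallback for the maximum is an assumption, not behaviour from task.py.
    """
    # Copy to a list so emptiness checks also work for array-like inputs.
    values = [float(s) for s in scores]
    if not values:
        return {"avg-selfcheckgpt": 0.0, "max-selfcheckgpt": 0.0}
    return {
        "avg-selfcheckgpt": sum(values) / len(values),
        "max-selfcheckgpt": max(values),
    }


if __name__ == "__main__":
    print(aggregate_scores([0.25, 0.75]))
    print(aggregate_scores([]))
```

Whether an empty passage should score `0.0` or be skipped entirely is a design choice; returning `0.0` treats it as "no evidence of hallucination", which may not suit every evaluation setup.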