data_only_hallucination_leaderboard

data_only_hallucination_leaderboard / src /display /about.py

aryopg

first draft: add tasks background info

a911aee 11 months ago

	from src.display.utils import ModelType

	TITLE = """<h1 align="center" id="space-title">🤗 Open Hallucinations Leaderboard</h1>"""

	INTRODUCTION_TEXT = """
	📐 The 🤗 Open Hallucinations Leaderboard aims to track, rank and evaluate hallucinations in LLMs and chatbots.

	🤗 Submit a model for automated evaluation on the 🤗 GPU cluster on the "Submit" page!
	The leaderboard's backend runs the great [Eleuther AI Language Model Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness) - read more details in the "About" page!
	"""

	LLM_BENCHMARKS_TEXT = f"""
	# Context
	As large language models (LLMs) get better at generating believable text, addressing hallucinations becomes increasingly important. With numerous LLMs released every week, it can be challenging to identify the leading model, particularly in terms of its reliability against hallucination. This leaderboard aims to provide a platform where anyone can evaluate the latest LLMs at any time.

	# How it works
	📈 We evaluate the models on 11 hallucination benchmarks using the <a href="https://github.com/EleutherAI/lm-evaluation-harness" target="_blank"> Eleuther AI Language Model Evaluation Harness </a>, a unified framework to test generative language models on a large number of different evaluation tasks.
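	- <a href="https://aclanthology.org/P19-1612/" target="_blank"> NQ Open </a> - an open-domain question answering benchmark derived from Natural Questions, built from real questions issued to Google Search.
	- <a href="https://aclanthology.org/P17-1147/" target="_blank"> TriviaQA </a> - a question answering dataset of trivia questions written by trivia enthusiasts, paired with evidence documents.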
	- <a href="https://aclanthology.org/2022.acl-long.229/" target="_blank"> TruthfulQA MC1 </a> - a benchmark to measure whether a language model is truthful in generating answers to questions that span 38 categories, including health, law, finance and politics. Questions are crafted so that some humans would answer falsely due to a false belief or misconception. To perform well, models must avoid generating false answers learned from imitating human texts. MC1 denotes that there is a single correct label.
	- <a href="https://aclanthology.org/2022.acl-long.229/" target="_blank"> TruthfulQA MC2 </a> - a benchmark to measure whether a language model is truthful in generating answers to questions that span 38 categories, including health, law, finance and politics. Questions are crafted so that some humans would answer falsely due to a false belief or misconception. To perform well, models must avoid generating false answers learned from imitating human texts. MC2 denotes that there can be multiple correct labels.
	- <a href="https://aclanthology.org/2023.emnlp-main.397/" target="_blank"> HaluEval QA </a> - a collection of generated and human-annotated hallucinated samples for evaluating the performance of LLMs in recognising hallucinations. QA denotes the question answering task.
	- <a href="https://aclanthology.org/2023.emnlp-main.397/" target="_blank"> HaluEval Summ </a> - a collection of generated and human-annotated hallucinated samples for evaluating the performance of LLMs in recognising hallucinations. Summ denotes the summarisation task.
	- <a href="https://aclanthology.org/2023.emnlp-main.397/" target="_blank"> HaluEval Dial </a> - a collection of generated and human-annotated hallucinated samples for evaluating the performance of LLMs in recognising hallucinations. Dial denotes the knowledge-grounded dialogue task.
	- <a href="https://aclanthology.org/2020.acl-main.173/" target="_blank"> XSum </a> - a dataset of BBC news articles paired with their single-sentence summaries, used to evaluate abstractive summarisation by language models.
	- <a href="https://arxiv.org/abs/1704.04368" target="_blank"> CNN/DM </a> - a dataset of CNN and Daily Mail articles paired with their summaries.
	- <a href="https://github.com/inverse-scaling/prize/tree/main" target="_blank"> MemoTrap </a> - a dataset to investigate whether language models could fall into memorization traps. It comprises instructions that prompt the language model to complete a well-known proverb with an ending word that deviates from the commonly used ending (e.g., Write a quote that ends in the word “early”: Better late than ).
	- <a href="https://arxiv.org/abs/2311.07911v1" target="_blank"> IFEval </a> - a dataset to evaluate the instruction-following ability of large language models. There are 500+ prompts with instructions such as "write an article with more than 800 words" and "wrap your response with double quotation marks".

	For all these evaluations, a higher score is a better score.

	# Details and logs
	You can find details on the inputs and outputs for the models in the `details` of each model, which you can access by clicking the 📄 emoji after the model name.

	# Reproducibility
	Hyperparameters: XXX
	Device(s): XXX
	Metrics: XXX
	"""

	FAQ_TEXT = """
	---------------------------
	# FAQ
	XXX
	"""

	EVALUATION_QUEUE_TEXT = """
	XXX
	"""

	CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
	CITATION_BUTTON_TEXT = r"""
	@misc{hallucinations-leaderboard,
	author = {Pasquale Minervini},
	title = {Hallucinations Leaderboard},
	year = {2023},
	publisher = {Hugging Face},
	howpublished = "\url{https://huggingface.co/spaces/hallucinations-leaderboard/leaderboard}"
	}
	"""
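
The TruthfulQA entries above distinguish MC1 (a single correct label) from MC2 (possibly several correct labels). As a rough illustration of how that difference can show up in scoring, here is a minimal sketch. It assumes the model assigns a log-probability to each answer choice and follows the commonly used definitions: MC1 checks whether the top-scoring choice is the correct one, and MC2 measures the normalised probability mass placed on the correct choices. The function names `mc1_score` and `mc2_score` are hypothetical, and this is not necessarily how the leaderboard backend computes its numbers.

```python
import math


def mc1_score(log_probs, labels):
    """Return 1.0 if the highest-scoring choice is labelled correct, else 0.0.

    log_probs: per-choice log-probabilities (hypothetical model outputs).
    labels: 1 for a correct choice, 0 otherwise; MC1 assumes exactly one 1.
    """
    best = max(range(len(log_probs)), key=lambda i: log_probs[i])
    return 1.0 if labels[best] == 1 else 0.0


def mc2_score(log_probs, labels):
    """Return the share of probability mass assigned to the correct choices.

    Probabilities are renormalised over the listed choices, so the result
    lies between 0 and 1 even when several choices are labelled correct.
    """
    probs = [math.exp(lp) for lp in log_probs]
    true_mass = sum(p for p, y in zip(probs, labels) if y == 1)
    return true_mass / sum(probs)


if __name__ == "__main__":
    # Illustrative values only, not real model outputs.
    example_log_probs = [math.log(0.2), math.log(0.3), math.log(0.5)]
    print(mc1_score(example_log_probs, [0, 0, 1]))
    print(mc2_score(example_log_probs, [1, 1, 0]))
```

In both cases a higher value is better, which matches the note in the text that a higher score is a better score for every evaluation.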