# URIAL-Bench / constants.py
from pathlib import Path
# Directories where model evaluation requests are stored
DIR_OUTPUT_REQUESTS = Path("requested_models")
EVAL_REQUESTS_PATH = Path("eval_requests")
##########################
# Text definitions #
##########################
banner_url = "https://huggingface.co/spaces/WildEval/WildBench-Leaderboard/resolve/main/%E2%80%8Eleaderboard_logo_v2.png" # the same repo here.
BANNER = f'<div style="display: flex; justify-content: space-around;"><img src="{banner_url}" alt="Banner" style="width: 40vw; min-width: 300px; max-width: 600px;"> </div>'
TITLE = "<html> <head> <style> h1 {text-align: center;} </style> </head> <body> <h1> URIAL Bench </h1> </body> </html>"
INTRODUCTION_TEXT = """
# URIAL Bench (Evaluating Base LLMs with URIAL on MT-Bench)
[🛜 Website](https://allenai.github.io/re-align/index.html) | [💻 GitHub](https://github.com/Re-Align/URIAL) | [📖 Paper](https://arxiv.org/abs/2312.01552) | [🐦 Tweet 1](https://x.com/billyuchenlin/status/1759541978881311125?s=20) | [🐦 Tweet 2](https://x.com/billyuchenlin/status/1762206077566013505?s=20)
> URIAL Bench tests the capacity of base LLMs for alignment without introducing the factors of fine-tuning (learning rate, data, etc.), which are hard to control for fair comparisons.
Specifically, we use [URIAL](https://github.com/Re-Align/URIAL/tree/main/run_scripts/mt-bench#run-urial-inference) to align a base LLM, and evaluate its performance on MT-Bench.
- [🐑 URIAL](https://arxiv.org/abs/2312.01552) uses K=3 constant [examples](https://github.com/Re-Align/URIAL/blob/main/urial_prompts/inst_1k_v4.help.txt.md) to align BASE LLMs with in-context learning.
- [📊 MT-Bench](https://huggingface.co/spaces/lmsys/mt-bench) is a small, curated benchmark of two-turn instruction-following tasks in 8 categories.
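
To illustrate the idea of aligning a base LLM with a few constant in-context examples, here is a minimal sketch. The function name `build_icl_prompt`, the preamble, the `# Query:` / `# Answer:` markers, and the example pairs are all hypothetical placeholders; the actual URIAL prompt is the one linked above.

```python
# Hypothetical in-context prompt builder (an assumption, not the official URIAL template).
from typing import List, Tuple

NEWLINE = chr(10)  # the newline character


def build_icl_prompt(preamble: str, examples: List[Tuple[str, str]], query: str) -> str:
    # Start with a short preamble, then show each constant (query, answer) pair.
    blocks = [preamble]
    for example_query, example_answer in examples:
        blocks.append("# Query:" + NEWLINE + example_query)
        blocks.append("# Answer:" + NEWLINE + example_answer)
    # End with the new query and an open answer slot for the base LLM to complete.
    blocks.append("# Query:" + NEWLINE + query)
    blocks.append("# Answer:")
    return (NEWLINE * 2).join(blocks)


# K=3 made-up examples, reused unchanged for every new query.
examples = [
    ("What is 2 + 2?", "2 + 2 equals 4."),
    ("Name a primary color.", "Red is a primary color."),
    ("Is the Earth flat?", "No, the Earth is roughly spherical."),
]
prompt = build_icl_prompt(
    "Below are conversations with a helpful assistant.",
    examples,
    "How do I boil an egg?",
)
print(prompt)
```

Because the examples are constant, the same prefix is reused for every query, so no model weights are updated.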
"""
CITATION_TEXT = """@inproceedings{
Lin2024ReAlign,
title={The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning},
author={Bill Yuchen Lin and Abhilasha Ravichander and Ximing Lu and Nouha Dziri and Melanie Sclar and Khyathi Chandu and Chandra Bhagavatula and Yejin Choi},
booktitle={International Conference on Learning Representations},
year={2024},
url={https://arxiv.org/abs/2312.01552}
}
@misc{zheng2023judging,
title={Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena},
author={Lianmin Zheng and Wei-Lin Chiang and Ying Sheng and Siyuan Zhuang and Zhanghao Wu and Yonghao Zhuang and Zi Lin and Zhuohan Li and Dacheng Li and Eric P. Xing and Hao Zhang and Joseph E. Gonzalez and Ion Stoica},
year={2023},
eprint={2306.05685},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
"""
METRICS_TAB_TEXT = """
Here you will find details about the different metrics reported in our leaderboard.
## Metrics
🎯 Win Rate and Elo Rating are popular metrics for evaluating LLMs' general capabilities by comparing them with a strong reference model. [WIP]
### Win Rate vs. ChatGPT
[WIP]
```
Example:
```
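
Until this section is finalized, the following is a minimal sketch of how a win rate against a reference model could be computed. Counting a tie as half a win is an assumption, and the function name `win_rate` is hypothetical; this is not necessarily how the leaderboard computes it.

```python
from typing import List


def win_rate(outcomes: List[str]) -> float:
    # Each outcome is "win", "tie", or "loss" for one comparison against the reference model.
    if not outcomes:
        raise ValueError("need at least one comparison")
    # Assumption: a tie counts as half a win.
    score = sum(1.0 if o == "win" else 0.5 if o == "tie" else 0.0 for o in outcomes)
    return score / len(outcomes)


rate = win_rate(["win", "loss", "tie", "win"])
print(rate)  # 0.625
```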
### Elo Rating
[WIP]
```
Example:
```
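
Until this section is finalized, the following is a minimal sketch of the standard Elo update for one pairwise comparison. The K-factor of 32, the scale of 400, and the starting rating of 1000 are conventional defaults assumed here for illustration, not values confirmed for this leaderboard.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    # Probability that A beats B under the Elo model (scale of 400 is an assumption).
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    # score_a is 1.0 if A wins, 0.5 for a tie, and 0.0 if A loses.
    expected_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - expected_a)
    # B's update mirrors A's, so the sum of both ratings is preserved.
    new_b = rating_b - k * (score_a - expected_a)
    return new_a, new_b


new_a, new_b = update_elo(1000.0, 1000.0, 1.0)
print(new_a, new_b)  # 1016.0 984.0
```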
## How to reproduce our results
The WildBench Leaderboard is an ongoing effort to benchmark open-source and open-access LLMs.
Along with the Leaderboard, we're open-sourcing the codebase used to run these evaluations.
For more details, head over to our repo at: https://github.com/WildEval/WildBench-Leaderboard
P.S. We'd love to know which other models you'd like us to benchmark next. Contributions are more than welcome! ♥️
## Benchmark datasets
| Dataset | Domain | Source | Size | License |
|-----------------------------------------------------------------|--------------------------|--------------|------|---------|
| [WildBench](https://huggingface.co/datasets/WildEval/WildBench) | in-the-wild user queries | [WildChat]() | XXX | XXX |
"""