Xuanli's update: Add reproducibility section
src/display/about.py CHANGED (+37 -3)
@@ -42,9 +42,43 @@ For all these evaluations, a higher score is a better score.
 - You can find details on the input/outputs for the models in the `details` of each model, that you can access by clicking the 📄 emoji after the model name
 
 # Reproducibility
-
-
-
+To reproduce our results, here are the commands you can run, using [this script](https://huggingface.co/spaces/hallucinations-leaderboard/leaderboard/blob/main/backend-cli.py): `python backend-cli.py`.
+
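As an illustration of that first route, a minimal sketch, assuming you work from a fresh clone of the Space and that its `requirements.txt` covers the backend dependencies (the diff itself only gives the final command):

```python
# Minimal sketch: clone the Space and launch the evaluation backend.
# Assumption: the Space's requirements.txt lists what backend-cli.py needs;
# only the final `python backend-cli.py` command comes from the diff above.
import subprocess

subprocess.run(
    ["git", "clone",
     "https://huggingface.co/spaces/hallucinations-leaderboard/leaderboard"],
    check=True,
)
subprocess.run(["pip", "install", "-r", "requirements.txt"],
               cwd="leaderboard", check=True)
subprocess.run(["python", "backend-cli.py"], cwd="leaderboard", check=True)
```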
+Alternatively, if you're interested in evaluating a specific task with a particular model, you can use [this script](https://github.com/EleutherAI/lm-evaluation-harness/tree/b281b0921b636bc36ad05c0b0b0763bd6dd43463) of the Eleuther AI Harness:
+`python main.py --model=hf-causal-experimental --model_args="pretrained=<your_model>,revision=<your_model_revision>"`
+` --tasks=<task_list> --num_fewshot=<n_few_shot> --batch_size=1 --output_path=<output_path>` (Note that you may need to add tasks from [here](https://huggingface.co/spaces/hallucinations-leaderboard/leaderboard/tree/main/src/backend/tasks) to [this folder](https://github.com/EleutherAI/lm-evaluation-harness/tree/b281b0921b636bc36ad05c0b0b0763bd6dd43463/lm_eval/tasks))
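For instance, a sketch of that harness call with the placeholders filled in; the model name, revision, task, and output path here are illustrative values, not the leaderboard's own configuration:

```python
# Illustrative harness invocation; run from the lm-evaluation-harness checkout.
# "gpt2", "main", the task, and the output path are example values only.
import subprocess

subprocess.run(
    [
        "python", "main.py",
        "--model=hf-causal-experimental",
        "--model_args=pretrained=gpt2,revision=main",
        "--tasks=nq_open",    # 64-shot, per the task list below
        "--num_fewshot=64",
        "--batch_size=1",
        "--output_path=results/gpt2_nq_open.json",
    ],
    check=True,
)
```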
+
+The total batch size we get for models which fit on one A100 node is 8 (8 GPUs * 1). If you don't use parallelism, adapt your batch size to fit. You can expect results to vary slightly for different batch sizes because of padding.
+
+
+The tasks and few-shot parameters are as follows (a sketch for looping over them comes after the list):
+- NQ Open: 64-shot, *nq_open* (`exact_match`)
+- NQ Open 8: 8-shot, *nq8* (`exact_match`)
+- TriviaQA: 64-shot, *triviaqa* (`exact_match`)
+- TriviaQA 8: 8-shot, *tqa8* (`exact_match`)
+- TruthfulQA MC1: 0-shot, *truthfulqa_mc1* (`acc`)
+- TruthfulQA MC2: 0-shot, *truthfulqa_mc2* (`acc`)
+- HaluEval QA: 0-shot, *halueval_qa* (`em`)
+- HaluEval Summ: 0-shot, *halueval_summarization* (`em`)
+- HaluEval Dial: 0-shot, *halueval_dialogue* (`em`)
+- XSum: 2-shot, *xsum* (`rougeLsum`)
+- CNN/DM: 2-shot, *cnndm* (`rougeLsum`)
+- MemoTrap: 0-shot, *memo-trap* (`acc`)
+- IFEval: 0-shot, *ifeval* (`prompt_level_strict_acc`)
+- SelfCheckGPT: 0-shot, *selfcheckgpt* (``)
+- FEVER: 16-shot, *fever10* (`acc`)
+- SQuADv2: 4-shot, *squadv2* (`squad_v2`)
+- TrueFalse: 8-shot, *truefalse_cieacf* (`acc`)
+- FaithDial: 8-shot, *faithdial_hallu* (`acc`)
+- RACE: 0-shot, *race* (`acc`)
+
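A minimal sketch of driving all of these in one loop: the task names and shot counts are copied from the list above, while the model arguments and output layout are placeholders to substitute:

```python
# Sketch: run every leaderboard task with its few-shot setting from the list
# above. MODEL_ARGS is a placeholder, copied verbatim from the command shown
# earlier; each task's metric is computed by the task itself.
import os
import subprocess

TASK_FEWSHOT = {
    "nq_open": 64, "nq8": 8, "triviaqa": 64, "tqa8": 8,
    "truthfulqa_mc1": 0, "truthfulqa_mc2": 0,
    "halueval_qa": 0, "halueval_summarization": 0, "halueval_dialogue": 0,
    "xsum": 2, "cnndm": 2, "memo-trap": 0, "ifeval": 0, "selfcheckgpt": 0,
    "fever10": 16, "squadv2": 4, "truefalse_cieacf": 8,
    "faithdial_hallu": 8, "race": 0,
}

MODEL_ARGS = "pretrained=<your_model>,revision=<your_model_revision>"

os.makedirs("results", exist_ok=True)
for task, n_shot in TASK_FEWSHOT.items():
    subprocess.run(
        [
            "python", "main.py",
            "--model=hf-causal-experimental",
            f"--model_args={MODEL_ARGS}",
            f"--tasks={task}",
            f"--num_fewshot={n_shot}",
            "--batch_size=1",
            f"--output_path=results/{task}.json",
        ],
        check=True,
    )
```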
+## Icons
+- {ModelType.PT.to_str(" : ")} model: new, base models, trained on a given corpus
+- {ModelType.FT.to_str(" : ")} model: pretrained models fine-tuned on more data
+Specific fine-tune subcategories (more adapted to chat):
+- {ModelType.IFT.to_str(" : ")} model: instruction fine-tunes, which are models fine-tuned specifically on datasets of task instructions
+- {ModelType.RL.to_str(" : ")} model: reinforcement fine-tunes, which usually change the model loss a bit with an added policy.
+If there is no icon, we have not uploaded the information on the model yet; feel free to open an issue with the model information!
 """
 
 FAQ_TEXT = """