natolambert committed · Commit 7e0e569 · Parent(s): 62907d5

update to reasoning

Files changed:
- src/constants.py (+2, -1)
- src/md.py (+2, -1)
- src/utils.py (+7, -0)
src/constants.py CHANGED

@@ -31,6 +31,7 @@ example_counts = {
     "mt-bench-easy": 28,
     "mt-bench-med": 40,
     "mt-bench-hard": 37,
+    "math-prm": 984, # actual length 447, upweighting to be equal to code
     "refusals-dangerous": 100,
     "refusals-offensive": 100,
     "llmbar-natural": 100,
@@ -53,5 +54,5 @@ subset_mapping = {
     "Chat": ["alpacaeval-easy", "alpacaeval-length", "alpacaeval-hard", "mt-bench-easy", "mt-bench-med"],
     "Chat Hard": ["mt-bench-hard", "llmbar-natural", "llmbar-adver-neighbor", "llmbar-adver-GPTInst", "llmbar-adver-GPTOut", "llmbar-adver-manual"],
     "Safety": ["refusals-dangerous", "refusals-offensive", "xstest-should-refuse", "xstest-should-respond", "donotanswer"],
-    "Reasoning": ["hep-cpp", "hep-go", "hep-java", "hep-js", "hep-python", "hep-rust"]
+    "Reasoning": ["math-prm", "hep-cpp", "hep-go", "hep-java", "hep-js", "hep-python", "hep-rust"]
 }
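Side note on the 984: the six HumanEvalPack subsets contribute 6 × 164 = 984 prompts, so weighting "math-prm" at 984 makes the math subset count as much as all the code subsets combined under per-prompt weighting. Below is a minimal sketch of how these constants plausibly combine into a section score; the section_score helper and the sample accuracies are illustrative, not taken from the repo.

# Minimal sketch, assuming a section score is the count-weighted mean of
# per-subset accuracies (helper name and sample scores are hypothetical).
example_counts = {
    "math-prm": 984,  # actual length 447, upweighted to 6 * 164
    "hep-cpp": 164, "hep-go": 164, "hep-java": 164,
    "hep-js": 164, "hep-python": 164, "hep-rust": 164,
}
subset_mapping = {
    "Reasoning": ["math-prm", "hep-cpp", "hep-go", "hep-java",
                  "hep-js", "hep-python", "hep-rust"],
}

def section_score(section: str, subset_scores: dict) -> float:
    # Count-weighted mean: each subset contributes proportionally to its
    # (possibly upweighted) prompt count.
    subsets = subset_mapping[section]
    total = sum(example_counts[s] for s in subsets)
    return sum(subset_scores[s] * example_counts[s] for s in subsets) / total

# math-prm now carries the same total weight as the six code subsets:
scores = {"math-prm": 0.80, "hep-cpp": 0.60, "hep-go": 0.60, "hep-java": 0.60,
          "hep-js": 0.60, "hep-python": 0.60, "hep-rust": 0.60}
print(section_score("Reasoning", scores))  # 0.7, i.e. (0.80 + 0.60) / 2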
src/md.py CHANGED

@@ -8,7 +8,7 @@ We average over 4 core sections (per prompt weighting):
 1. **Chat**: Includes the easy chat subsets (alpacaeval-easy, alpacaeval-length, alpacaeval-hard, mt-bench-easy, mt-bench-medium)
 2. **Chat Hard**: Includes the hard chat subsets (mt-bench-hard, llmbar-natural, llmbar-adver-neighbor, llmbar-adver-GPTInst, llmbar-adver-GPTOut, llmbar-adver-manual)
 3. **Safety**: Includes the safety subsets (refusals-dangerous, refusals-offensive, xstest-should-refuse, xstest-should-respond, do not answer)
-4. **Reasoning**: Includes the code subsets (hep-cpp, hep-go, hep-java, hep-js, hep-python, hep-rust)
+4. **Reasoning**: Includes the code and math subsets (math-prm, hep-cpp, hep-go, hep-java, hep-js, hep-python, hep-rust)
 5. **Classic Sets**: Includes the test sets ([anthropic_helpful](https://huggingface.co/datasets/Anthropic/hh-rlhf), [anthropic_hhh](https://huggingface.co/datasets/HuggingFaceH4/hhh_alignment), [mtbench_human](https://huggingface.co/datasets/lmsys/mt_bench_human_judgments), [shp](https://huggingface.co/datasets/stanfordnlp/SHP), [summarize](https://huggingface.co/datasets/openai/summarize_from_feedback))

 We include multiple types of reward models in this evaluation:
@@ -42,6 +42,7 @@ Total number of the prompts is: 2538, filtered from 4676.
 | xstest-should-refuse | 450, 250 | False response dataset (see [paper](https://arxiv.org/abs/2308.01263)) |
 | xstest-should-respond | 450, 154 | False refusal dataset (see [paper](https://arxiv.org/abs/2308.01263)) |
 | do not answer | 939, 136 | [Prompts which responsible LLMs do not answer](https://huggingface.co/datasets/LibrAI/do-not-answer) |
+| math-prm | 447 | Human references vs. model error from OpenAI's Let's Verify Step by Step |
 | hep-cpp | 164 | C++ code revisions (See [dataset](https://huggingface.co/datasets/bigcode/humanevalpack) or [paper](https://arxiv.org/abs/2308.07124)) |
 | hep-go | 164 | Go code |
 | hep-java | 164 | Java code |
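The copy above says the leaderboard averages over 4 core sections, with per-prompt weighting applied inside each section. Below is a sketch of that final aggregation step, under the assumption that "average over 4 core sections" means an unweighted mean of the four core section scores, with Classic Sets reported outside that average; overall_score is a hypothetical name, not from the repo.

CORE_SECTIONS = ["Chat", "Chat Hard", "Safety", "Reasoning"]

def overall_score(section_scores: dict) -> float:
    # Assumption: the headline number is a simple mean of the four core
    # section scores; per-prompt weighting happens within each section.
    return sum(section_scores[s] for s in CORE_SECTIONS) / len(CORE_SECTIONS)

print(overall_score({"Chat": 0.90, "Chat Hard": 0.50,
                     "Safety": 0.80, "Reasoning": 0.70}))  # 0.725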
src/utils.py CHANGED

@@ -88,6 +88,13 @@ def load_all_data(data_repo, subdir:str, subsubsets=False): # use HF api to p
     if "summarize_prompted" in cols:
         df = df.drop(columns=["summarize_prompted"])
         cols.remove("summarize_prompted")
+    # remove pku_better and pku_safer (removed from the leaderboard)
+    if "pku_better" in cols:
+        df = df.drop(columns=["pku_better"])
+        cols.remove("pku_better")
+    if "pku_safer" in cols:
+        df = df.drop(columns=["pku_safer"])
+        cols.remove("pku_safer")

     # round
     df[cols] = df[cols].round(2)
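The new pku_better / pku_safer branches repeat the summarize_prompted pattern above. If more subsets get retired from the leaderboard, the repetition could be folded into one loop; below is a hypothetical refactor (drop_removed_subsets does not exist in the repo), behavior-equivalent to the if-blocks in the diff.

import pandas as pd

# Columns retired from the leaderboard display (hypothetical constant).
REMOVED_SUBSETS = ["summarize_prompted", "pku_better", "pku_safer"]

def drop_removed_subsets(df: pd.DataFrame, cols: list) -> pd.DataFrame:
    # Drop each retired column if present, and keep `cols` in sync
    # (mutated in place, matching the original code's behavior).
    for name in REMOVED_SUBSETS:
        if name in cols:
            df = df.drop(columns=[name])
            cols.remove(name)
    return df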