Not able to reproduce benchmark metrics

#2
by akjindal53244 - opened

Hi, congrats on the launch of the Llama-Spark model!
I am trying to reproduce some of the benchmarks, but I am getting different metrics from the ones reported in the model card.

For example:

Math Hard Benchmark

Here is the command I am running with the lm-eval-harness repo:

accelerate launch -m lm_eval --model hf --model_args "pretrained=arcee-ai/Llama-Spark" --tasks leaderboard_math_hard --batch_size 32 --apply_chat_template --fewshot_as_multiturn --num_fewshot 4

Output:

Running generate_until requests: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 169/169 [18:07<00:00,  6.43s/it]
hf (pretrained=arcee-ai/Llama-Spark), gen_kwargs: (None), limit: None, num_fewshot: 4, batch_size: 32
|                    Tasks                    |Version|Filter|n-shot|  Metric   |   |Value |   |Stderr|
|---------------------------------------------|-------|------|-----:|-----------|---|-----:|---|-----:|
| - leaderboard_math_algebra_hard             |      1|none  |     4|exact_match|↑  |0.0033|Β±  |0.0033|
| - leaderboard_math_counting_and_prob_hard   |      1|none  |     4|exact_match|↑  |0.0163|Β±  |0.0115|
| - leaderboard_math_geometry_hard            |      1|none  |     4|exact_match|↑  |0.0000|Β±  |0.0000|
|leaderboard_math_hard                        |N/A    |none  |     4|exact_match|↑  |0.0053|Β±  |0.0020|
| - leaderboard_math_intermediate_algebra_hard|      1|none  |     4|exact_match|↑  |0.0036|Β±  |0.0036|
| - leaderboard_math_num_theory_hard          |      1|none  |     4|exact_match|↑  |0.0065|Β±  |0.0065|
| - leaderboard_math_prealgebra_hard          |      1|none  |     4|exact_match|↑  |0.0104|Β±  |0.0073|
| - leaderboard_math_precalculus_hard         |      1|none  |     4|exact_match|↑  |0.0000|Β±  |0.0000|

|       Groups        |Version|Filter|n-shot|  Metric   |   |Value |   |Stderr|
|---------------------|-------|------|-----:|-----------|---|-----:|---|-----:|
|leaderboard_math_hard|N/A    |none  |     4|exact_match|↑  |0.0053|Β±  | 0.002|

BBH

accelerate launch -m lm_eval --model hf --model_args "pretrained=arcee-ai/Llama-Spark,dtype=bfloat16" --tasks leaderboard_bbh --batch_size 32 --apply_chat_template --fewshot_as_multiturn --num_fewshot 3

Output:

Running loglikelihood requests: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 31710/31710 [20:19<00:00, 26.01it/s]

hf (pretrained=arcee-ai/Llama-Spark,dtype=bfloat16), gen_kwargs: (None), limit: None, num_fewshot: 3, batch_size: 32
|                          Tasks                           |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|----------------------------------------------------------|-------|------|-----:|--------|---|-----:|---|-----:|
|leaderboard_bbh                                           |N/A    |none  |     3|acc_norm|↑  |0.5046|Β±  |0.0063|
| - leaderboard_bbh_boolean_expressions                    |      0|none  |     3|acc_norm|↑  |0.8320|Β±  |0.0237|
| - leaderboard_bbh_causal_judgement                       |      0|none  |     3|acc_norm|↑  |0.5668|Β±  |0.0363|
| - leaderboard_bbh_date_understanding                     |      0|none  |     3|acc_norm|↑  |0.4640|Β±  |0.0316|
| - leaderboard_bbh_disambiguation_qa                      |      0|none  |     3|acc_norm|↑  |0.5400|Β±  |0.0316|
| - leaderboard_bbh_formal_fallacies                       |      0|none  |     3|acc_norm|↑  |0.5480|Β±  |0.0315|
| - leaderboard_bbh_geometric_shapes                       |      0|none  |     3|acc_norm|↑  |0.3800|Β±  |0.0308|
| - leaderboard_bbh_hyperbaton                             |      0|none  |     3|acc_norm|↑  |0.6880|Β±  |0.0294|
| - leaderboard_bbh_logical_deduction_five_objects         |      0|none  |     3|acc_norm|↑  |0.3720|Β±  |0.0306|
| - leaderboard_bbh_logical_deduction_seven_objects        |      0|none  |     3|acc_norm|↑  |0.3080|Β±  |0.0293|
| - leaderboard_bbh_logical_deduction_three_objects        |      0|none  |     3|acc_norm|↑  |0.5960|Β±  |0.0311|
| - leaderboard_bbh_movie_recommendation                   |      0|none  |     3|acc_norm|↑  |0.4640|Β±  |0.0316|
| - leaderboard_bbh_navigate                               |      0|none  |     3|acc_norm|↑  |0.6320|Β±  |0.0306|
| - leaderboard_bbh_object_counting                        |      0|none  |     3|acc_norm|↑  |0.3400|Β±  |0.0300|
| - leaderboard_bbh_penguins_in_a_table                    |      0|none  |     3|acc_norm|↑  |0.4315|Β±  |0.0411|
| - leaderboard_bbh_reasoning_about_colored_objects        |      0|none  |     3|acc_norm|↑  |0.6280|Β±  |0.0306|
| - leaderboard_bbh_ruin_names                             |      0|none  |     3|acc_norm|↑  |0.6360|Β±  |0.0305|
| - leaderboard_bbh_salient_translation_error_detection    |      0|none  |     3|acc_norm|↑  |0.5360|Β±  |0.0316|
| - leaderboard_bbh_snarks                                 |      0|none  |     3|acc_norm|↑  |0.6180|Β±  |0.0365|
| - leaderboard_bbh_sports_understanding                   |      0|none  |     3|acc_norm|↑  |0.7680|Β±  |0.0268|
| - leaderboard_bbh_temporal_sequences                     |      0|none  |     3|acc_norm|↑  |0.4280|Β±  |0.0314|
| - leaderboard_bbh_tracking_shuffled_objects_five_objects |      0|none  |     3|acc_norm|↑  |0.2880|Β±  |0.0287|
| - leaderboard_bbh_tracking_shuffled_objects_seven_objects|      0|none  |     3|acc_norm|↑  |0.2400|Β±  |0.0271|
| - leaderboard_bbh_tracking_shuffled_objects_three_objects|      0|none  |     3|acc_norm|↑  |0.3160|Β±  |0.0295|
| - leaderboard_bbh_web_of_lies                            |      0|none  |     3|acc_norm|↑  |0.5080|Β±  |0.0317|

|    Groups     |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|---------------|-------|------|-----:|--------|---|-----:|---|-----:|
|leaderboard_bbh|N/A    |none  |     3|acc_norm|↑  |0.5046|Β±  |0.0063|

I am getting quite different results on both the math_hard and BBH benchmarks. Can you share the commands to reproduce the same/similar metrics? TIA! :)

Arcee AI org • edited Aug 7

The current "leaderboard" benchmark task in lm-eval-harness has some limitations. It tends to produce inconsistent results that don't align closely with the actual leaderboard. When evaluating models with this task, I recommend focusing on relative performance improvements rather than absolute scores. The results can vary significantly depending on factors such as whether you use the grouped leaderboard task or select sub-tasks manually, the batch size, and other parameters. I've noted as much in the README:

Please note that these scores are consistently higher than those on the OpenLLM Leaderboard, and should be compared for their relative performance increase, not weighed against the leaderboard.
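
To make that concrete, here are the two invocation styles being compared in this thread, side by side: the grouped leaderboard task as used in the script below, versus a manually selected sub-task with chat templating as in the commands above. The flags are simply collected from this thread; the differences alone (task grouping, chat template and few-shot handling, dtype, batch size) can shift the reported numbers.

```bash
# Grouped leaderboard run (as in the script below): no chat template, float16, batch size 4
accelerate launch -m lm_eval --model hf \
  --model_args "trust_remote_code=True,pretrained=arcee-ai/Llama-Spark,dtype=float16" \
  --tasks leaderboard --batch_size 4

# Manually selected sub-task (the BBH command above): chat template, multiturn few-shot, bfloat16, batch size 32
accelerate launch -m lm_eval --model hf \
  --model_args "pretrained=arcee-ai/Llama-Spark,dtype=bfloat16" \
  --tasks leaderboard_bbh --batch_size 32 \
  --apply_chat_template --fewshot_as_multiturn --num_fewshot 3
```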

That said, these results were produced with (I believe) this commit of lm-eval-harness: 42dc244

using this script:

#!/bin/bash

# Install required packages
pip install antlr4-python3-runtime==4.11 immutabledict langdetect

MODEL_PATHS=( # Each entry can be a local directory OR a Hugging Face repo; list as many models as you want to test and they will run sequentially.
arcee-ai/Llama-Spark
)

tasks=(
"leaderboard"
)

for MODEL_PATH in "${MODEL_PATHS[@]}"; do
  MODEL_NAME=$(basename "$MODEL_PATH")
  RESULTS_DIR="./results/$MODEL_NAME"
  mkdir -p "$RESULTS_DIR"
  
  MODEL_ARGS="trust_remote_code=True,pretrained=$MODEL_PATH,dtype=float16"
  
  for TASK in "${tasks[@]}"; do
    accelerate launch -m lm_eval --model hf --model_args "$MODEL_ARGS" --tasks "$TASK" --batch_size 4 --output_path "$RESULTS_DIR/$TASK.json"
  done
done
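
If you want to match the environment more closely, here is a minimal sketch for pinning the harness to that commit before running the script (assuming the standard EleutherAI lm-evaluation-harness repository; 42dc244 is the commit referenced above):

```bash
# Sketch: install lm-eval-harness at the referenced commit, then run the script above
git clone https://github.com/EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness
git checkout 42dc244
pip install -e .
```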
Arcee AI org

I'll also rerun them here to verify; happy to update the model card if the initial results were incorrect.

Thank you @Crystalcareai for rerunning. Kindly share the results once you have them ready :)

Crystalcareai changed discussion status to closed
