Not able to reproduce benchmark metrics

#2
by akjindal53244 - opened

Hi, congrats on the launch of the Llama-Spark model!
I am trying to reproduce some of the benchmarks, but I am getting different metrics from the ones reported in the model card.

For example:

Math Hard Benchmark

Here is the command I am running with the lm-eval-harness repo:

accelerate launch -m lm_eval --model hf --model_args "pretrained=arcee-ai/Llama-Spark" --tasks leaderboard_math_hard --batch_size 32 --apply_chat_template --fewshot_as_multiturn --num_fewshot 4

Output:

Running generate_until requests: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 169/169 [18:07<00:00,  6.43s/it]
hf (pretrained=arcee-ai/Llama-Spark), gen_kwargs: (None), limit: None, num_fewshot: 4, batch_size: 32
|                    Tasks                    |Version|Filter|n-shot|  Metric   |   |Value |   |Stderr|
|---------------------------------------------|-------|------|-----:|-----------|---|-----:|---|-----:|
| - leaderboard_math_algebra_hard             |      1|none  |     4|exact_match|↑  |0.0033|Β±  |0.0033|
| - leaderboard_math_counting_and_prob_hard   |      1|none  |     4|exact_match|↑  |0.0163|Β±  |0.0115|
| - leaderboard_math_geometry_hard            |      1|none  |     4|exact_match|↑  |0.0000|Β±  |0.0000|
|leaderboard_math_hard                        |N/A    |none  |     4|exact_match|↑  |0.0053|Β±  |0.0020|
| - leaderboard_math_intermediate_algebra_hard|      1|none  |     4|exact_match|↑  |0.0036|Β±  |0.0036|
| - leaderboard_math_num_theory_hard          |      1|none  |     4|exact_match|↑  |0.0065|Β±  |0.0065|
| - leaderboard_math_prealgebra_hard          |      1|none  |     4|exact_match|↑  |0.0104|Β±  |0.0073|
| - leaderboard_math_precalculus_hard         |      1|none  |     4|exact_match|↑  |0.0000|Β±  |0.0000|

|       Groups        |Version|Filter|n-shot|  Metric   |   |Value |   |Stderr|
|---------------------|-------|------|-----:|-----------|---|-----:|---|-----:|
|leaderboard_math_hard|N/A    |none  |     4|exact_match|↑  |0.0053|Β±  | 0.002|

BBH

accelerate launch -m lm_eval --model hf --model_args "pretrained=arcee-ai/Llama-Spark,dtype=bfloat16" --tasks leaderboard_bbh --batch_size 32 --apply_chat_template --fewshot_as_multiturn --num_fewshot 3

Output:

Running loglikelihood requests: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 31710/31710 [20:19<00:00, 26.01it/s]

hf (pretrained=arcee-ai/Llama-Spark,dtype=bfloat16), gen_kwargs: (None), limit: None, num_fewshot: 3, batch_size: 32
|                          Tasks                           |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|----------------------------------------------------------|-------|------|-----:|--------|---|-----:|---|-----:|
|leaderboard_bbh                                           |N/A    |none  |     3|acc_norm|↑  |0.5046|Β±  |0.0063|
| - leaderboard_bbh_boolean_expressions                    |      0|none  |     3|acc_norm|↑  |0.8320|Β±  |0.0237|
| - leaderboard_bbh_causal_judgement                       |      0|none  |     3|acc_norm|↑  |0.5668|Β±  |0.0363|
| - leaderboard_bbh_date_understanding                     |      0|none  |     3|acc_norm|↑  |0.4640|Β±  |0.0316|
| - leaderboard_bbh_disambiguation_qa                      |      0|none  |     3|acc_norm|↑  |0.5400|Β±  |0.0316|
| - leaderboard_bbh_formal_fallacies                       |      0|none  |     3|acc_norm|↑  |0.5480|Β±  |0.0315|
| - leaderboard_bbh_geometric_shapes                       |      0|none  |     3|acc_norm|↑  |0.3800|Β±  |0.0308|
| - leaderboard_bbh_hyperbaton                             |      0|none  |     3|acc_norm|↑  |0.6880|Β±  |0.0294|
| - leaderboard_bbh_logical_deduction_five_objects         |      0|none  |     3|acc_norm|↑  |0.3720|Β±  |0.0306|
| - leaderboard_bbh_logical_deduction_seven_objects        |      0|none  |     3|acc_norm|↑  |0.3080|Β±  |0.0293|
| - leaderboard_bbh_logical_deduction_three_objects        |      0|none  |     3|acc_norm|↑  |0.5960|Β±  |0.0311|
| - leaderboard_bbh_movie_recommendation                   |      0|none  |     3|acc_norm|↑  |0.4640|Β±  |0.0316|
| - leaderboard_bbh_navigate                               |      0|none  |     3|acc_norm|↑  |0.6320|Β±  |0.0306|
| - leaderboard_bbh_object_counting                        |      0|none  |     3|acc_norm|↑  |0.3400|Β±  |0.0300|
| - leaderboard_bbh_penguins_in_a_table                    |      0|none  |     3|acc_norm|↑  |0.4315|Β±  |0.0411|
| - leaderboard_bbh_reasoning_about_colored_objects        |      0|none  |     3|acc_norm|↑  |0.6280|Β±  |0.0306|
| - leaderboard_bbh_ruin_names                             |      0|none  |     3|acc_norm|↑  |0.6360|Β±  |0.0305|
| - leaderboard_bbh_salient_translation_error_detection    |      0|none  |     3|acc_norm|↑  |0.5360|Β±  |0.0316|
| - leaderboard_bbh_snarks                                 |      0|none  |     3|acc_norm|↑  |0.6180|Β±  |0.0365|
| - leaderboard_bbh_sports_understanding                   |      0|none  |     3|acc_norm|↑  |0.7680|Β±  |0.0268|
| - leaderboard_bbh_temporal_sequences                     |      0|none  |     3|acc_norm|↑  |0.4280|Β±  |0.0314|
| - leaderboard_bbh_tracking_shuffled_objects_five_objects |      0|none  |     3|acc_norm|↑  |0.2880|Β±  |0.0287|
| - leaderboard_bbh_tracking_shuffled_objects_seven_objects|      0|none  |     3|acc_norm|↑  |0.2400|Β±  |0.0271|
| - leaderboard_bbh_tracking_shuffled_objects_three_objects|      0|none  |     3|acc_norm|↑  |0.3160|Β±  |0.0295|
| - leaderboard_bbh_web_of_lies                            |      0|none  |     3|acc_norm|↑  |0.5080|Β±  |0.0317|

|    Groups     |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|---------------|-------|------|-----:|--------|---|-----:|---|-----:|
|leaderboard_bbh|N/A    |none  |     3|acc_norm|↑  |0.5046|Β±  |0.0063|

I am getting quite different results on both the math_hard and BBH benchmarks. Can you share the commands to reproduce the same/similar metrics? TIA! :)

Arcee AI org • edited Aug 7

The current "leaderboard" benchmark task in lm-eval-harness has some limitations. It tends to produce inconsistent results that don't align closely with the actual leaderboard. When evaluating models with this task, I recommend focusing on relative performance improvements rather than absolute scores. The results can vary significantly depending on factors such as whether you use the grouped leaderboard task or select sub-tasks manually, the batch size, and other parameters. I've noted as much in the README:

Please note that these scores are consistently higher than those on the OpenLLM Leaderboard, and should be compared for their relative performance increase, not weighed against the leaderboard.
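
To make that concrete, here are the two invocation styles being compared in this thread, side by side: the grouped leaderboard task as used in the script below, versus a manually selected sub-task with chat templating as in the commands above. The flags are simply collected from this thread; the differences alone (task grouping, chat template and few-shot handling, dtype, batch size) can shift the reported numbers.

```bash
# Grouped leaderboard run (as in the script below): no chat template, float16, batch size 4
accelerate launch -m lm_eval --model hf \
  --model_args "trust_remote_code=True,pretrained=arcee-ai/Llama-Spark,dtype=float16" \
  --tasks leaderboard --batch_size 4

# Manually selected sub-task (the BBH command above): chat template, multiturn few-shot, bfloat16, batch size 32
accelerate launch -m lm_eval --model hf \
  --model_args "pretrained=arcee-ai/Llama-Spark,dtype=bfloat16" \
  --tasks leaderboard_bbh --batch_size 32 \
  --apply_chat_template --fewshot_as_multiturn --num_fewshot 3
```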

That said, these results were produced with (I believe) this commit of lm-eval-harness: 42dc244

using this script:

#!/bin/bash

# Install required packages
pip install antlr4-python3-runtime==4.11 immutabledict langdetect

MODEL_PATHS=( # Each entry can be a local directory OR a Hugging Face repo; list as many models as you want to test and they will run sequentially.
arcee-ai/Llama-Spark
)

tasks=(
"leaderboard"
)

for MODEL_PATH in "${MODEL_PATHS[@]}"; do
  MODEL_NAME=$(basename "$MODEL_PATH")
  RESULTS_DIR="./results/$MODEL_NAME"
  mkdir -p "$RESULTS_DIR"
  
  MODEL_ARGS="trust_remote_code=True,pretrained=$MODEL_PATH,dtype=float16"
  
  for TASK in "${tasks[@]}"; do
    accelerate launch -m lm_eval --model hf --model_args "$MODEL_ARGS" --tasks "$TASK" --batch_size 4 --output_path "$RESULTS_DIR/$TASK.json"
  done
done
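
If you want to match the environment more closely, here is a minimal sketch for pinning the harness to that commit before running the script (assuming the standard EleutherAI lm-evaluation-harness repository; 42dc244 is the commit referenced above):

```bash
# Sketch: install lm-eval-harness at the referenced commit, then run the script above
git clone https://github.com/EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness
git checkout 42dc244
pip install -e .
```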
Arcee AI org

I'll also rerun them here to verify; happy to update the model card if the initial results were incorrect.

Thank you @Crystalcareai for rerunning. Kindly share the results once you have them ready :)

Crystalcareai changed discussion status to closed
