Passkey evaluation on FlashInfer backend

#16, opened by joejose2728

I'm trying to evaluate this model on the passkey dataset at 32k and 125k token context lengths.
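For reference, each run is launched via lm-eval's Python API, roughly like this (a minimal sketch; the model path and the "passkey_32k" task name are placeholders, not the exact ones used here):

import lm_eval

# Hypothetical invocation: model path and task name are placeholders.
# max_model_len must cover the longest prompt (the 125k-token case).
results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=/path/to/model,"
        "max_model_len=131072,"
        "gpu_memory_utilization=0.9"
    ),
    tasks=["passkey_32k"],
)
print(results["results"])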

Setup:

  • FlashInfer backend
  • Updated vLLM's llama.py to support interleaved sliding-window attention (SWA); a rough sketch follows this list
  • SWA window pattern: [full, 32768, 32768, 32768]
  • lm-eval 0.4.4
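Conceptually, the llama.py change just picks a per-layer window from a repeating pattern (a rough sketch of the idea, not the actual patch; SWA_PATTERN and get_layer_window are made-up names):

# Pattern repeats every 4 decoder layers: one full-attention layer,
# then three layers with a 32768-token sliding window.
# None is vLLM's convention for "no sliding window", i.e. full attention.
SWA_PATTERN = [None, 32768, 32768, 32768]

def get_layer_window(layer_idx: int) -> int | None:
    """Hypothetical helper: sliding-window size for a given layer."""
    return SWA_PATTERN[layer_idx % len(SWA_PATTERN)]

# Each layer's attention would then be built with
# sliding_window=get_layer_window(layer_idx).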

Results:

Output from the model for 125k context:

"resps": [["<unk><unk><unk><unk><unk><unk><unk><unk><unk><unk>"]],"filtered_resps": ["<unk><unk><unk><unk><unk><unk><unk><unk><unk><unk>"],

Output from the model for 32k context:

SWA: [full, 32768, 32768, 32768]

"resps": [
[
"50489"
]
],
"filtered_resps": [
"50489"
],
"exact_match": 1.0

SWA: [full, 16384, 16384, 16384]

"resps": [
[
"50489 is the pass key"
]
],
"filtered_resps": [
"50489 is the pass key"
],
"exact_match": 0.0

Questions

  • Any idea why the model outputs garbage at 125k context?
  • Do we need to update the RoPE scaling for longer contexts? (A hedged sketch of overriding it in vLLM follows below.)
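In case it helps the discussion, here is a hedged sketch of forcing a RoPE scaling override at load time in vLLM (the exact keys depend on the vLLM version, and the 4x YaRN factor is a placeholder, not a verified setting for this model):

from vllm import LLM

# Assumption: this vLLM build accepts a rope_scaling engine argument.
# "rope_type" may be spelled "type" in older versions.
llm = LLM(
    model="/path/to/model",  # placeholder path
    max_model_len=131072,
    rope_scaling={"rope_type": "yarn", "factor": 4.0},
)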
