Passkey evaluation on FlashInfer backend
#16 · opened by joejose2728
I'm trying to evaluate this model on the passkey dataset with 32k and 125k token contexts.
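For reference, a passkey prompt buries a random number in filler text and asks the model to retrieve it. Below is a minimal sketch of such a prompt builder (function and constant names are hypothetical; this is not the lm-eval task's actual implementation):

```python
import random

# Hypothetical filler text repeated to pad the prompt to the target length.
FILLER = ("The grass is green. The sky is blue. The sun is yellow. "
          "Here we go. There and back again. ")

def build_passkey_prompt(target_tokens: int, chars_per_token: float = 4.0):
    """Bury a random 5-digit key in filler padded to roughly target_tokens."""
    key = str(random.randint(10000, 99999))
    needle = f"The pass key is {key}. Remember it. {key} is the pass key. "
    n_filler = max(1, int(target_tokens * chars_per_token / len(FILLER)))
    insert_at = random.randint(0, n_filler - 1)
    haystack = FILLER * insert_at + needle + FILLER * (n_filler - insert_at)
    prompt = haystack + "\nWhat is the pass key? The pass key is"
    return prompt, key

prompt_32k, key = build_passkey_prompt(32_000)
```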
Setup:
- FlashInfer backend
- Updated vLLM's llama.py to support interleaved SWA, with sliding-window sizes [full, 32768, 32768, 32768] (see the sketch after this list)
- lm-eval 0.4.4
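Roughly, the per-layer mapping I have in mind expands a repeating pattern across the layer stack. This is only a sketch of that pattern-expansion logic (names are made up; it is not the actual llama.py diff):

```python
from typing import Optional

# Repeating pattern: layer 0 uses full attention (None), layers 1-3 use a
# 32768-token sliding window, then the pattern repeats.
SWA_PATTERN: list = [None, 32768, 32768, 32768]

def window_for_layer(layer_idx: int) -> Optional[int]:
    """Map a layer index to its sliding-window size (None == full attention)."""
    return SWA_PATTERN[layer_idx % len(SWA_PATTERN)]

# For a 32-layer model this yields 8 full-attention layers and 24 windowed ones.
per_layer_windows = [window_for_layer(i) for i in range(32)]
```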
Results:
Output from the model for 125k context:
  "resps": [["<unk><unk><unk><unk><unk><unk><unk><unk><unk><unk>"]],
  "filtered_resps": ["<unk><unk><unk><unk><unk><unk><unk><unk><unk><unk>"],
Output from the model for 32k context:
SWA [full, 32768, 32768, 32768]:
  "resps": [["50489"]],
  "filtered_resps": ["50489"],
  "exact_match": 1.0
SWA [full, 16384, 16384, 16384]:
  "resps": [["50489 is the pass key"]],
  "filtered_resps": ["50489 is the pass key"],
  "exact_match": 0.0
Questions
- Any idea why the model outputs only <unk> tokens at 125k?
- Do we need to update RoPE scaling for longer contexts? (A hedged configuration sketch follows.)
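If RoPE scaling does need updating, one knob is vLLM's rope_scaling engine argument. A hedged sketch follows: the YaRN type and factor of 4 are illustrative guesses rather than verified settings for this model, the dict key names vary across vLLM versions ("rope_type" in newer ones, "type" in older ones), and the model path is a placeholder:

```python
from vllm import LLM, SamplingParams

# Hedged sketch: override RoPE scaling when stretching a 32k-trained model
# toward 128k. The dict schema mirrors the HF config's rope_scaling field;
# the YaRN type and factor of 4.0 are illustrative assumptions only.
llm = LLM(
    model="path/to/model",  # hypothetical model path
    max_model_len=131072,
    rope_scaling={
        "rope_type": "yarn",
        "factor": 4.0,
        "original_max_position_embeddings": 32768,
    },
)

outputs = llm.generate(
    ["... What is the pass key? The pass key is"],
    SamplingParams(max_tokens=16, temperature=0.0),
)
```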