Default context setting seems borked
llama.cpp (built from git HEAD) thinks the Q8_0 quant has a context of 1024000, which doesn't match any of the numbers I see in the model card.
[1722691579] llama_model_loader: - kv 16: llama.context_length u32 = 1024000
[1722691579] llm_load_print_meta: n_ctx_train = 1024000
[1722691579] llm_load_print_meta: n_ctx_orig_yarn = 1024000
[1722691579] llama_new_context_with_model: n_ctx = 1024000
Using the command-line default of -c 0 hung it and hard-crashed my MacBook, probably because of the memory allocation:
[1722691589] llama_kv_cache_init: Metal KV buffer size = 160000.00 MiB
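For what it's worth, that 160000 MiB figure is consistent with a full-length f16 KV cache. A quick back-of-envelope check (assuming Nemo's shape of 40 layers, 8 KV heads, and head dim 128, which the log doesn't show) reproduces it exactly:

```python
# Back-of-envelope size of llama.cpp's default f16 KV cache.
# Assumed model shape (Mistral Nemo): 40 layers, 8 KV heads, head dim 128.
n_layer, n_kv_heads, head_dim = 40, 8, 128
n_ctx = 1024000                 # llama.context_length from the metadata
bytes_per_elem = 2              # f16
kv_bytes = 2 * n_layer * n_ctx * n_kv_heads * head_dim * bytes_per_elem  # 2 = K and V
print(kv_bytes / 2**20)         # 160000.0 MiB, matching the log line above
```

At -c 131072 the same math gives 20480 MiB, which would explain why that setting survives.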
-c 131072 seems to work fine. Was 131072 intended instead?
I'm a bit unsure in this case: as far as I know from the model card, 131072 is the max context the model supports.
The 1024000 value was decided automatically by the conversion script, based on the max_position_embeddings defined in the original model's config.json.
I looked at other Nemo quants out there and they all appear to have this quirk. I'm unsure whether I should fake the value in config.json and make a new quant, since that seems like it may break more than it solves.
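For what it's worth, the value can in principle be patched in the GGUF metadata itself rather than re-quantizing. An untested sketch using the gguf package from llama.cpp's gguf-py (field access per my reading of its GGUFReader API; back the file up first, since this writes into the model directly):

```python
# Untested sketch: clamp llama.context_length in an existing GGUF in
# place, instead of re-converting. Requires: pip install gguf
from gguf import GGUFReader

path = 'Mistral-Nemo-Q8_0.gguf'                    # placeholder filename
reader = GGUFReader(path, 'r+')                    # memory-mapped read/write
field = reader.get_field('llama.context_length')
print('before:', field.parts[field.data[0]][0])    # expect 1024000
field.parts[field.data[0]][0] = 131072             # write the new scalar in place
print('after:', field.parts[field.data[0]][0])
```

llama.cpp's gguf-py also ships a gguf_set_metadata.py script that appears to do the same thing, if you'd rather not hand-roll it.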
For now I recommend launching with -c 131072.
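Something like the following (binary and model filename are placeholders, adjust to your build and file):

```
./llama-cli -m Mistral-Nemo-Q8_0.gguf -c 131072
```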