RoPE frequency
The transformers repo suggests that this model has a RoPE frequency base of 1,000,000. However, there is no "qwen2.rope.freq_base" entry in the metadata according to gguf-dump.

Output of gguf-dump:
```
* Loading: qwen1_5-72b-chat-q5_k_m.gguf
* File is LITTLE endian, script is running on a LITTLE endian host.
* Dumping 23 key/value pair(s)
1: UINT32 | 1 | GGUF.version = 3
2: UINT64 | 1 | GGUF.tensor_count = 963
3: UINT64 | 1 | GGUF.kv_count = 20
4: STRING | 1 | general.architecture = 'qwen2'
5: STRING | 1 | general.name = 'Qwen2-beta-72B-Chat'
6: UINT32 | 1 | qwen2.block_count = 80
7: UINT32 | 1 | qwen2.context_length = 32768
8: UINT32 | 1 | qwen2.embedding_length = 8192
9: UINT32 | 1 | qwen2.feed_forward_length = 24576
10: UINT32 | 1 | qwen2.attention.head_count = 64
11: UINT32 | 1 | qwen2.attention.head_count_kv = 64
12: FLOAT32 | 1 | qwen2.attention.layer_norm_rms_epsilon = 9.999999974752427e-07
13: BOOL | 1 | qwen2.use_parallel_residual = True
14: STRING | 1 | tokenizer.ggml.model = 'gpt2'
15: [STRING] | 152064 | tokenizer.ggml.tokens
16: [INT32] | 152064 | tokenizer.ggml.token_type
17: [STRING] | 151387 | tokenizer.ggml.merges
18: UINT32 | 1 | tokenizer.ggml.eos_token_id = 151643
19: UINT32 | 1 | tokenizer.ggml.padding_token_id = 151643
20: UINT32 | 1 | tokenizer.ggml.bos_token_id = 151643
21: STRING | 1 | tokenizer.chat_template = "{% for message in messages %}{{'<|im_start|>' + message['rol"
22: UINT32 | 1 | general.quantization_version = 2
23: UINT32 | 1 | general.file_type = 17
```
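For anyone who wants to check their own file programmatically rather than eyeballing the dump, here is a minimal sketch using the gguf Python package that ships with llama.cpp (assumes `pip install gguf`; the GGUFReader field layout used here is what recent versions of the package expose):

```python
# Minimal sketch: check a GGUF file for the RoPE frequency base key.
# Assumes the `gguf` package from llama.cpp's gguf-py (pip install gguf).
from gguf import GGUFReader

reader = GGUFReader("qwen1_5-72b-chat-q5_k_m.gguf")

key = "qwen2.rope.freq_base"
field = reader.get_field(key)
if field is None:
    print(f"{key} is missing from the metadata")
else:
    # Scalar values are stored in field.parts, indexed through field.data.
    print(f"{key} = {field.parts[field.data[0]][0]}")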
Yeah, the v1.5 models you can pull from https://ollama.ai/library/qwen are missing their RoPE frequency base too.
I've patched my Ollama build to allow setting rope_frequency_base in the Modelfile again, so I can fix this via:

```
PARAMETER rope_frequency_base 1000000
```
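For reference, the full Modelfile ends up looking something like this (the FROM tag below is just an illustrative example, substitute whatever model you are building from):

```
# Sketch of a Modelfile; the base model tag is an example, not the exact one I used
FROM qwen:72b-chat
PARAMETER rope_frequency_base 1000000
```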
It should also be possible to use gguf-set-metadata to do the same directly on the GGUF file.
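Something along these lines should do it (a sketch: installing the gguf package provides a gguf-set-metadata entry point, or you can run gguf-py/scripts/gguf_set_metadata.py from the llama.cpp tree; check its --help for the exact flags, and note that as far as I can tell it overwrites values in place, so it may not be able to add a key that is missing entirely):

```
gguf-set-metadata qwen1_5-72b-chat-q5_k_m.gguf qwen2.rope.freq_base 1000000
```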
I can confirm this does seem to work: without the setting, the model just ends up outputting repeating newlines after a while, possibly because the default base is 10,000 (?), which would make the context 'appear' to fill up 100x quicker to the model. Hopefully this gets fixed soon, as I bet a lot of people are running into this problem (@TheBloke or @LoneStriker should hopefully soon upload a version with the correct value baked in).
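For intuition, this is just the textbook RoPE formulation, not anything specific to this model: each pair of embedding dimensions is rotated at a frequency derived from the base,

$$\theta_i = \mathrm{base}^{-2i/d}, \qquad i = 0, 1, \dots, \tfrac{d}{2} - 1,$$

so dropping the base from 1,000,000 to 10,000 raises every rotation frequency, by up to a factor of 100 for the slowest dimensions, meaning long-range positional signals wrap around roughly 100x sooner than the model was trained to expect.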
Looks like this has been fixed now:
https://twitter.com/justinlin610/status/1757811183707681197?s=46&t=BVhfPLwVzzqRJOcJ7VU3tw
Yes, I have fixed this. I am now also asking Ollama to follow my setup.