mukel committed on
Commit
061ac72
1 Parent(s): cb3e437

Update README.md

  1. README.md +25 -3
README.md CHANGED
@@ -1,3 +1,25 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ tags:
+ - java
+ - qwen2
+ - qwen2.java
+ ---
+ # Pure quantizations of `Qwen2-Math-7B-Instruct` for [qwen2.java](https://github.com/mukel/qwen2.java).
+
+ Q8_0 quantizations found in the wild are fine, but Q4_0 quantizations are rarely pure; for example, the output.weights tensor is quantized with Q6_K instead of Q4_0.
+ A pure Q4_0 quantization can be generated from a high-precision (F32, F16, BFLOAT16) .gguf source with the quantize utility from llama.cpp as follows:
+
+ ```
+ ./quantize --pure ./Qwen2-7B-Math-Instruct-F16.gguf ./Qwen2-7B-Math-Instruct-Q4_0.gguf Q4_0
+ ```
+
+ Original model: [https://huggingface.co/Qwen/Qwen2-Math-7B-Instruct](https://huggingface.co/Qwen/Qwen2-Math-7B-Instruct)
+
+
+ ## Model Details
+
+
+ For more details, please refer to the original [blog post](https://qwenlm.github.io/blog/qwen2-math/) and [GitHub repo](https://github.com/QwenLM/Qwen2-Math).
+
+
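
The README's notion of a "pure" quantization, where every quantized tensor uses the same type, can be illustrated with a small check. This is only a sketch and not part of the commit: the helper `impure_tensors` and the tensor listing are hypothetical. It also assumes that 1-D tensors (such as norm weights) stay in F32 even in a pure file, so it skips them.

```python
def impure_tensors(tensors, target="Q4_0"):
    """Return (name, type) pairs for multi-dimensional tensors not stored as `target`.

    `tensors` is an iterable of (name, quant_type, n_dims) tuples.
    Assumption: 1-D tensors are left unquantized (F32) even in a pure file,
    so they are ignored here.
    """
    return [
        (name, qtype)
        for name, qtype, n_dims in tensors
        if n_dims >= 2 and qtype != target
    ]


# Hypothetical tensor listing for a mixed Q4_0 file (names are illustrative).
mixed = [
    ("token_embd.weight", "Q4_0", 2),
    ("blk.0.attn_norm.weight", "F32", 1),
    ("blk.0.attn_q.weight", "Q4_0", 2),
    ("output.weight", "Q6_K", 2),
]

print(impure_tensors(mixed))  # [('output.weight', 'Q6_K')]
```

An empty result would indicate a file that is pure under these assumptions. The real tensor names and types would have to come from reading the .gguf file itself.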