ProphetOfBostrom committed
Commit 4a3634a · Parent(s): 7ff19bb
less junk and tags for readme
README.md CHANGED
@@ -1,41 +1,29 @@
+ ---
license: cc-by-nc-4.0
+ library_name: transformers
+ pipeline_tag: text-generation
---
## NeverSleep's [Noromaid-v0.4-Mixtral-Instruct-8x7b-Zloss](https://huggingface.co/NeverSleep/Noromaid-v0.4-Mixtral-Instruct-8x7b-Zloss) but 17GB at 2BPW+
### the other 14 shannons will be remembered. [HQQ quantized](https://mobiusml.github.io/hqq_blog/) to 2 bits with 4 bit attention. Fits on a 3090 with room to grow. Supports full 32k context. I will not combine those assertions.
- The attention tensors are 4 bit because mixtral reuses it for each expert - so it's only adding 0.4 GB and the quality improve dramatically. See [this](https://huggingface.co/mobiuslabsgmbh/Mixtral-8x7B-v0.1-hf-attn-4bit-moe-2bit-HQQ) but
+ The attention tensors are 4 bit because Mixtral reuses them for each expert - so they only add 0.4 GB and the quality improves dramatically. See [this](https://huggingface.co/mobiuslabsgmbh/Mixtral-8x7B-v0.1-hf-attn-4bit-moe-2bit-HQQ) but horny and dying of chatml m<|alig>|nant tokenitis.|>
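If you're wondering what that mix looks like as an actual config, here's a minimal sketch in the style of the mobius Mixtral card linked above. The module names and group sizes are my assumptions from their example, not necessarily this repo's exact recipe:

```python
# Sketch of a mixed 2/4-bit HQQ config, after the mobius Mixtral-8x7B card.
# Group sizes and module names are assumptions, not this repo's exact recipe.
from hqq.core.quantize import BaseQuantizeConfig

# Attention is shared by all 8 experts, so 4-bit here only costs ~0.4 GB total
attn_params    = BaseQuantizeConfig(nbits=4, group_size=64)
# The expert MLPs hold most of the ~45B params, so they get squeezed to 2-bit
experts_params = BaseQuantizeConfig(nbits=2, group_size=16)

quant_config = {
    "self_attn.q_proj": attn_params,
    "self_attn.k_proj": attn_params,
    "self_attn.v_proj": attn_params,
    "self_attn.o_proj": attn_params,
    "block_sparse_moe.experts.w1": experts_params,
    "block_sparse_moe.experts.w2": experts_params,
    "block_sparse_moe.experts.w3": experts_params,
}
```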
- ### This is a 2+4 bit quantization of noromixmaidblah (just scroll down) using an emerging and aparrently very robust quantization method Half-Quadratic Quantisation. It ultimately squeezes it's tokens out of HF Transformers, not one ofthe *lesser* inference tools. So what's juicy about this is that it *functions* with full Transformers sampler and tokeniser support but you only need a 3090 instead of a H100! Truly emancipatory
+ ### This is a 2+4 bit quantization of noromixmaidblah (just scroll down) using an emerging and apparently very robust quantization method, Half-Quadratic Quantisation. It ultimately squeezes its tokens out of HF Transformers, not one of the *lesser* inference tools. So what's juicy about this is that it *functions* with full Transformers sampler and tokeniser support, but you only need a 3090 instead of an H100! Truly emancipatory.
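What using it should look like, as a sketch: this assumes the `HQQModelForCausalLM` wrapper from the HQQ repo behaves here the way it does in the mobius card, and the repo id, prompt, and sampling settings are placeholders of mine:

```python
# Sketch: load the 2+4 bit quant on one 24 GB GPU and sample with the full
# Transformers machinery. Assumes hqq's HF wrapper; repo id is a placeholder.
from transformers import AutoTokenizer
from hqq.engine.hf import HQQModelForCausalLM

repo_id = "<this-repo>"  # placeholder for this repo's id
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = HQQModelForCausalLM.from_quantized(repo_id)

prompt = "Write a short scene in the Noromaid style."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
# penalty_alpha + top_k switches Transformers into contrastive search, baybee
output = model.generate(**inputs, max_new_tokens=128, penalty_alpha=0.6, top_k=4)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```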
...I'll do something smaller next time.
My unwitting and presumably unwilling collaborators were the very clever people at [mobiusml - see their freaky maths at their github blog mini paper thing for HQQ](https://github.com/mobiusml/hqq). It's compatible with HF Transformers (including contrastive search baybee!) and is supported out of the box (I think) on text-generation-webui.
For mobius's own description of what this is, see the template I followed, their quantization of a vanilla mixtral at [mobiuslabsgmbh/Mixtral-8x7B-v0.1-hf-attn-4bit-moe-2bit-HQQ](https://huggingface.co/mobiuslabsgmbh/Mixtral-8x7B-v0.1-hf-attn-4bit-moe-2bit-HQQ)
+ + my best guess at parsing the HQQ source is that it works by sort of... 'JIT de-quanti-' I have no idea, really. If you prefer talking to human beings to being lied to by language models (why are you here?) you could probably ask the MobiusML people - they seem friendly, and compsci/engineer types tend to enjoy talking about their research and development. Weirdos.
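For what it's worth, that guess matches my reading: the packed low-bit weights stay put in VRAM and get expanded on the fly, per layer, inside the forward pass. A toy illustration of the idea only - this is not HQQ's actual code or kernels:

```python
import torch

# Toy "JIT dequantization": weights live as low-bit integer codes plus
# scale/zero metadata, and are expanded to the compute dtype only for the
# matmul. The shape of the idea, not HQQ's implementation.
class JITDequantLinear(torch.nn.Module):
    def __init__(self, w_q, scale, zero):
        super().__init__()
        self.register_buffer("w_q", w_q)      # quantized weight codes (e.g. 2-bit in uint8)
        self.register_buffer("scale", scale)  # per-group scale
        self.register_buffer("zero", zero)    # per-group zero point

    def forward(self, x):
        # Dequantize just in time for this matmul; the fp copy is then freed.
        w = (self.w_q.to(x.dtype) - self.zero) * self.scale
        return x @ w.T
```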
I *think* this is a functioning quant from one of everyone's favorite norovirus-inspired language models, Noromaid. I wouldn't know - I can't load 90 gigabytes of BF16, so this is my first few minutes too.
- ####
- On the off chance someone at Mobius sees this - please don't ask transformers to load a 45B param model onto the CPU if you're not actually going to... call the model at all? It took ten minutes at SATA 2 speeds - and that was because it was padded to FP32 (CPU mode, right?).
- ```45 Gigaweights \* 2 Bytes per weight \* fp32/bf16 = 180 GB of system memory allocated.```
- I wish I had one of those.
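(In case the units in that line look odd: it's the fp32 padding that doubles bf16's two bytes per weight. The arithmetic does check out:)

```python
params = 45e9                     # ~45B weights in Mixtral 8x7B
bytes_per_weight = 2 * (32 / 16)  # bf16's 2 bytes, padded out to fp32 on CPU
print(params * bytes_per_weight / 1e9)  # 180.0 -> GB of system memory
```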
- \**May have been zswap's fault. I'm pretty sure 200MB/s and an idle CPU isn't the best you can hope for when you're doing sequential reads from a 4.0x4 NVMe device? My GPU fell asleep between optimization passes. It even has a Gamer LED on it. I'll fix my sysctl next time.
- + Try `$ python -i untitled.py`
- having saved that script from the mobius hf repo because you'll be spending a while in IDLE figuring out
- + `>>> model.save_quantized("/absolute/path/noromaid") `
- at the end, and trust me: quantizing something chunky and then watching python shred it because the save directory is somehow a recursive lambda function and not a string is heartbreaking. I don't know if it was supposed to emit more than the model.pt and the config.json, but I'm taking what I can get.
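If you're retracing those steps, the whole doomed-then-successful session boils down to something like this sketch. It assumes the `hqq` HF wrapper and the 2/4-bit `quant_config` sketched earlier; note the save path is a plain string:

```python
from hqq.engine.hf import HQQModelForCausalLM

# The ten-minutes-at-SATA-2 part: load the ~90 GB BF16 original.
model = HQQModelForCausalLM.from_pretrained(
    "NeverSleep/Noromaid-v0.4-Mixtral-Instruct-8x7b-Zloss"
)

# Quantize in place with the mixed 2/4-bit config sketched earlier.
model.quantize_model(quant_config=quant_config)

# The argument must be a plain string path - not a lambda, not anything
# clever - or you get to watch python shred your freshly quantized weights.
model.save_quantized("/absolute/path/noromaid")
```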
- ###### If anyone's looking to donate, I could do with an Epyc Rome and perhaps another pair of H100s? I've embedded my XMR address in the attention tensors with help from a really horny embedding, so when it starts generating gibberish right before the good stuff, just paste that into feather and send me all your money. Thanks! :)
- `i'm joking. that's a joke. I didn't do that.`
+ #### see my oom-killer nightmare log (my struggle with baby's first quant) in the other markdown file.
+ But even if you do want to know what I've learned - you're better off just asking me than trying to parse *that*.
+ Just read the original card please:
---
# Original README from the Neversleep twins follows: