+---
+inference: false
+license: llama2
+model_creator: WizardLM
+model_link: https://huggingface.co/WizardLM/WizardLM-70B-V1.0
+model_name: WizardLM 70B V1.0
+model_type: llama
+quantized_by: Thireus
+---
+# WizardLM 70B V1.0 – EXL2
+- Model creator: [WizardLM](https://huggingface.co/WizardLM)
+- FP32 Original model used for quantization: [WizardLM 70B V1.0](https://huggingface.co/WizardLM/WizardLM-70B-V1.0) – float32
+- FP16 Model used for quantization: [WizardLM 70B V1.0-HF](https://huggingface.co/simsim314/WizardLM-70B-V1.0-HF) – float16 of [WizardLM 70B V1.0](https://huggingface.co/WizardLM/WizardLM-70B-V1.0)
+- BF16 Model used for quantization: [WizardLM 70B V1.0-BF16](https://huggingface.co/Thireus/WizardLM-70B-V1.0-BF16) – bfloat16 of [WizardLM 70B V1.0](https://huggingface.co/WizardLM/WizardLM-70B-V1.0)
+## Models available:
+| Link | BITS (-b) | HEAD BITS (-hb) | MEASUREMENT LENGTH (-ml) | LENGTH (-l) | CAL DATASET (-c) | Size | ExLlamaV2 version | Max Context Length | Base Model | Layers | VRAM Min*** | VRAM Max*** | PPL** | Comments |
+| ------ | --------- | --------------- | ------------------------ | ----------- | ---------------- | ---- | ------- | ------------------ | ---- | ---- |------------------ | ------------------ | ------------------ | ---------------------------------------------------------------------------------- |
+| [here](https://huggingface.co/Thireus/WizardLM-70B-V1.0-FP32-4.0bpw-h6-exl2/) | 4.0 | 6 | 2048 | 2048 | [0000.parquet](https://huggingface.co/datasets/wikitext/tree/refs%2Fconvert%2Fparquet/wikitext-2-raw-v1/train)* | 33GB | [0.0.2](https://github.com/turboderp/exllamav2/tree/c0dd3412d59c0bc776264512bf76264e954c221d) | 4096 | [FP32](https://huggingface.co/WizardLM/WizardLM-70B-V1.0) | 80 | 39GB | 44GB | 4.15234375 | Good results |
+| [here](https://huggingface.co/Thireus/WizardLM-70B-V1.0-HF-4.0bpw-h6-exl2/) | 4.0 | 6 | 2048 | 2048 | [0000.parquet](https://huggingface.co/datasets/wikitext/tree/refs%2Fconvert%2Fparquet/wikitext-2-raw-v1/train)* | 35GB | [0.0.1](https://github.com/turboderp/exllamav2/tree/aee7a281708d5faff2ad0ea4b3a3a4b754f458f3) | 4096 | [FP16](https://huggingface.co/simsim314/WizardLM-70B-V1.0-HF) | 80 | 40GB | 44GB | 4.1640625 | Model suffers from poor prompt understanding and logic is affected |
+| [here](https://huggingface.co/Thireus/WizardLM-70B-V1.0-BF16-4.0bpw-h6-exl2/) | 4.0 | 6 | 2048 | 2048 | [0000.parquet](https://huggingface.co/datasets/wikitext/tree/refs%2Fconvert%2Fparquet/wikitext-2-raw-v1/train)* | 33GB | [0.0.2](https://github.com/turboderp/exllamav2/tree/ec5164b8a8e282b91aedb2af94dfeb89887656b7) | 4096 | [BF16](https://huggingface.co/Thireus/WizardLM-70B-V1.0-BF16) | 80 | 39GB | 44GB | 4.2421875 | Model suffers from poor prompt understanding and logic is affected |
+| [here](https://huggingface.co/Thireus/WizardLM-70B-V1.0-HF-4.0bpw-h8-exl2/) | 4.0 | 8 | 2048 | 2048 | [0000.parquet](https://huggingface.co/datasets/wikitext/tree/refs%2Fconvert%2Fparquet/wikitext-2-raw-v1/train)* | 35GB | [0.0.2](https://github.com/turboderp/exllamav2/tree/a4f2663e310919f007c593030d56ca110f99c261) | 4096 | [FP16](https://huggingface.co/simsim314/WizardLM-70B-V1.0-HF) | 80 | 39GB | 44GB | 4.24609375 | Model suffers from poor prompt understanding and logic is affected |
+| [here](https://huggingface.co/Thireus/WizardLM-70B-V1.0-FP32-5.0bpw-h6-exl2/) | 5.0 | 6 | 2048 | 2048 | [0000.parquet](https://huggingface.co/datasets/wikitext/tree/refs%2Fconvert%2Fparquet/wikitext-2-raw-v1/train)* | 41GB | [0.0.2](https://github.com/turboderp/exllamav2/tree/c0dd3412d59c0bc776264512bf76264e954c221d) | 4096 | [FP32](https://huggingface.co/WizardLM/WizardLM-70B-V1.0) | 80 | 47GB | 52GB | 4.06640625 | Best so far. Good results |
+| [here](https://huggingface.co/Thireus/WizardLM-70B-V1.0-HF-5.0bpw-h8-exl2/) | 5.0 | 8 | 2048 | 2048 | [0000.parquet](https://huggingface.co/datasets/wikitext/tree/refs%2Fconvert%2Fparquet/wikitext-2-raw-v1/train)* | 44GB | [0.0.2](https://github.com/turboderp/exllamav2/tree/a4f2663e310919f007c593030d56ca110f99c261) | 4096 | [FP16](https://huggingface.co/simsim314/WizardLM-70B-V1.0-HF) | 80 | 48GB | 52GB | 4.09765625 | Model suffers from poor prompt understanding and logic is affected |
+| [here](https://huggingface.co/Thireus/WizardLM-70B-V1.0-HF-5.0bpw-h6-exl2/) | 5.0 | 6 | 2048 | 2048 | [0000.parquet](https://huggingface.co/datasets/wikitext/tree/refs%2Fconvert%2Fparquet/wikitext-2-raw-v1/train)* | 44GB | [0.0.1](https://github.com/turboderp/exllamav2/tree/aee7a281708d5faff2ad0ea4b3a3a4b754f458f3) | 4096 | [FP16](https://huggingface.co/simsim314/WizardLM-70B-V1.0-HF) | 80 | 48GB | 52GB | 4.0625 | Model suffers from poor prompt understanding and logic is affected |
+| [here](https://huggingface.co/Thireus/WizardLM-70B-V1.0-BF16-5.0bpw-h6-exl2/) | 5.0 | 6 | 2048 | 2048 | [0000.parquet](https://huggingface.co/datasets/wikitext/tree/refs%2Fconvert%2Fparquet/wikitext-2-raw-v1/train)* | 41GB | [0.0.2](https://github.com/turboderp/exllamav2/tree/ec5164b8a8e282b91aedb2af94dfeb89887656b7) | 4096 | [BF16](https://huggingface.co/Thireus/WizardLM-70B-V1.0-BF16) | 80 | 47GB | 52GB | 4.09765625 | Model suffers from poor prompt understanding and logic is affected |
+| [here](https://huggingface.co/Thireus/WizardLM-70B-V1.0-HF-6.0bpw-h6-exl2/) | 6.0 | 6 | 2048 | 2048 | [0000.parquet](https://huggingface.co/datasets/wikitext/tree/refs%2Fconvert%2Fparquet/wikitext-2-raw-v1/train)* | 49GB | [0.0.2](https://github.com/turboderp/exllamav2/tree/fae6fb296c6db4e3b1314c49c030541bed98acb9) | 4096 | [FP16](https://huggingface.co/simsim314/WizardLM-70B-V1.0-HF) | 80 | 56GB | 60GB | 4.0703125 | Model suffers from poor prompt understanding and logic is affected |
+\* wikitext-2-raw-v1
+\*\* Evaluated with text-generation-webui ExLlama v0.0.2 on wikitext-2-raw-v1 (stride 512 and max_length 0). For reference, [TheBloke_WizardLM-70B-V1.0-GPTQ_gptq-4bit-32g-actorder_True](https://huggingface.co/TheBloke/WizardLM-70B-V1.0-GPTQ/tree/gptq-4bit-32g-actorder_True) scores a perplexity of 4.1015625.
+\*\*\* Values are without Flash Attention. To reduce VRAM usage, install Flash Attention: https://github.com/Dao-AILab/flash-attention#installation-and-features
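As a reminder of what the PPL column measures, perplexity is the exponential of the mean per-token negative log-likelihood, so lower is better. The sketch below shows only that definition. The `perplexity` helper and the sample values are hypothetical; this is not the text-generation-webui evaluation code, and it leaves out the strided windowing mentioned in the footnote.

```python
import math


def perplexity(token_logprobs):
    """Return exp(mean negative log-likelihood) for natural-log token probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# Made-up log-probabilities for illustration only (not real model output).
sample_logprobs = [math.log(0.25)] * 4
print(perplexity(sample_logprobs))
```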
+## Description:
+_This repository contains EXL2 model files for [WizardLM's WizardLM 70B V1.0](https://huggingface.co/WizardLM/WizardLM-70B-V1.0)._
+EXL2 is a new format used by [ExLlamaV2](https://github.com/turboderp/exllamav2). It is based on the same optimization method as GPTQ and allows mixing quantization levels within a model to achieve any average bitrate between 2 and 8 bits per weight.
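To give a feel for what an average bitrate means for file size, here is a rough back-of-the-envelope sketch. It assumes a nominal 70 billion weights (the exact parameter count is not stated in this card) and ignores the separately quantized head layer and any file overhead, so treat the output as a ballpark figure only. The function name is illustrative.

```python
GIB = 1024 ** 3  # bytes per GiB


def estimated_weight_size_gib(n_params, bits_per_weight):
    """Approximate size of the quantized weights in GiB."""
    if not 2 <= bits_per_weight <= 8:
        raise ValueError("EXL2 average bitrate is expected to be between 2 and 8 bpw")
    return n_params * bits_per_weight / 8 / GIB


NOMINAL_PARAMS = 70e9  # assumption: nominal "70B", not an exact count

for bpw in (4.0, 5.0, 6.0):
    size = estimated_weight_size_gib(NOMINAL_PARAMS, bpw)
    print(f"{bpw} bpw -> ~{size:.1f} GiB")
```

Under these assumptions the estimates come out in the same ballpark as the Size column in the table above.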
+## Prompt template (official):
+```
+A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: {prompt} ASSISTANT:
+```
+## Prompt template (suggested):
+```
+A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
+USER:
+{prompt}
+ASSISTANT:
+```
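If you assemble prompts in code, the two templates above can be filled in with a small helper like the following sketch. The `build_prompt` name and the `official` switch are illustrative choices, not part of any library.

```python
SYSTEM = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions."
)


def build_prompt(user_message, official=False):
    """Fill in the official (single-line) or suggested (multi-line) template."""
    if official:
        return f"{SYSTEM} USER: {user_message} ASSISTANT:"
    return f"{SYSTEM}\nUSER:\n{user_message}\nASSISTANT:"


print(build_prompt("What is quantization?"))
```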
+## Quantization process:
+| Original Model | → | (optional) float16 or bfloat16 Model* | → | Safetensors Model** | → | EXL2 Model |
+| -------------- | --- | ------------- | --- | ---------------- | --- | ---------- |
+| [WizardLM 70B V1.0](https://huggingface.co/WizardLM/WizardLM-70B-V1.0) | → | [WizardLM 70B V1.0-HF](https://huggingface.co/simsim314/WizardLM-70B-V1.0-HF)* | → | Safetensors** | → | EXL2 |
+Example to convert WizardLM-70B-V1.0-HF to EXL2 4.0 bpw with 6-bit head:
+```
+mkdir -p ~/EXL2/WizardLM-70B-V1.0-HF_4bit # Create the output directory
+python convert.py -i ~/float16_safetensored/WizardLM-70B-V1.0-HF -o ~/EXL2/WizardLM-70B-V1.0-HF_4bit -c ~/EXL2/0000.parquet -b 4.0 -hb 6
+```
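To produce several bitrates from the same source model, the command above can be generated per target. The sketch below only builds and prints the commands using the flags shown above (`-i`, `-o`, `-c`, `-b`, `-hb`); it does not run them. The helper name and the output directory naming are assumptions made for illustration.

```python
import shlex
from pathlib import Path


def build_convert_command(model_dir, out_dir, cal_parquet, bits, head_bits):
    """Build the argv list for convert.py with the flags used in the example above."""
    return [
        "python", "convert.py",
        "-i", str(model_dir),
        "-o", str(out_dir),
        "-c", str(cal_parquet),
        "-b", str(bits),
        "-hb", str(head_bits),
    ]


home = Path.home()
model_dir = home / "float16_safetensored" / "WizardLM-70B-V1.0-HF"
cal_parquet = home / "EXL2" / "0000.parquet"

for bits in (4.0, 5.0, 6.0):
    # Hypothetical naming scheme for the per-bitrate output directories.
    out_dir = home / "EXL2" / f"WizardLM-70B-V1.0-HF_{bits}bpw"
    cmd = build_convert_command(model_dir, out_dir, cal_parquet, bits, head_bits=6)
    print(shlex.join(cmd))
```

Each printed line can be pasted into a shell from the ExLlamaV2 checkout once the output directory exists.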
+\* Use the following script to convert your local pytorch_model bin files to float16 (or bfloat16) and to safetensors in one go:
+- https://github.com/oobabooga/text-generation-webui/blob/main/convert-to-safetensors.py (best for sharding and float16/FP16 or bfloat16/BF16 conversion)
+Example to convert [WizardLM 70B V1.0](https://huggingface.co/WizardLM/WizardLM-70B-V1.0) directly to float16 safetensors in 10GB shards:
+```
+python convert-to-safetensors.py ~/original/WizardLM-70B-V1.0 --output ~/float16_safetensored/WizardLM-70B-V1.0 --max-shard-size 10GB
+```
+Use `--bf16` if you'd like to try bfloat16 instead, but note that there are concerns about quantization quality – https://github.com/turboderp/exllamav2/issues/30#issuecomment-1719009289
+\*\* Use any one of the following scripts to convert your local pytorch_model bin files to safetensors:
+- https://github.com/turboderp/exllamav2/blob/master/util/convert_safetensors.py (official ExLlamaV2)
+- https://huggingface.co/Panchovix/airoboros-l2-70b-gpt4-1.4.1-safetensors/blob/main/bin2safetensors/convert.py (recommended)
+- https://gist.github.com/epicfilemcnulty/1f55fd96b08f8d4d6693293e37b4c55e#file-2safetensors-py
+## Further reading:
+- https://mlabonne.github.io/blog/posts/Introduction_to_Weight_Quantization.html