	---
	inference: false
	license: llama2
	model_creator: WizardLM
	model_link: https://huggingface.co/WizardLM/WizardLM-70B-V1.0
	model_name: WizardLM 70B V1.0
	model_type: llama
	quantized_by: Thireus
	---

	# WizardLM 70B V1.0 – EXL2
	- Model creator: [WizardLM](https://huggingface.co/WizardLM)
	- FP32 Original model used for quantization: [WizardLM 70B V1.0](https://huggingface.co/WizardLM/WizardLM-70B-V1.0) – float32
	- FP16 Model used for quantization: [WizardLM 70B V1.0-HF](https://huggingface.co/simsim314/WizardLM-70B-V1.0-HF) – float16 of [WizardLM 70B V1.0](https://huggingface.co/WizardLM/WizardLM-70B-V1.0)
	- BF16 Model used for quantization: [WizardLM 70B V1.0-BF16](https://huggingface.co/Thireus/WizardLM-70B-V1.0-BF16) – bfloat16 of [WizardLM 70B V1.0](https://huggingface.co/WizardLM/WizardLM-70B-V1.0)

	## Models available:

	\| Link \| BITS (-b) \| HEAD BITS (-hb) \| MEASUREMENT LENGTH (-ml) \| LENGTH (-l) \| CAL DATASET (-c) \| Size \| V. \| Max Context Length \| Base Model \| Layers \| VRAM Min*** \| VRAM Max*** \| PPL** \| Comments \|
	\| ------ \| --------- \| --------------- \| ------------------------ \| ----------- \| ---------------- \| ---- \| ------- \| ------------------ \| ---- \| ---- \|------------------ \| ------------------ \| ------------------ \| ---------------------------------------------------------------------------------- \|
	\| [here](https://huggingface.co/Thireus/WizardLM-70B-V1.0-FP32-4.0bpw-h6-exl2/) \| 4.0 \| 6 \| 2048 \| 2048 \| [0000.parquet](https://huggingface.co/datasets/wikitext/tree/refs%2Fconvert%2Fparquet/wikitext-2-raw-v1/train)* \| 33GB \| [0.0.2](https://github.com/turboderp/exllamav2/tree/c0dd3412d59c0bc776264512bf76264e954c221d) \| 4096 \| [FP32](https://huggingface.co/WizardLM/WizardLM-70B-V1.0) \| 80 \| 39GB \| 44GB \| 4.15234375 \| Good results \|
	\| [here](https://huggingface.co/Thireus/WizardLM-70B-V1.0-HF-4.0bpw-h6-exl2/) \| 4.0 \| 6 \| 2048 \| 2048 \| [0000.parquet](https://huggingface.co/datasets/wikitext/tree/refs%2Fconvert%2Fparquet/wikitext-2-raw-v1/train)* \| 35GB \| [0.0.1](https://github.com/turboderp/exllamav2/tree/aee7a281708d5faff2ad0ea4b3a3a4b754f458f3) \| 4096 \| [FP16](https://huggingface.co/simsim314/WizardLM-70B-V1.0-HF) \| 80 \| 40GB \| 44GB \| 4.1640625 \| Model suffers from poor prompt understanding and logic is affected \|
	\| [here](https://huggingface.co/Thireus/WizardLM-70B-V1.0-BF16-4.0bpw-h6-exl2/) \| 4.0 \| 6 \| 2048 \| 2048 \| [0000.parquet](https://huggingface.co/datasets/wikitext/tree/refs%2Fconvert%2Fparquet/wikitext-2-raw-v1/train)* \| 33GB \| [0.0.2](https://github.com/turboderp/exllamav2/tree/ec5164b8a8e282b91aedb2af94dfeb89887656b7) \| 4096 \| [BF16](https://huggingface.co/Thireus/WizardLM-70B-V1.0-BF16) \| 80 \| 39GB \| 44GB \| 4.2421875 \| Model suffers from poor prompt understanding and logic is affected \|
	\| [here](https://huggingface.co/Thireus/WizardLM-70B-V1.0-HF-4.0bpw-h8-exl2/) \| 4.0 \| 8 \| 2048 \| 2048 \| [0000.parquet](https://huggingface.co/datasets/wikitext/tree/refs%2Fconvert%2Fparquet/wikitext-2-raw-v1/train)* \| 35GB \| [0.0.2](https://github.com/turboderp/exllamav2/tree/a4f2663e310919f007c593030d56ca110f99c261) \| 4096 \| [FP16](https://huggingface.co/simsim314/WizardLM-70B-V1.0-HF) \| 80 \| 39GB \| 44GB \| 4.24609375 \| Model suffers from poor prompt understanding and logic is affected \|
	\| [here](https://huggingface.co/Thireus/WizardLM-70B-V1.0-FP32-5.0bpw-h6-exl2/) \| 5.0 \| 6 \| 2048 \| 2048 \| [0000.parquet](https://huggingface.co/datasets/wikitext/tree/refs%2Fconvert%2Fparquet/wikitext-2-raw-v1/train)* \| 41GB \| [0.0.2](https://github.com/turboderp/exllamav2/tree/c0dd3412d59c0bc776264512bf76264e954c221d) \| 4096 \| [FP32](https://huggingface.co/WizardLM/WizardLM-70B-V1.0) \| 80 \| 47GB \| 52GB \| 4.06640625 \| Best so far. Good results \|
	\| [here](https://huggingface.co/Thireus/WizardLM-70B-V1.0-HF-5.0bpw-h8-exl2/) \| 5.0 \| 8 \| 2048 \| 2048 \| [0000.parquet](https://huggingface.co/datasets/wikitext/tree/refs%2Fconvert%2Fparquet/wikitext-2-raw-v1/train)* \| 44GB \| [0.0.2](https://github.com/turboderp/exllamav2/tree/a4f2663e310919f007c593030d56ca110f99c261) \| 4096 \| [FP16](https://huggingface.co/simsim314/WizardLM-70B-V1.0-HF) \| 80 \| 48GB \| 52GB \| 4.09765625 \| Model suffers from poor prompt understanding and logic is affected \|
	\| [here](https://huggingface.co/Thireus/WizardLM-70B-V1.0-HF-5.0bpw-h6-exl2/) \| 5.0 \| 6 \| 2048 \| 2048 \| [0000.parquet](https://huggingface.co/datasets/wikitext/tree/refs%2Fconvert%2Fparquet/wikitext-2-raw-v1/train)* \| 44GB \| [0.0.1](https://github.com/turboderp/exllamav2/tree/aee7a281708d5faff2ad0ea4b3a3a4b754f458f3) \| 4096 \| [FP16](https://huggingface.co/simsim314/WizardLM-70B-V1.0-HF) \| 80 \| 48GB \| 52GB \| 4.0625 \| Model suffers from poor prompt understanding and logic is affected \|
	\| [here](https://huggingface.co/Thireus/WizardLM-70B-V1.0-BF16-5.0bpw-h6-exl2/) \| 5.0 \| 6 \| 2048 \| 2048 \| [0000.parquet](https://huggingface.co/datasets/wikitext/tree/refs%2Fconvert%2Fparquet/wikitext-2-raw-v1/train)* \| 41GB \| [0.0.2](https://github.com/turboderp/exllamav2/tree/ec5164b8a8e282b91aedb2af94dfeb89887656b7) \| 4096 \| [BF16](https://huggingface.co/Thireus/WizardLM-70B-V1.0-BF16) \| 80 \| 47GB \| 52GB \| 4.09765625 \| Model suffers from poor prompt understanding and logic is affected \|
	\| [here](https://huggingface.co/Thireus/WizardLM-70B-V1.0-HF-6.0bpw-h6-exl2/) \| 6.0 \| 6 \| 2048 \| 2048 \| [0000.parquet](https://huggingface.co/datasets/wikitext/tree/refs%2Fconvert%2Fparquet/wikitext-2-raw-v1/train)* \| 49GB \| [0.0.2](https://github.com/turboderp/exllamav2/tree/fae6fb296c6db4e3b1314c49c030541bed98acb9) \| 4096 \| [FP16](https://huggingface.co/simsim314/WizardLM-70B-V1.0-HF) \| 80 \| 56GB \| 60GB \| 4.0703125 \| Model suffers from poor prompt understanding and logic is affected \|


	\* wikitext-2-raw-v1

	\*\* Evaluated with text-generation-webui ExLlama v0.0.2 on wikitext-2-raw-v1 (stride 512 and max_length 0). For reference, [TheBloke_WizardLM-70B-V1.0-GPTQ_gptq-4bit-32g-actorder_True](https://huggingface.co/TheBloke/WizardLM-70B-V1.0-GPTQ/tree/gptq-4bit-32g-actorder_True) has a perplexity of 4.1015625.

	\*\*\* Without Flash Attention. To reduce VRAM usage, install Flash Attention: https://github.com/Dao-AILab/flash-attention#installation-and-features
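
The PPL column is a perplexity score: the exponential of the average negative log-likelihood per token, so lower is better. The sketch below shows only that definition in Python. It is not the text-generation-webui evaluation code, and the `perplexity` helper and its input list are illustrative assumptions.

```python
import math


def perplexity(token_log_probs):
    """Return exp(mean negative log-likelihood) for natural-log token probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Toy input: a model that gives every token probability 0.25.
# Its perplexity is (up to floating-point rounding) 4, i.e. it is as
# uncertain as a uniform choice among four tokens.
uniform_log_probs = [math.log(0.25)] * 8
print(perplexity(uniform_log_probs))
```

A real evaluation would obtain the log-probabilities from the model over the wikitext-2-raw-v1 text, window by window.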

	## Description:

	_This repository contains EXL2 model files for [WizardLM's WizardLM 70B V1.0](https://huggingface.co/WizardLM/WizardLM-70B-V1.0)._

	EXL2 is a new format used by ExLlamaV2 – https://github.com/turboderp/exllamav2. EXL2 is based on the same optimization method as GPTQ. The format allows for mixing quantization
	levels within a model to achieve any average bitrate between 2 and 8 bits per weight.
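
To make the "average bitrate" idea concrete, here is a rough Python sketch of the arithmetic. The function names, the two-way split of weights, and the nominal 70e9 parameter count are assumptions for illustration only; the real format mixes quantization levels at a finer granularity and also stores the head layer and metadata, so actual file sizes differ somewhat.

```python
def average_bpw(groups):
    """Average bits per weight for (num_weights, bits_per_weight) pairs."""
    total_weights = sum(n for n, _ in groups)
    total_bits = sum(n * bits for n, bits in groups)
    return total_bits / total_weights


def approx_size_gb(num_weights, bpw):
    """Rough size in GB (10^9 bytes), ignoring metadata and the head layer."""
    return num_weights * bpw / 8 / 1e9


# Hypothetical mix: half of the weights at 3 bits, half at 5 bits.
mix = [(35e9, 3.0), (35e9, 5.0)]
bpw = average_bpw(mix)  # 4.0 bits per weight on average
print(bpw, approx_size_gb(70e9, bpw))
```

With these assumed numbers the estimate lands at about 35 GB, in the same range as the sizes listed for the 4.0 bpw models in the table.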

	## Prompt template (official):

	```
	A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: {prompt} ASSISTANT:
	```

	## Prompt template (suggested):

	```
	A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
	USER:
	{prompt}
	ASSISTANT:


	```
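
The two templates can be filled in programmatically. The following is a minimal Python sketch; the `build_prompt` helper and its `style` parameter are hypothetical names, and the single trailing newline after `ASSISTANT:` in the suggested style is an assumption about the template above.

```python
SYSTEM = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions."
)


def build_prompt(user_message, style="official"):
    """Format a single-turn prompt using one of the two templates."""
    if style == "official":
        # Everything on one line, as in the official template
        return f"{SYSTEM} USER: {user_message} ASSISTANT:"
    if style == "suggested":
        # Role markers on their own lines; trailing newline is an assumption
        return f"{SYSTEM}\nUSER:\n{user_message}\nASSISTANT:\n"
    raise ValueError(f"unknown style: {style!r}")


print(build_prompt("What is EXL2?", style="suggested"))
```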

	## Quantization process:

	\| Original Model \| → \| (optional) float16 or bfloat16 Model* \| → \| Safetensors Model** \| → \| EXL2 Model \|
	\| -------------- \| --- \| ------------- \| --- \| ---------------- \| --- \| ---------- \|
	\| [WizardLM 70B V1.0](https://huggingface.co/WizardLM/WizardLM-70B-V1.0) \| → \| [WizardLM 70B V1.0-HF](https://huggingface.co/simsim314/WizardLM-70B-V1.0-HF)* \| → \| Safetensors** \| → \| EXL2 \|

	Example to convert WizardLM-70B-V1.0-HF to EXL2 4.0 bpw with 6-bit head:

	```
	mkdir -p ~/EXL2/WizardLM-70B-V1.0-HF_4bit # Create the output directory
	python convert.py -i ~/float16_safetensored/WizardLM-70B-V1.0-HF -o ~/EXL2/WizardLM-70B-V1.0-HF_4bit -c ~/EXL2/0000.parquet -b 4.0 -hb 6
	```
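
When producing several bitrates, the same command can be generated in a loop. The Python sketch below only builds the argument lists and uses just the `convert.py` flags shown in the example above (`-i`, `-o`, `-c`, `-b`, `-hb`). The helper name and the output-directory naming scheme are assumptions, the latter modelled on the repository names in the table.

```python
from pathlib import Path


def exl2_convert_command(model_dir, out_root, cal_file, bits, head_bits):
    """Return (output_dir, argv) for one convert.py run; nothing is executed."""
    model_dir = Path(model_dir)
    # Hypothetical naming scheme, e.g. WizardLM-70B-V1.0-HF-4.0bpw-h6-exl2
    out_dir = Path(out_root) / f"{model_dir.name}-{bits}bpw-h{head_bits}-exl2"
    argv = [
        "python", "convert.py",
        "-i", str(model_dir),
        "-o", str(out_dir),
        "-c", str(cal_file),
        "-b", str(bits),
        "-hb", str(head_bits),
    ]
    return out_dir, argv


home = Path.home()
for bits, head_bits in [(4.0, 6), (5.0, 6), (6.0, 6)]:
    out_dir, argv = exl2_convert_command(
        home / "float16_safetensored" / "WizardLM-70B-V1.0-HF",
        home / "EXL2",
        home / "EXL2" / "0000.parquet",
        bits,
        head_bits,
    )
    print(" ".join(argv))
```

To actually run a conversion, create the directory first with `out_dir.mkdir(parents=True, exist_ok=True)` and pass `argv` to `subprocess.run` from the ExLlamaV2 checkout.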

	\* Use the following script to convert your local pytorch_model bin files to float16 (you can also choose bfloat16) + safetensors all in one go:

	- https://github.com/oobabooga/text-generation-webui/blob/main/convert-to-safetensors.py
	(best for sharding and float16/FP16 or bfloat16/BF16 conversion)

	Example to convert [WizardLM 70B V1.0](https://huggingface.co/WizardLM/WizardLM-70B-V1.0) directly to float16 safetensors in 10GB shards:

	```
	python convert-to-safetensors.py ~/original/WizardLM-70B-V1.0 --output ~/float16_safetensored/WizardLM-70B-V1.0 --max-shard-size 10GB
	```

	Use `--bf16` if you'd like to try bfloat16 instead, but note that there are concerns about quantization quality – https://github.com/turboderp/exllamav2/issues/30#issuecomment-1719009289
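
As background on the two 16-bit formats: float16 keeps 10 mantissa bits but has a narrow exponent range, while bfloat16 keeps float32's exponent range with only 7 mantissa bits. The Python sketch below illustrates that trade-off using the standard library only. The helper names are hypothetical, and this is general floating-point behaviour, not an account of the specific concerns raised in the linked issue.

```python
import struct


def to_float16(x):
    """Round x to IEEE float16 and back (raises OverflowError if out of range)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_bfloat16(x):
    """Round x to bfloat16 and back via its float32 bit pattern.

    Uses round-to-nearest-even on the lower 16 bits; intended for ordinary
    finite values only.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


# Precision: float16 stays closer to 0.1 than bfloat16 does
print(abs(to_float16(0.1) - 0.1), abs(to_bfloat16(0.1) - 0.1))

# Range: 70000 is representable (coarsely) in bfloat16 but overflows float16
print(to_bfloat16(70000.0))
try:
    to_float16(70000.0)
except OverflowError as exc:
    print("float16 overflow:", exc)
```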

	\*\* Use any one of the following scripts to convert your local pytorch_model bin files to safetensors:

	- https://github.com/turboderp/exllamav2/blob/master/util/convert_safetensors.py (official ExLlamaV2)
	- https://huggingface.co/Panchovix/airoboros-l2-70b-gpt4-1.4.1-safetensors/blob/main/bin2safetensors/convert.py (recommended)
	- https://gist.github.com/epicfilemcnulty/1f55fd96b08f8d4d6693293e37b4c55e#file-2safetensors-py

	## Further reading:

	- https://mlabonne.github.io/blog/posts/Introduction_to_Weight_Quantization.html