changing order of int8 and bf16
README.md CHANGED
@@ -127,14 +127,14 @@ def main() -> None:
     modified_input_text = f"<human>: {input_text}\n<bot>:"
 ```
 
-Running command for int8 (sub optimal performance, but fast inference time):
-```
-python -m inference_server.cli --model_name sambanovasystems/BLOOMChat-176B-v1 --model_class AutoModelForCausalLM --dtype int8 --deployment_framework hf_accelerate --generate_kwargs '{"do_sample": false, "temperature": 0.8, "repetition_penalty": 1.2, "top_p": 0.9, "max_new_tokens": 512}'
-```
 Running command for bf16
 ```
 python -m inference_server.cli --model_name sambanovasystems/BLOOMChat-176B-v1 --model_class AutoModelForCausalLM --dtype bf16 --deployment_framework hf_accelerate --generate_kwargs '{"do_sample": false, "temperature": 0.8, "repetition_penalty": 1.2, "top_p": 0.9, "max_new_tokens": 512}'
 ```
+Running command for int8 (sub optimal performance, but fast inference time):
+```
+python -m inference_server.cli --model_name sambanovasystems/BLOOMChat-176B-v1 --model_class AutoModelForCausalLM --dtype int8 --deployment_framework hf_accelerate --generate_kwargs '{"do_sample": false, "temperature": 0.8, "repetition_penalty": 1.2, "top_p": 0.9, "max_new_tokens": 512}'
+```
 **DISCLAIMER:** When using int8, the results will be subpar compared to bf16 as the model is being [quantized](https://huggingface.co/blog/hf-bitsandbytes-integration#introduction-to-model-quantization).
 
 ### Suggested Inference Parameters
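For context, a minimal sketch (not part of this commit) of what the two `--dtype` choices roughly correspond to when loading the checkpoint directly with `transformers`; it assumes `transformers`, `accelerate`, and `bitsandbytes` are installed and that enough GPU memory is available for a 176B-parameter model, and it is illustrative rather than the repo's `inference_server` code path:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "sambanovasystems/BLOOMChat-176B-v1"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# bf16: full-quality weights kept in bfloat16 (the recommended path above).
model_bf16 = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# int8: weights quantized to 8-bit via bitsandbytes, trading some output
# quality for a smaller memory footprint (see the DISCLAIMER above).
model_int8 = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,
    device_map="auto",
)
```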