changing order of int8 and bf16
README.md CHANGED
@@ -127,14 +127,14 @@ def main() -> None:
     modified_input_text = f"<human>: {input_text}\n<bot>:"
 ```
 
-Running command for int8 (sub optimal performance, but fast inference time):
-```
-python -m inference_server.cli --model_name sambanovasystems/BLOOMChat-176B-v1 --model_class AutoModelForCausalLM --dtype int8 --deployment_framework hf_accelerate --generate_kwargs '{"do_sample": false, "temperature": 0.8, "repetition_penalty": 1.2, "top_p": 0.9, "max_new_tokens": 512}'
-```
 Running command for bf16
 ```
 python -m inference_server.cli --model_name sambanovasystems/BLOOMChat-176B-v1 --model_class AutoModelForCausalLM --dtype bf16 --deployment_framework hf_accelerate --generate_kwargs '{"do_sample": false, "temperature": 0.8, "repetition_penalty": 1.2, "top_p": 0.9, "max_new_tokens": 512}'
 ```
+Running command for int8 (sub optimal performance, but fast inference time):
+```
+python -m inference_server.cli --model_name sambanovasystems/BLOOMChat-176B-v1 --model_class AutoModelForCausalLM --dtype int8 --deployment_framework hf_accelerate --generate_kwargs '{"do_sample": false, "temperature": 0.8, "repetition_penalty": 1.2, "top_p": 0.9, "max_new_tokens": 512}'
+```
 **DISCLAIMER:** When using int8, the results will be subpar compared to bf16 as the model is being [quantized](https://huggingface.co/blog/hf-bitsandbytes-integration#introduction-to-model-quantization).
 
 ### Suggested Inference Parameters
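For context, a minimal sketch (not part of this commit) of what the two `--dtype` choices roughly correspond to when loading the checkpoint directly with `transformers`; it assumes `transformers`, `accelerate`, and `bitsandbytes` are installed and that enough GPU memory is available for a 176B-parameter model, and it is illustrative rather than the repo's `inference_server` code path:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "sambanovasystems/BLOOMChat-176B-v1"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# bf16: full-quality weights kept in bfloat16 (the recommended path above).
model_bf16 = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# int8: weights quantized to 8-bit via bitsandbytes, trading some output
# quality for a smaller memory footprint (see the DISCLAIMER above).
model_int8 = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,
    device_map="auto",
)
```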