Commit: 59af7ca
Parent(s): 57492d6

Update README.md
README.md (CHANGED)
@@ -135,7 +135,7 @@ Then you just need to run the TGI v2.2.0 (or higher) Docker container as follows
 docker run --gpus all --shm-size 1g -ti -p 8080:80 \
     -v hf_cache:/data \
     -e MODEL_ID=hugging-quants/Meta-Llama-3.1-405B-Instruct-GPTQ-INT4 \
-    -e NUM_SHARD=
+    -e NUM_SHARD=8 \
     -e QUANTIZE=gptq \
     -e HF_TOKEN=$(cat ~/.cache/huggingface/token) \
     -e MAX_INPUT_LENGTH=4000 \
@@ -214,7 +214,7 @@ docker run --runtime nvidia --gpus all --ipc=host -p 8000:8000 \
     vllm/vllm-openai:latest \
     --model hugging-quants/Meta-Llama-3.1-405B-Instruct-GPTQ-INT4 \
     --quantization gptq_marlin \
-    --tensor-parallel-size
+    --tensor-parallel-size 8 \
     --max-model-len 4096
 ```
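The change fills in the missing NUM_SHARD value: with NUM_SHARD=8, TGI shards the GPTQ-INT4 405B weights across 8 GPUs. Once that container is up, a minimal smoke test could look like the sketch below (assuming the container runs locally with the -p 8080:80 mapping from the diff; the prompt text is illustrative):

    # Query TGI's /generate endpoint on the host port mapped above
    curl http://localhost:8080/generate \
        -H "Content-Type: application/json" \
        -d '{"inputs": "What is Deep Learning?", "parameters": {"max_new_tokens": 128}}'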
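Likewise, --tensor-parallel-size 8 tells vLLM to split the model across 8 GPUs. A matching smoke test against vLLM's OpenAI-compatible server (again a sketch, assuming the container runs locally with the -p 8000:8000 mapping shown above):

    # Query vLLM's OpenAI-compatible chat completions endpoint
    curl http://localhost:8000/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{
            "model": "hugging-quants/Meta-Llama-3.1-405B-Instruct-GPTQ-INT4",
            "messages": [{"role": "user", "content": "What is Deep Learning?"}],
            "max_tokens": 128
        }'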