JRosenkranz committed: Merge branch 'main' of https://huggingface.co/ibm-fms/llama3-8b-accelerator
README.md
CHANGED
@@ -67,7 +67,7 @@ docker run --rm \
 	ibm-fms/llama3-8b-accelerator \
 	--token $HF_HUB_TOKEN
 
-# note: if the weights were downloaded separately (not with the above commands), please place them in the HF_HUB_CACHE
+# note: if the weights were downloaded separately (not with the above commands), please place them in the HF_HUB_CACHE directory and refer to them with /models/<model_name>
 docker run -d --rm --gpus all \
 	--name my-tgis-server \
 	-p 8033:8033 \
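To act on that note, here is a minimal sketch of pre-placing downloaded weights, assuming the container mounts the local `HF_HUB_CACHE` at `/models` (the mount wiring and image name below are illustrative assumptions, not taken from this README):

```bash
# assumption: weights were downloaded into the local HF hub cache
export HF_HUB_CACHE=$HOME/.cache/huggingface/hub

# illustrative mount for the docker run above: expose the cache as /models,
# then refer to the weights as /models/<model_name> (image name is a placeholder)
docker run -d --rm --gpus all \
	--name my-tgis-server \
	-p 8033:8033 \
	-v $HF_HUB_CACHE:/models \
	-e HF_HUB_CACHE=/models \
	my-tgis-image
```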
@@ -92,3 +92,78 @@ cd text-generation-inference/integration_tests
 make gen-client
 pip install . --no-cache-dir
 ```
#### Run Sample

```bash
python sample_client.py
```

_Note: the first prompt may be slower, as there is a slight warmup time._
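If you want to poke the running server without `sample_client.py`, a hedged sketch with `grpcurl` follows; the `fmaas.GenerationService/Generate` service name and the request shape are assumptions about the TGIS gRPC API, and the call needs either server reflection or the proto file used by `make gen-client`:

```bash
# hypothetical direct call to the TGIS server started above on port 8033;
# the service name and payload shape are assumptions, verify against the proto
grpcurl -plaintext \
  -d '{"requests": [{"text": "The capital of France is"}]}' \
  localhost:8033 \
  fmaas.GenerationService/Generate
```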
### Minimal Sample

#### Install

```bash
git clone --branch llama_3_variants --single-branch https://github.com/JRosenkranz/fms-extras
(cd fms-extras && pip install -e .)
pip install transformers==4.35.0 sentencepiece numpy
```
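A quick sanity check that the editable install worked; the `fms_extras` module name is assumed from the repository name:

```bash
# verify the editable install is importable (module name assumed to be fms_extras)
python -c "import fms_extras; print('fms_extras imported OK')"
```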
#### Run Sample

##### batch_size=1 (compile + cudagraphs)

```bash
MODEL_PATH=/path/to/llama3/hf/Meta-Llama-3-8B-Instruct
python fms-extras/scripts/paged_speculative_inference.py \
    --architecture=llama3 \
    --variant=8b \
    --model_path=$MODEL_PATH \
    --model_source=hf \
    --tokenizer=$MODEL_PATH \
    --speculator_path=ibm-fms/llama3-8b-accelerator \
    --speculator_source=hf \
    --speculator_variant=3_2b \
    --top_k_tokens_per_head=4,3,2,2 \
    --compile \
    --compile_mode=reduce-overhead
```
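On `--top_k_tokens_per_head=4,3,2,2`: assuming each entry is the number of candidate tokens kept per speculator head (a reading of the flag, not stated in this README), the number of candidate continuations evaluated per step would be the product of the entries:

```bash
# hypothetical reading of --top_k_tokens_per_head=4,3,2,2: each of the
# 4 speculator heads keeps its top-k tokens, so the flattened candidate
# tree holds 4*3*2*2 four-token continuations per decoding step
echo $((4 * 3 * 2 * 2))   # prints 48
```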
##### batch_size=1 (compile)

```bash
MODEL_PATH=/path/to/llama3/hf/Meta-Llama-3-8B-Instruct
python fms-extras/scripts/paged_speculative_inference.py \
    --architecture=llama3 \
    --variant=8b \
    --model_path=$MODEL_PATH \
    --model_source=hf \
    --tokenizer=$MODEL_PATH \
    --speculator_path=ibm-fms/llama3-8b-accelerator \
    --speculator_source=hf \
    --speculator_variant=3_2b \
    --top_k_tokens_per_head=4,3,2,2 \
    --compile
```
##### batch_size=4 (compile)

```bash
MODEL_PATH=/path/to/llama3/hf/Meta-Llama-3-8B-Instruct
python fms-extras/scripts/paged_speculative_inference.py \
    --architecture=llama3 \
    --variant=8b \
    --model_path=$MODEL_PATH \
    --model_source=hf \
    --tokenizer=$MODEL_PATH \
    --speculator_path=ibm-fms/llama3-8b-accelerator \
    --speculator_source=hf \
    --speculator_variant=3_2b \
    --top_k_tokens_per_head=4,3,2,2 \
    --batch_input \
    --compile
```
Sample code can be found [here](https://github.com/foundation-model-stack/fms-extras/pull/24)