Commit 593ddda by JRosenkranz (parent: a24b598): updated samples. File changed: README.md
This model is intended to be used as an accelerator for llama 13B (chat).
Underlying implementation of the Paged Attention KV-Cache and speculator can be found in https://github.com/foundation-model-stack/fms-extras

Production implementation using `fms-extras` can be found in https://github.com/tdoublep/text-generation-inference/tree/speculative-decoding
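As background (not part of either repository above): a speculator accelerates decoding by proposing several tokens per step, which the base model then verifies in one batched forward pass. A minimal sketch of the greedy acceptance rule, with toy stand-in models — `draft_next` and `base_next` are hypothetical callables, not the fms-extras API:

```python
def speculative_step(prefix, draft_next, base_next, k=4):
    """Propose k tokens with the draft model, then keep the longest
    prefix of proposals the base model agrees with (greedy rule)."""
    # Draft model proposes k tokens autoregressively (cheap).
    proposed = []
    ctx = list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        proposed.append(t)
        ctx.append(t)

    # Base model checks each position (in practice: one batched forward).
    accepted = []
    ctx = list(prefix)
    for t in proposed:
        b = base_next(ctx)
        if b != t:             # first disagreement: emit base's token, stop
            accepted.append(b)
            return accepted
        accepted.append(t)     # agreement: this token came "for free"
        ctx.append(t)
    # All proposals accepted; the base model contributes one bonus token.
    accepted.append(base_next(ctx))
    return accepted

# Toy models: draft always emits last+1; base agrees until it sees token 3.
draft = lambda ctx: ctx[-1] + 1
base = lambda ctx: ctx[-1] + 1 if ctx[-1] != 3 else 100
print(speculative_step([1], draft, base, k=4))  # → [2, 3, 100]
```

When draft and base agree often (as a trained speculator aims for), several tokens are committed per base-model forward pass, which is where the speedup comes from.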

## Samples

### Production Server Sample

*To try this out running in a production-like environment, please use the pre-built docker image:*

#### Setup

```bash
docker pull docker-eu-public.artifactory.swg-devops.com/res-zrl-snap-docker-local/tgis-os:spec.7
docker run -d --rm --gpus all \
...
git clone --branch speculative-decoding --single-branch https://github.com/tdoublep/text-generation-inference
cd text-generation-inference/integration_tests
make gen-client
pip install . --no-cache-dir
```

#### Run Sample

```bash
python sample_client.py
```

### Minimal Sample

*To try this out with the fms-native compiled model, please execute the following:*

#### Install

```bash
git clone https://github.com/foundation-model-stack/fms-extras
(cd fms-extras && pip install -e .)
pip install transformers==4.35.0 sentencepiece numpy
```

#### Run Sample

##### batch_size=1 (compile + cudagraphs)

```bash
python fms-extras/scripts/paged_speculative_inference.py \
    --variant=13b \
    --model_path=/path/to/model_weights/llama/13B-F \
    ...
    --compile_mode=reduce-overhead
```

##### batch_size=1 (compile)

```bash
python fms-extras/scripts/paged_speculative_inference.py \
    --variant=13b \
    --model_path=/path/to/model_weights/llama/13B-F \
    ...
    --compile
```

##### batch_size=4 (compile)

```bash
python fms-extras/scripts/paged_speculative_inference.py \
    --variant=13b \
    --model_path=/path/to/model_weights/llama/13B-F \
    ...
```
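For readers curious about the "Paged Attention KV-Cache" referenced above: instead of one contiguous cache buffer per sequence, the cache is carved into fixed-size blocks handed out from a shared pool and addressed through a per-sequence block table. A minimal sketch of that bookkeeping — illustrative only, with a toy block size; these names are not the fms-extras API:

```python
BLOCK_SIZE = 4  # tokens per physical cache block (toy value)

class PagedKVCache:
    """Toy block-table bookkeeping: maps each sequence's logical token
    positions onto fixed-size physical blocks from a shared free pool."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))   # pool of physical block ids
        self.block_table = {}                 # seq_id -> [physical block ids]
        self.length = {}                      # seq_id -> tokens stored

    def append_token(self, seq_id):
        n = self.length.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:               # current block full: grab a new one
            self.block_table.setdefault(seq_id, []).append(self.free.pop())
        self.length[seq_id] = n + 1

    def slot(self, seq_id, pos):
        """Physical (block_id, offset) holding token `pos` of the sequence."""
        return self.block_table[seq_id][pos // BLOCK_SIZE], pos % BLOCK_SIZE

    def release(self, seq_id):
        # Finished sequences return their blocks to the pool immediately.
        self.free.extend(self.block_table.pop(seq_id, []))
        self.length.pop(seq_id, None)

cache = PagedKVCache(num_blocks=8)
for _ in range(6):                        # 6 tokens -> ceil(6/4) = 2 blocks
    cache.append_token("seq0")
print(len(cache.block_table["seq0"]))     # 2
print(cache.slot("seq0", 5))              # (6, 1): second block, offset 1
```

Because blocks are allocated on demand and returned on release, memory is never reserved for a sequence's maximum length up front, which is what lets a server pack many concurrent sequences into one cache.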