Commit 2532bba · JRosenkranz · Parent(s): 924f16b
updated docker run

README.md CHANGED
@@ -7,6 +7,7 @@ To try this out running in a production-like environment, please use the pre-bui
```bash
docker pull docker-eu-public.artifactory.swg-devops.com/res-zrl-snap-docker-local/tgis-os:spec.7
docker run -d --rm --gpus all \
    --name my-tgis-server \
+   -p 8033:8033 \
    -v /path/to/all/models:/models \
    -e MODEL_NAME=/models/model_weights/llama/13B-F \
    -e SPECULATOR_PATH=/models/speculator_weights/llama/13B-F \
@@ -15,8 +16,17 @@ docker pull docker-eu-public.artifactory.swg-devops.com/res-zrl-snap-docker-loca
    -e DTYPE_STR=float16 \
    docker-eu-public.artifactory.swg-devops.com/res-zrl-snap-docker-local/tgis-os:spec.7

+# check logs and wait for "gRPC server started on port 8033" and "HTTP server started on port 3000"
docker logs my-tgis-server -f
-
+
+# get the client sample (Note: The first prompt will take longer as there is a warmup time)
+conda create -n tgis-env python=3.11
+conda activate tgis-env
+git clone --branch speculative-decoding --single-branch https://github.com/tdoublep/text-generation-inference.git
+cd text-generation-inference/integration_tests
+make gen-client
+pip install . --no-cache-dir
+python sample_client.py
```
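The comment added in this commit says to wait for "gRPC server started on port 8033" and "HTTP server started on port 3000" before sending the first request. If you would rather script that wait than watch `docker logs -f` by hand, a minimal sketch reusing the `my-tgis-server` container name and the log lines above (the retry count and sleep interval are arbitrary):

```bash
# Poll the container logs until both readiness messages appear, or give up after ~5 minutes.
for attempt in $(seq 1 60); do
  if docker logs my-tgis-server 2>&1 | grep -q "gRPC server started on port 8033" \
     && docker logs my-tgis-server 2>&1 | grep -q "HTTP server started on port 3000"; then
    echo "TGIS server is ready"
    break
  fi
  sleep 5
done
```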

To try this out with the fms-native compiled model, please execute the following: