JRosenkranz committed · Commit d95c54e · Parent(s): 0dae012

Update README.md

README.md CHANGED
@@ -1,5 +1,5 @@
 ---
-license:
+license: apache-2.0
 ---
 
 ## Installation from source
@@ -33,7 +33,7 @@ Training is light-weight and can be completed in only a few days depending on ba
 
 _Note: For all samples, your environment must have access to cuda_
 
-### Production
+### Use in IBM Production TGIS
 
 *To try this out running in a production-like environment, please use the pre-built docker image:*
 
@@ -43,7 +43,7 @@ _Note: For all samples, your environment must have access to cuda_
 HF_HUB_CACHE=/hf_hub_cache
 chmod a+w $HF_HUB_CACHE
 HF_HUB_TOKEN="your huggingface hub token"
-TGIS_IMAGE=quay.io/wxpe/text-gen-server:main.
+TGIS_IMAGE=quay.io/wxpe/text-gen-server:main.ddc56ee
 
 docker pull $TGIS_IMAGE
 
@@ -54,7 +54,7 @@ docker run --rm \
     -e TRANSFORMERS_CACHE=/models \
     $TGIS_IMAGE \
     text-generation-server download-weights \
-
+    ibm-granite/granite-7b-lab \
     --token $HF_HUB_TOKEN
 
 # optionally download the speculator model if the weights do not already exist
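For reference, the weights fetched in this hunk can also be pre-fetched without the TGIS entrypoint. A minimal sketch using `huggingface_hub` (an alternative under stated assumptions, not the documented path; the repo IDs and environment variable names are taken from the surrounding blocks):

```python
# Hypothetical alternative to `text-generation-server download-weights`:
# pre-fetch the model and speculator snapshots straight into the hub cache.
import os

from huggingface_hub import snapshot_download

cache = os.environ.get("HF_HUB_CACHE", "/hf_hub_cache")
token = os.environ.get("HF_HUB_TOKEN")

for repo_id in ("ibm-granite/granite-7b-lab", "ibm-granite/granite-7b-lab-accelerator"):
    # Downloads (or reuses) a full snapshot of the repo and returns its local path.
    path = snapshot_download(repo_id, cache_dir=cache, token=token)
    print(f"{repo_id} -> {path}")
```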
@@ -74,7 +74,7 @@ docker run -d --rm --gpus all \
     -v $HF_HUB_CACHE:/models \
     -e HF_HUB_CACHE=/models \
     -e TRANSFORMERS_CACHE=/models \
-    -e MODEL_NAME=
+    -e MODEL_NAME=ibm-granite/granite-7b-lab \
     -e SPECULATOR_NAME=ibm-granite/granite-7b-lab-accelerator \
     -e FLASH_ATTENTION=true \
     -e PAGED_ATTENTION=true \
@@ -101,6 +101,27 @@ python sample_client.py
 
 _Note: first prompt may be slower as there is a slight warmup time_
 
+### Use in Huggingface TGI
+
+#### start the server
+
+```bash
+model=ibm-granite/granite-7b-lab-accelerator
+volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
+docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:latest --model-id $model
+```
+
+_note: for tensor parallel, add --num-shard_
+
+#### make a request
+
+```bash
+curl 127.0.0.1:8080/generate_stream \
+    -X POST \
+    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
+    -H 'Content-Type: application/json'
+```
+
 ### Minimal Sample
 
 *To try this out with the fms-native compiled model, please execute the following:*
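The curl call added above can equally be driven from Python. A minimal sketch of a streaming client (it assumes the TGI container from the preceding block is listening on 127.0.0.1:8080 and that `requests` is installed):

```python
# Stream tokens from TGI's /generate_stream endpoint, which emits
# server-sent events: one "data: {...}" JSON line per generated token.
import json

import requests

payload = {
    "inputs": "What is Deep Learning?",
    "parameters": {"max_new_tokens": 20},
}

with requests.post(
    "http://127.0.0.1:8080/generate_stream",
    json=payload,
    stream=True,
    timeout=60,
) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data:"):
            continue  # skip keep-alives and blank separator lines
        event = json.loads(line[len(b"data:"):])
        print(event["token"]["text"], end="", flush=True)
print()
```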
@@ -118,7 +139,7 @@ pip install transformers==4.35.0 sentencepiece numpy
 ##### batch_size=1 (compile + cudagraphs)
 
 ```bash
-MODEL_PATH=/path/to/
+MODEL_PATH=/path/to/ibm-granite/granite-7b-lab
 python fms-extras/scripts/paged_speculative_inference.py \
     --variant=7b.ibm_instruct_lab \
     --model_path=$MODEL_PATH \
@@ -135,7 +156,7 @@ python fms-extras/scripts/paged_speculative_inference.py \
 ##### batch_size=1 (compile)
 
 ```bash
-MODEL_PATH=/path/to/
+MODEL_PATH=/path/to/ibm-granite/granite-7b-lab
 python fms-extras/scripts/paged_speculative_inference.py \
     --variant=7b.ibm_instruct_lab \
     --model_path=$MODEL_PATH \
@@ -151,7 +172,7 @@ python fms-extras/scripts/paged_speculative_inference.py \
 ##### batch_size=4 (compile)
 
 ```bash
-MODEL_PATH=/path/to/
+MODEL_PATH=/path/to/ibm-granite/granite-7b-lab
 python fms-extras/scripts/paged_speculative_inference.py \
     --variant=7b.ibm_instruct_lab \
     --model_path=$MODEL_PATH \
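For readers unfamiliar with what `paged_speculative_inference.py` is exercising across these batch sizes, the core loop of speculative decoding can be sketched in a few lines. This is a toy illustration with hypothetical greedy next-token callables, not the fms-extras implementation:

```python
# Toy speculative-decoding step: a cheap speculator drafts k tokens, the base
# model verifies them, and the longest agreeing prefix (plus the base model's
# first correction) is kept. All callables here are illustrative stand-ins.
from typing import Callable, List

def speculative_step(
    prefix: List[int],
    speculator: Callable[[List[int]], int],  # cheap greedy next-token guess
    base_model: Callable[[List[int]], int],  # expensive greedy next token
    k: int = 4,
) -> List[int]:
    # Speculator drafts k tokens autoregressively.
    draft: List[int] = []
    for _ in range(k):
        draft.append(speculator(prefix + draft))
    # Verify the draft. (For clarity this calls the base model per position;
    # real implementations verify all k positions in a single batched pass.)
    accepted: List[int] = []
    for tok in draft:
        expected = base_model(prefix + accepted)
        accepted.append(expected)  # the base model's token is always kept
        if tok != expected:
            break  # first disagreement ends the accepted run
    return prefix + accepted

# Demo with toy "models": both predict the next integer in the sequence.
spec = lambda seq: seq[-1] + 1
base = lambda seq: seq[-1] + 1
print(speculative_step([1, 2, 3], spec, base))  # -> [1, 2, 3, 4, 5, 6, 7]
```

When the speculator agrees with the base model, each verified draft yields several tokens for roughly the cost of one base-model pass; the batch_size variants above measure how well that amortization holds up under load.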