Merge branch 'main' of https://huggingface.co/ibm-fms/llama-13b-accelerator
README.md
## Description

This model is intended to be used as an accelerator for [llama 13B (chat)](https://huggingface.co/meta-llama/Llama-2-13b-chat-hf) and takes inspiration from the Medusa speculative decoding architecture. This accelerator modifies the MLP into a multi-stage MLP, where each stage predicts a single token in the draft based on both a state vector and sampled token from the prior stage (the base model can be considered stage 0). The state vector from the base model provides contextual information to the accelerator, while conditioning on prior sampled tokens allows it to produce higher-quality draft n-grams.
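To make the multi-stage structure concrete, below is a minimal, illustrative PyTorch sketch of a speculator of this flavor. It is not the fms-extras implementation; the dimensions, layer choices, and names are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class ToyMLPSpeculator(nn.Module):
    """Illustrative multi-stage MLP speculator (not the fms-extras implementation).

    Stage i predicts draft token i from (a) the state vector carried over from the
    previous stage (stage 0 = the base model) and (b) the embedding of the token
    sampled at the previous stage.
    """

    def __init__(self, vocab_size=32000, hidden_dim=5120, n_stages=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        # One small MLP per stage: combines the prior state with the prior token embedding.
        self.stage_mlps = nn.ModuleList(
            [nn.Sequential(nn.Linear(2 * hidden_dim, hidden_dim), nn.GELU()) for _ in range(n_stages)]
        )
        self.heads = nn.ModuleList(
            [nn.Linear(hidden_dim, vocab_size) for _ in range(n_stages)]
        )

    @torch.no_grad()
    def draft(self, base_state, last_token):
        """base_state: [batch, hidden_dim] state vector from the base model.
        last_token: [batch] token id sampled by the base model.
        Returns a [batch, n_stages] tensor of (greedily) drafted tokens."""
        state, token = base_state, last_token
        draft_tokens = []
        for mlp, head in zip(self.stage_mlps, self.heads):
            state = mlp(torch.cat([state, self.embed(token)], dim=-1))
            token = head(state).argmax(dim=-1)  # next draft token, conditioned on the prior one
            draft_tokens.append(token)
        return torch.stack(draft_tokens, dim=-1)

# Toy usage with random stand-in inputs.
spec = ToyMLPSpeculator(vocab_size=100, hidden_dim=32, n_stages=3)
state = torch.randn(2, 32)           # stand-in for the base model's state vector
last = torch.randint(0, 100, (2,))   # stand-in for the last sampled token
print(spec.draft(state, last).shape)  # torch.Size([2, 3])
```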

Note: The underlying MLP speculator is a generic architecture that can be trained with any generative model to accelerate inference. Training is light-weight and can be completed in only a few days depending on base model size and speed.
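For background on how such draft n-grams are consumed, the sketch below shows generic greedy verification in speculative decoding: the base model scores the whole draft in a single forward pass and only the agreeing prefix is kept. This is illustrative only and is not the fms-extras/TGIS implementation.

```python
import torch

@torch.no_grad()
def verify_draft(base_model, input_ids, draft_tokens):
    """Generic greedy-verification sketch (not the fms-extras/TGIS implementation).

    base_model maps token ids [1, seq] -> logits [1, seq, vocab].
    input_ids: [1, prompt_len] tokens accepted so far.
    draft_tokens: [n_draft] tokens proposed by the speculator.
    Returns the agreeing prefix of the draft plus the base model's own next token,
    all obtained from one forward pass over the candidate sequence.
    """
    candidate = torch.cat([input_ids, draft_tokens.unsqueeze(0)], dim=-1)
    logits = base_model(candidate)
    preds = logits.argmax(dim=-1)[0]                     # greedy choice after every position
    base_choices = preds[input_ids.shape[-1] - 1 : -1]   # base model's choice at each draft slot
    matches = (base_choices == draft_tokens).long()
    n_accept = int(matches.cumprod(dim=0).sum())         # length of the agreeing prefix
    bonus = preds[input_ids.shape[-1] - 1 + n_accept].unsqueeze(0)
    return torch.cat([draft_tokens[:n_accept], bonus])

def toy_base_model(ids, vocab=100):
    # Fake base model whose greedy next-token choice is always (last input token + 1).
    logits = torch.zeros(ids.shape[0], ids.shape[-1], vocab)
    return logits.scatter(-1, ((ids + 1) % vocab).unsqueeze(-1), 1.0)

prompt = torch.tensor([[5, 6, 7]])
draft = torch.tensor([8, 9, 42])  # the first two draft tokens agree with the toy model
print(verify_draft(toy_base_model, prompt, draft))  # tensor([ 8,  9, 10])
```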

## Repository Links

1. [Paged Attention KV-Cache / Speculator](https://github.com/foundation-model-stack/fms-extras)
2. [Production Server with speculative decoding](https://github.com/IBM/text-generation-inference/pull/78)
3. [Speculator training](https://github.com/foundation-model-stack/fms-fsdp/pull/35)

## Samples

_Note: For all samples, your environment must have access to cuda_
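A quick way to confirm the environment can actually see a GPU before running either sample:

```python
import torch

# The samples below assume a CUDA-capable GPU is visible to PyTorch.
assert torch.cuda.is_available(), "These samples require access to cuda"
print(torch.cuda.get_device_name(0))
```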

#### Setup

```bash
docker pull quay.io/wxpe/text-gen-server:speculative-decoding.ecd73c4
docker run -d --rm --gpus all \
    --name my-tgis-server \
    -p 8033:8033 \
    -v /path/to/all/models:/models \
    -e MODEL_NAME=/models/model_weights/llama/13B-F \
    -e SPECULATOR_NAME=/models/speculator_weights/llama/llama-13b-accelerator \
    -e FLASH_ATTENTION=true \
    -e PAGED_ATTENTION=true \
    -e DTYPE_STR=float16 \
    quay.io/wxpe/text-gen-server:speculative-decoding.ecd73c4

# check logs and wait for "gRPC server started on port 8033" and "HTTP server started on port 3000"
docker logs my-tgis-server -f

# get the client sample (Note: The first prompt will take longer as there is a warmup time)
conda create -n tgis-client-env python=3.11
conda activate tgis-client-env
git clone --branch speculative-decoding --single-branch https://github.com/tdoublep/text-generation-inference.git
cd text-generation-inference/integration_tests
make gen-client
pip install . --no-cache-dir

python sample_client.py
```

_Note: first prompt may be slower as there is a slight warmup time_
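The `SPECULATOR_NAME` path above assumes this repository's weights are already present under the mounted `/models` volume. Below is a hedged sketch of fetching them with `huggingface_hub`; the local directory layout is an assumption chosen to match the mount used above.

```python
# Sketch: download the accelerator (speculator) weights so they can be mounted into the
# TGIS container. The local directory is an assumption mirroring the
# -v /path/to/all/models:/models mount and the SPECULATOR_NAME path above.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="ibm-fms/llama-13b-accelerator",
    local_dir="/path/to/all/models/speculator_weights/llama/llama-13b-accelerator",
)
print(f"speculator weights downloaded to: {local_dir}")
```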

### Minimal Sample

*To try this out with the fms-native compiled model, please execute the following:*

```bash
git clone https://github.com/foundation-model-stack/fms-extras
(cd fms-extras && pip install -e .)
pip install transformers==4.35.0 sentencepiece numpy
```

```bash
python fms-extras/scripts/paged_speculative_inference.py \
    ... \
    --speculator_source=hf \
    --batch_input \
    --compile \
```

Sample code can be found [here](https://github.com/foundation-model-stack/fms-extras/blob/main/scripts/paged_speculative_inference.py)
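The `--compile` flag and the warmup notes above are consistent with PyTorch 2 compilation, where the first call pays a one-time compilation cost; the snippet below is a generic illustration of that behavior, not the script's internals.

```python
import torch
import torch.nn as nn

# Generic illustration (assumption: the script's --compile flag enables PyTorch 2
# compilation). torch.compile compiles lazily, so the first call is slower and later
# calls reuse the compiled graph -- matching the warmup notes above.
model = nn.Sequential(nn.Linear(16, 64), nn.GELU(), nn.Linear(64, 16))
compiled = torch.compile(model)

x = torch.randn(4, 16)
y = compiled(x)   # first call triggers compilation (slow)
y = compiled(x)   # subsequent calls are fast
print(y.shape)    # torch.Size([4, 16])
```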