ibm-fms/llama-13b-accelerator

JRosenkranz committed on Apr 5

Commit f4c8757 • 1 Parent(s): 593ddda

Update README.md

Files changed (1)

README.md +10 -0

@@ -6,12 +6,22 @@ license: llama2
 This model is intended to be used as an accelerator for llama 13B (chat).
+It takes inspiration from the Medusa architecture and modifies the MLP into a multi-stage MLP,
+where each stage predicts a single token in the draft. Each stage takes as input both a state
+vector and sampled token embedding from the prior stage (the base model can be considered
+stage 0). The inputs are projected and passed through a LayerNorm/GeLU activation, forming a
+new state vector. This state vector is used to predict the next draft token, which, with the new
+state vector, acts as input for the next stage of prediction. We sample multiple tokens at each
+stage, and emit a tree of candidate suffixes to evaluate in parallel.
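
The multi-stage design described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the `fms-extras` implementation: the names `Stage` and `draft_tree`, the toy dimensions, the random weights, the choice to add the projected state to the token embedding, and the use of top-k selection in place of sampling are all hypothetical.

```python
import math
import random


def matvec(matrix, vec):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def layer_norm(vec, eps=1e-5):
    """LayerNorm without learned scale/shift (a simplification)."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]


def gelu(x):
    """Exact GeLU, expressed with the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def rand_matrix(rng, rows, cols):
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]


class Stage:
    """One speculator stage: predicts a single draft token (hypothetical layout)."""

    def __init__(self, vocab, dim, rng):
        self.emb = rand_matrix(rng, vocab, dim)   # token embedding table
        self.proj = rand_matrix(rng, dim, dim)    # state projection
        self.head = rand_matrix(rng, vocab, dim)  # state -> vocabulary logits

    def step(self, state, token):
        # Combine the projected prior state with the prior token's embedding
        # (assumed combination), then apply LayerNorm and GeLU.
        mixed = [s + e for s, e in zip(matvec(self.proj, state), self.emb[token])]
        new_state = [gelu(x) for x in layer_norm(mixed)]
        return new_state, matvec(self.head, new_state)


def draft_tree(stages, state, token, top_k):
    """Expand top_k[i] tokens at stage i and return every candidate suffix."""
    frontier = [(state, token, [])]
    for stage, k in zip(stages, top_k):
        next_frontier = []
        for st, tok, suffix in frontier:
            new_state, logits = stage.step(st, tok)
            # Top-k is a deterministic stand-in for sampling k tokens.
            best = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
            for cand in best:
                next_frontier.append((new_state, cand, suffix + [cand]))
        frontier = next_frontier
    return [suffix for _, _, suffix in frontier]


rng = random.Random(0)
VOCAB, DIM = 16, 8
stages = [Stage(VOCAB, DIM, rng) for _ in range(3)]
# "Stage 0" is the base model: its final state and sampled token seed the draft.
base_state = [rng.gauss(0.0, 1.0) for _ in range(DIM)]
candidates = draft_tree(stages, base_state, token=3, top_k=[3, 2, 2])
print(len(candidates), "candidate suffixes of length", len(candidates[0]))
```

With three stages and branching factors of 3, 2, and 2, the tree holds 3 × 2 × 2 = 12 candidate suffixes, each three tokens long. In a speculative decoding setup, the base model would then score these candidates in a single parallel pass and keep the longest accepted prefix.
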
 The underlying implementation of the Paged Attention KV-Cache and speculator can be found in https://github.com/foundation-model-stack/fms-extras
 A production implementation using `fms-extras` can be found in https://github.com/tdoublep/text-generation-inference/tree/speculative-decoding
 ## Samples
+_Note: For all samples, your environment must have access to CUDA._
 ### Production Server Sample
 *To try this out running in a production-like environment, please use the pre-built docker image:*