JRosenkranz committed c8b3371
Parent(s): f4c8757

Update README.md

README.md CHANGED
@@ -4,19 +4,18 @@ license: llama2

## Description

- This model
- Production implementation using `fms-extras` implementation can be found in https://github.com/tdoublep/text-generation-inference/tree/speculative-decoding
+ This model is intended to be used as an accelerator for llama 13B (chat). It takes inspiration
+ from the Medusa architecture, modifying the MLP into a multi-stage MLP where each stage predicts
+ a single token in the draft. Each stage takes as input both a state vector and a sampled token
+ embedding from the prior stage (the base model can be considered stage 0). The inputs are
+ projected and passed through a LayerNorm/GeLU activation, forming a new state vector, which is
+ used to predict the next draft token; that token, together with the new state vector, serves as
+ input to the next stage of prediction. We sample multiple tokens at each stage and emit a tree
+ of candidate suffixes to evaluate in parallel.
+
+ ## Code
+
+ - Paged Attention KV-Cache / Speculator implementations: https://github.com/foundation-model-stack/fms-extras
+ - Production server with speculative decoding implementation: https://github.com/tdoublep/text-generation-inference/tree/speculative-decoding

## Samples
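
To make the Description above concrete, here is a minimal sketch of one speculator stage and the tree-style drafting it enables. This is a hypothetical illustration assuming PyTorch; `SpeculatorStage`, `draft_tree`, and all dimensions are invented for the example and are not the fms-extras API.

```python
import torch
import torch.nn as nn


class SpeculatorStage(nn.Module):
    """One draft stage: (prior state, sampled token) -> (logits, new state)."""

    def __init__(self, vocab_size: int, emb_dim: int, state_dim: int):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        # Project the concatenated [prior state; token embedding] pair,
        # then LayerNorm + GeLU to form the new state vector.
        self.proj = nn.Linear(state_dim + emb_dim, state_dim)
        self.norm = nn.LayerNorm(state_dim)
        self.act = nn.GELU()
        self.head = nn.Linear(state_dim, vocab_size)  # next-draft-token logits

    def forward(self, state: torch.Tensor, token: torch.Tensor):
        x = torch.cat([state, self.emb(token)], dim=-1)
        new_state = self.act(self.norm(self.proj(x)))
        return self.head(new_state), new_state


def draft_tree(stages, state, token, k: int = 2):
    """Grow a tree of candidate suffixes, keeping k tokens per stage
    (top-k stands in here for sampling multiple tokens)."""
    frontier = [(state, token, [])]  # (state, last token, suffix so far)
    for stage in stages:
        grown = []
        for st, tok, suffix in frontier:
            logits, new_st = stage(st, tok)
            for t in torch.topk(logits, k, dim=-1).indices.flatten():
                grown.append((new_st, t.view(1), suffix + [int(t)]))
        frontier = grown
    return [suffix for _, _, suffix in frontier]


# Toy sizes for illustration; a real speculator would match the base model's
# hidden size and vocabulary. state0/token0 come from the base model (stage 0).
stages = [SpeculatorStage(vocab_size=128, emb_dim=32, state_dim=32) for _ in range(3)]
state0 = torch.zeros(1, 32)
token0 = torch.tensor([7])
candidates = draft_tree(stages, state0, token0, k=2)  # 2**3 = 8 suffixes of length 3
```

With k candidates per stage and n stages, the tree holds k**n suffixes; each candidate suffix can then be checked against the base model in a single batched forward pass, which is the parallel evaluation the Description refers to.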