Commit 593ddda by JRosenkranz (parent: a24b598): updated samples. File changed: README.md
This model is intended to be used as an accelerator for llama 13B (chat).
Underlying implementation of the Paged Attention KV-Cache and speculator can be found in https://github.com/foundation-model-stack/fms-extras

Production implementation using `fms-extras` can be found in https://github.com/tdoublep/text-generation-inference/tree/speculative-decoding
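As background (not part of either repository above): a speculator accelerates decoding by proposing several tokens per step, which the base model then verifies in one batched forward pass. A minimal sketch of the greedy acceptance rule, with toy stand-in models — `draft_next` and `base_next` are hypothetical callables, not the fms-extras API:

```python
def speculative_step(prefix, draft_next, base_next, k=4):
    """Propose k tokens with the draft model, then keep the longest
    prefix of proposals the base model agrees with (greedy rule)."""
    # Draft model proposes k tokens autoregressively (cheap).
    proposed = []
    ctx = list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        proposed.append(t)
        ctx.append(t)

    # Base model checks each position (in practice: one batched forward).
    accepted = []
    ctx = list(prefix)
    for t in proposed:
        b = base_next(ctx)
        if b != t:             # first disagreement: emit base's token, stop
            accepted.append(b)
            return accepted
        accepted.append(t)     # agreement: this token came "for free"
        ctx.append(t)
    # All proposals accepted; the base model contributes one bonus token.
    accepted.append(base_next(ctx))
    return accepted

# Toy models: draft always emits last+1; base agrees until it sees token 3.
draft = lambda ctx: ctx[-1] + 1
base = lambda ctx: ctx[-1] + 1 if ctx[-1] != 3 else 100
print(speculative_step([1], draft, base, k=4))  # → [2, 3, 100]
```

When draft and base agree often (as a trained speculator aims for), several tokens are committed per base-model forward pass, which is where the speedup comes from.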

## Samples

### Production Server Sample

*To try this out running in a production-like environment, please use the pre-built docker image:*

#### Setup

```bash
docker pull docker-eu-public.artifactory.swg-devops.com/res-zrl-snap-docker-local/tgis-os:spec.7
docker run -d --rm --gpus all \
...
git clone --branch speculative-decoding --single-branch https://github.com/tdoublep/text-generation-inference
cd text-generation-inference/integration_tests
make gen-client
pip install . --no-cache-dir
```

#### Run Sample

```bash
python sample_client.py
```

### Minimal Sample

*To try this out with the fms-native compiled model, please execute the following:*

#### Install

```bash
git clone https://github.com/foundation-model-stack/fms-extras
(cd fms-extras && pip install -e .)
pip install transformers==4.35.0 sentencepiece numpy
```

#### Run Sample

##### batch_size=1 (compile + cudagraphs)

```bash
python fms-extras/scripts/paged_speculative_inference.py \
    --variant=13b \
    --model_path=/path/to/model_weights/llama/13B-F \
    ...
    --compile_mode=reduce-overhead
```

##### batch_size=1 (compile)

```bash
python fms-extras/scripts/paged_speculative_inference.py \
    --variant=13b \
    --model_path=/path/to/model_weights/llama/13B-F \
    ...
    --compile
```

##### batch_size=4 (compile)

```bash
python fms-extras/scripts/paged_speculative_inference.py \
    --variant=13b \
    --model_path=/path/to/model_weights/llama/13B-F \
    ...
```
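For readers curious about the "Paged Attention KV-Cache" referenced above: instead of one contiguous cache buffer per sequence, the cache is carved into fixed-size blocks handed out from a shared pool and addressed through a per-sequence block table. A minimal sketch of that bookkeeping — illustrative only, with a toy block size; these names are not the fms-extras API:

```python
BLOCK_SIZE = 4  # tokens per physical cache block (toy value)

class PagedKVCache:
    """Toy block-table bookkeeping: maps each sequence's logical token
    positions onto fixed-size physical blocks from a shared free pool."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))   # pool of physical block ids
        self.block_table = {}                 # seq_id -> [physical block ids]
        self.length = {}                      # seq_id -> tokens stored

    def append_token(self, seq_id):
        n = self.length.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:               # current block full: grab a new one
            self.block_table.setdefault(seq_id, []).append(self.free.pop())
        self.length[seq_id] = n + 1

    def slot(self, seq_id, pos):
        """Physical (block_id, offset) holding token `pos` of the sequence."""
        return self.block_table[seq_id][pos // BLOCK_SIZE], pos % BLOCK_SIZE

    def release(self, seq_id):
        # Finished sequences return their blocks to the pool immediately.
        self.free.extend(self.block_table.pop(seq_id, []))
        self.length.pop(seq_id, None)

cache = PagedKVCache(num_blocks=8)
for _ in range(6):                        # 6 tokens -> ceil(6/4) = 2 blocks
    cache.append_token("seq0")
print(len(cache.block_table["seq0"]))     # 2
print(cache.slot("seq0", 5))              # (6, 1): second block, offset 1
```

Because blocks are allocated on demand and returned on release, memory is never reserved for a sequence's maximum length up front, which is what lets a server pack many concurrent sequences into one cache.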