cecibas committed
Commit ec21902 · 1 Parent(s): 9e1f53d

Upload 7 files

README.md CHANGED

---
license: llama2
---

## Installation from source

```bash
git clone https://github.com/foundation-model-stack/fms-extras
cd fms-extras
pip install -e .
```

## Description

This model is intended to be used as an accelerator for [Llama 2 13B (chat)](https://huggingface.co/meta-llama/Llama-2-13b-chat-hf) and takes inspiration from the Medusa speculative-decoding architecture. The accelerator modifies the MLP into a multi-stage MLP, where each stage predicts a single token in the draft based on both a state vector and the sampled token from the prior stage (the base model can be considered stage 0). The state vector from the base model provides contextual information to the accelerator, while conditioning on prior sampled tokens allows it to produce higher-quality draft n-grams.
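
To make the stage structure concrete, below is a minimal, illustrative PyTorch sketch of such a multi-stage MLP. It is a simplification under stated assumptions, not the actual fms-extras `MLPSpeculator` implementation; the class name, argument names, and greedy sampling below are all hypothetical, while the default dimensions follow the config.json in this repo.

```python
import torch
import torch.nn as nn

class MLPSpeculatorSketch(nn.Module):
    """Illustrative only: each stage predicts one draft token from the
    running state vector plus the token sampled at the prior stage."""

    def __init__(self, emb_dim=5120, inner_dim=4096, vocab_size=32000, n_predict=3):
        super().__init__()
        self.emb = nn.ModuleList(
            nn.Embedding(vocab_size, inner_dim) for _ in range(n_predict))
        self.proj = nn.ModuleList(
            nn.Linear(emb_dim if i == 0 else inner_dim, inner_dim, bias=False)
            for i in range(n_predict))
        self.head = nn.ModuleList(
            nn.Linear(inner_dim, vocab_size, bias=False) for _ in range(n_predict))
        self.act = nn.GELU()

    def forward(self, state, last_token):
        # state: [batch, emb_dim] base-model state vector (stage 0)
        # last_token: [batch] token sampled at the prior stage
        logits_per_stage = []
        for i in range(len(self.head)):
            # fold the previously sampled token back into the state, then
            # predict the next draft token from the updated state
            state = self.act(self.proj[i](state) + self.emb[i](last_token))
            logits = self.head[i](state)
            last_token = logits.argmax(-1)  # greedy here; top-k trees in practice
            logits_per_stage.append(logits)
        return torch.stack(logits_per_stage, dim=1)  # [batch, n_predict, vocab]
```

Because each stage conditions on the token actually sampled at the previous stage, later draft positions stay consistent with earlier ones, which is what makes the drafted n-grams higher quality than independent per-position guesses.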

Note: The underlying MLP speculator is a generic architecture that can be trained with any generative model to accelerate inference. Training is lightweight and can be completed in only a few days, depending on base model size and speed.
25
+
26
+ ## Repository Links
27
+
28
+ 1. [Paged Attention KV-Cache / Speculator](https://github.com/foundation-model-stack/fms-extras)
29
+ 2. [Production Server with speculative decoding](https://github.com/IBM/text-generation-inference.git)
30
+ 3. [Speculator training](https://github.com/foundation-model-stack/fms-fsdp/pull/35)
31
+

## Samples

_Note: For all samples, your environment must have access to CUDA._

### Use in IBM Production TGIS

*To try this out running in a production-like environment, please use the pre-built Docker image:*

#### Setup

```bash
HF_HUB_CACHE=/hf_hub_cache
chmod a+w $HF_HUB_CACHE
HF_HUB_TOKEN="your huggingface hub token"
TGIS_IMAGE=quay.io/wxpe/text-gen-server:main.ddc56ee

docker pull $TGIS_IMAGE

# optionally download llama-2-13b-chat if the weights do not already exist
docker run --rm \
    -v $HF_HUB_CACHE:/models \
    -e HF_HUB_CACHE=/models \
    -e TRANSFORMERS_CACHE=/models \
    $TGIS_IMAGE \
    text-generation-server download-weights \
    meta-llama/Llama-2-13b-chat-hf \
    --token $HF_HUB_TOKEN

# optionally download the speculator model if the weights do not already exist
docker run --rm \
    -v $HF_HUB_CACHE:/models \
    -e HF_HUB_CACHE=/models \
    -e TRANSFORMERS_CACHE=/models \
    $TGIS_IMAGE \
    text-generation-server download-weights \
    ibm-fms/llama-13b-accelerator \
    --token $HF_HUB_TOKEN

# note: if the weights were downloaded separately (not with the above commands),
# place them in the HF_HUB_CACHE directory and refer to them as /models/<model_name>
docker run -d --rm --gpus all \
    --name my-tgis-server \
    -p 8033:8033 \
    -v $HF_HUB_CACHE:/models \
    -e HF_HUB_CACHE=/models \
    -e TRANSFORMERS_CACHE=/models \
    -e MODEL_NAME=meta-llama/Llama-2-13b-chat-hf \
    -e SPECULATOR_NAME=ibm-fms/llama-13b-accelerator \
    -e FLASH_ATTENTION=true \
    -e PAGED_ATTENTION=true \
    -e DTYPE=float16 \
    $TGIS_IMAGE

# check logs and wait for "gRPC server started on port 8033" and "HTTP server started on port 3000"
docker logs my-tgis-server -f

# get the client sample
conda create -n tgis-client-env python=3.11
conda activate tgis-client-env
git clone --branch main --single-branch https://github.com/IBM/text-generation-inference.git
cd text-generation-inference/integration_tests
make gen-client
pip install . --no-cache-dir
```

#### Run Sample

```bash
python sample_client.py
```

_Note: The first prompt may be slower, as there is a slight warmup time._

### Use in Hugging Face TGI

#### Start the server

```bash
model=ibm-fms/llama-13b-accelerator
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:latest --model-id $model
```

_Note: for tensor parallelism, add `--num-shard`._

#### Make a request

```bash
curl 127.0.0.1:8080/generate_stream \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'
```
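
The same endpoint can be consumed from Python. The sketch below is a hedged example using the `requests` library against TGI's server-sent-events stream; the URL and request body mirror the curl call above, while the parsed field (`token.text`) follows TGI's documented streaming format and should be treated as an assumption to adapt.

```python
# Minimal Python client for the TGI /generate_stream endpoint shown above.
# Assumes TGI's SSE framing: each event line looks like "data:{json}".
import json
import requests

resp = requests.post(
    "http://127.0.0.1:8080/generate_stream",
    json={"inputs": "What is Deep Learning?",
          "parameters": {"max_new_tokens": 20}},
    stream=True,
    timeout=60,
)
resp.raise_for_status()

for line in resp.iter_lines():
    if not line or not line.startswith(b"data:"):
        continue  # skip blank separators between events
    event = json.loads(line[len(b"data:"):])
    print(event["token"]["text"], end="", flush=True)
print()
```

The request shape is identical to the curl call: speculative decoding happens server-side and is transparent to the client.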

### Minimal Sample

*To try this out with the fms-native compiled model, please execute the following:*

#### Install

```bash
git clone https://github.com/foundation-model-stack/fms-extras
(cd fms-extras && pip install -e .)
pip install transformers==4.35.0 sentencepiece numpy
```

#### Run Sample

##### batch_size=1 (compile + cudagraphs)

```bash
MODEL_PATH=/path/to/llama/hf/13B-F
python fms-extras/scripts/paged_speculative_inference.py \
    --variant=13b \
    --model_path=$MODEL_PATH \
    --model_source=hf \
    --tokenizer=$MODEL_PATH \
    --speculator_path=ibm-fms/llama-13b-accelerator \
    --speculator_source=hf \
    --speculator_variant=840m \
    --compile \
    --compile_mode=reduce-overhead
```

##### batch_size=1 (compile)

```bash
MODEL_PATH=/path/to/llama/hf/13B-F
python fms-extras/scripts/paged_speculative_inference.py \
    --variant=13b \
    --model_path=$MODEL_PATH \
    --model_source=hf \
    --tokenizer=$MODEL_PATH \
    --speculator_path=ibm-fms/llama-13b-accelerator \
    --speculator_source=hf \
    --speculator_variant=840m \
    --compile
```

##### batch_size=4 (compile)

```bash
MODEL_PATH=/path/to/llama/hf/13B-F
python fms-extras/scripts/paged_speculative_inference.py \
    --variant=13b \
    --model_path=$MODEL_PATH \
    --model_source=hf \
    --tokenizer=$MODEL_PATH \
    --speculator_path=ibm-fms/llama-13b-accelerator \
    --speculator_source=hf \
    --speculator_variant=840m \
    --batch_input \
    --compile
```

config.json ADDED

```json
{
  "base_model_name_or_path": "meta-llama/Llama-2-13b-chat-hf",
  "architectures": [
    "MLPSpeculatorPreTrainedModel"
  ],
  "emb_dim": 5120,
  "inner_dim": 4096,
  "model_type": "mlp_speculator",
  "n_candidates": 5,
  "n_predict": 3,
  "top_k_tokens_per_head": [
    5,
    3,
    2
  ],
  "torch_dtype": "float16",
  "transformers_version": "4.35.0",
  "vocab_size": 32000
}
```
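
A hedged reading of the decoding-related fields (my interpretation of the MLP-speculator design described above, not something stated in this card): each of the `n_predict = 3` stages keeps a per-stage top-k of `[5, 3, 2]` tokens, and `n_candidates = 5` draft sequences are submitted to the base model for verification per step.

```python
# Illustrative arithmetic only; the field semantics are assumptions.
import json
import math

cfg = json.load(open("config.json"))
top_k = cfg["top_k_tokens_per_head"]   # [5, 3, 2]
tree_paths = math.prod(top_k)          # 5 * 3 * 2 = 30 possible draft paths
print(f"{tree_paths} possible draft paths per step, "
      f"{cfg['n_candidates']} candidates verified, "
      f"each {cfg['n_predict']} tokens long")
```
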
model.safetensors ADDED

```
version https://git-lfs.github.com/spec/v1
oid sha256:de321baca8edf17872498e6ce2721870745f41baf33473a49717a52fa4a79f28
size 1681966568
```

special_tokens_map.json ADDED

```json
{
  "bos_token": {
    "content": "<s>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "</s>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "<unk>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  }
}
```

tokenizer.json ADDED
The diff for this file is too large to render.

tokenizer.model ADDED

```
version https://git-lfs.github.com/spec/v1
oid sha256:9e556afd44213b6bd1be2b850ebbbd98f5481437a8021afaf58ee7fb1818d347
size 499723
```

tokenizer_config.json ADDED

```json
{
  "added_tokens_decoder": {
    "0": {
      "content": "<unk>",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "1": {
      "content": "<s>",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "2": {
      "content": "</s>",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "bos_token": "<s>",
  "clean_up_tokenization_spaces": false,
  "eos_token": "</s>",
  "model_max_length": 1000000000000000019884624838656,
  "pad_token": null,
  "sp_model_kwargs": {},
  "tokenizer_class": "LlamaTokenizer",
  "unk_token": "<unk>",
  "use_default_system_prompt": false
}
```
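
As a quick sanity check, the tokenizer files above should load through the standard `transformers` API. This usage sketch is an assumption (the repo id comes from this card; nothing here is a documented example):

```python
# Hedged usage sketch: load the bundled Llama tokenizer and confirm the
# special tokens declared in special_tokens_map.json / tokenizer_config.json.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("ibm-fms/llama-13b-accelerator")
print(tok.bos_token, tok.eos_token, tok.unk_token)  # <s> </s> <unk>

ids = tok("What is Deep Learning?").input_ids
print(ids[0] == tok.bos_token_id)  # True: BOS is prepended by default
```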