sahilsuneja committed on
Commit
057ee41
1 Parent(s): 859b3b2

adding model weights

README.md CHANGED
@@ -1,3 +1,143 @@
---
license: apache-2.0
---

## Description

This model is intended to be used as an accelerator for [Granite-3.0-8b-instruct](https://huggingface.co/ibm-granite/granite-3.0-8b-instruct) and takes inspiration from the Medusa speculative decoding architecture.
This accelerator modifies the MLP into a multi-stage MLP, where each stage predicts a single token in the draft based on both a state vector and the token sampled from the prior stage (the base model can be considered stage 0).
The state vector from the base model provides contextual information to the accelerator, while conditioning on prior sampled tokens allows it to produce higher-quality draft n-grams.

Note: The underlying MLP speculator is a generic architecture that can be trained with any generative model to accelerate inference.
Training is lightweight and can be completed in only a few days, depending on base model size and speed.

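For intuition, the sketch below shows what one such speculator stage could look like. It is a hypothetical illustration: the layer structure, activation, and dimension names are assumptions chosen to match the description above, not the exact released implementation (see the fms-extras repository linked below for the real code).

```python
import torch
import torch.nn as nn

class SpeculatorStageSketch(nn.Module):
    """Illustrative single draft stage: combines the prior stage's state vector
    with the embedding of the previously sampled token to score the next token."""

    def __init__(self, emb_dim=4096, inner_dim=4096, vocab_size=49155):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, inner_dim)          # embeds the prior sampled token
        self.state_proj = nn.Linear(emb_dim, inner_dim, bias=False)   # projects the prior state vector
        self.norm = nn.LayerNorm(inner_dim)
        self.act = nn.GELU()
        self.head = nn.Linear(inner_dim, vocab_size, bias=False)      # scores candidate draft tokens

    def forward(self, state: torch.Tensor, prev_token: torch.Tensor):
        # state: (batch, emb_dim) hidden state from the prior stage (stage 0 = the base model)
        # prev_token: (batch,) token id sampled at the prior stage
        h = self.state_proj(state) + self.token_emb(prev_token)
        h = self.act(self.norm(h))
        logits = self.head(h)   # distribution over the next draft token
        return logits, h        # h becomes the state vector for the next stage
```

At inference time, several such stages run back to back to propose a draft n-gram, which the base model then verifies in a single forward pass, as in standard speculative decoding.
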
## Repository Links

1. [Paged Attention KV-Cache / Speculator](https://github.com/foundation-model-stack/fms-extras)
2. [Production Server with speculative decoding](https://github.com/IBM/text-generation-inference.git)
3. [Speculator training](https://github.com/foundation-model-stack/fms-fsdp.git)

## Samples

_Note: For all samples, your environment must have access to CUDA._

### Use in IBM Production TGIS

*To try this out in a production-like environment, please use the pre-built Docker image:*

#### Setup

```bash
HF_HUB_CACHE=/hf_hub_cache
chmod a+w $HF_HUB_CACHE
HF_HUB_TOKEN="your huggingface hub token"
TGIS_IMAGE=quay.io/wxpe/text-gen-server:main.ddc56ee

docker pull $TGIS_IMAGE

# optionally download granite-3.0-8b-instruct if the weights do not already exist
docker run --rm \
    -v $HF_HUB_CACHE:/models \
    -e HF_HUB_CACHE=/models \
    -e TRANSFORMERS_CACHE=/models \
    $TGIS_IMAGE \
    text-generation-server download-weights \
    ibm-granite/granite-3.0-8b-instruct \
    --token $HF_HUB_TOKEN

# optionally download the speculator model if the weights do not already exist
docker run --rm \
    -v $HF_HUB_CACHE:/models \
    -e HF_HUB_CACHE=/models \
    -e TRANSFORMERS_CACHE=/models \
    $TGIS_IMAGE \
    text-generation-server download-weights \
    ibm-granite/granite-3.0-8b-instruct-accelerator \
    --token $HF_HUB_TOKEN

# note: if the weights were downloaded separately (not with the above commands), place them in the HF_HUB_CACHE directory and refer to them as /models/<model_name>
docker run -d --rm --gpus all \
    --name my-tgis-server \
    -p 8033:8033 \
    -v $HF_HUB_CACHE:/models \
    -e HF_HUB_CACHE=/models \
    -e TRANSFORMERS_CACHE=/models \
    -e MODEL_NAME=ibm-granite/granite-3.0-8b-instruct \
    -e SPECULATOR_NAME=ibm-granite/granite-3.0-8b-instruct-accelerator \
    -e FLASH_ATTENTION=true \
    -e PAGED_ATTENTION=true \
    -e DTYPE=float16 \
    $TGIS_IMAGE

# check logs and wait for "gRPC server started on port 8033" and "HTTP server started on port 3000"
docker logs my-tgis-server -f

# get the client sample (note: the first prompt will take longer as there is a warmup time)
conda create -n tgis-client-env python=3.11
conda activate tgis-client-env
git clone --branch main --single-branch https://github.com/IBM/text-generation-inference.git
cd text-generation-inference/integration_tests
make gen-client
pip install . --no-cache-dir
```

#### Run Sample

```bash
python sample_client.py
```

_Note: the first prompt may be slower as there is a slight warmup time._

### Use in Hugging Face TGI

#### Start the server

```bash
model=ibm-granite/granite-3.0-8b-instruct-accelerator
volume=$PWD/data  # share a volume with the Docker container to avoid downloading weights every run

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:latest --model-id $model
```

_Note: for tensor parallelism, add --num-shard._

#### Make a request

```bash
curl 127.0.0.1:8080/generate_stream \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'
```

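A Python equivalent of the request above may be convenient for quick testing. This minimal sketch targets the non-streaming `/generate` endpoint that standard TGI deployments expose alongside `/generate_stream` (endpoint availability is assumed, not stated in the original README):

```python
import requests

# Same payload as the curl example, sent to the non-streaming endpoint.
resp = requests.post(
    "http://127.0.0.1:8080/generate",
    json={"inputs": "What is Deep Learning?", "parameters": {"max_new_tokens": 20}},
    headers={"Content-Type": "application/json"},
    timeout=60,
)
print(resp.json())
```
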
### Use in vLLM

```python
from vllm import LLM, SamplingParams

# Sample prompts.
prompts = [
    "The president of the United States is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.0)

# Create an LLM.
llm = LLM(
    model="/path/to/granite-3.0-8b-instruct",
    tensor_parallel_size=4,
    speculative_model="/path/to/granite-3.0-8b-instruct-accelerator",
    speculative_draft_tensor_parallel_size=1,
    use_v2_block_manager=True,
)
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
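
If you prefer vLLM's OpenAI-compatible server over the offline `LLM` API, the same engine arguments are generally exposed as CLI flags. The command below is a hedged sketch: it assumes your vLLM version provides `vllm serve` and maps the Python arguments above to dashed flags, so check `vllm serve --help` for the exact names.

```bash
# Sketch only: flag names assume a 1:1 mapping from the engine arguments used above.
vllm serve /path/to/granite-3.0-8b-instruct \
    --tensor-parallel-size 4 \
    --speculative-model /path/to/granite-3.0-8b-instruct-accelerator \
    --speculative-draft-tensor-parallel-size 1 \
    --use-v2-block-manager
```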
added_tokens.json ADDED
@@ -0,0 +1,5 @@
{
  "<|end_of_role|>": 49153,
  "<|start_of_role|>": 49152,
  "<|tool_call|>": 49154
}
config.json ADDED
@@ -0,0 +1,21 @@
{
  "architectures": [
    "MLPSpeculatorPreTrainedModel"
  ],
  "emb_dim": 4096,
  "inner_dim": 4096,
  "model_type": "mlp_speculator",
  "n_candidates": 4,
  "n_predict": 4,
  "scale_input": true,
  "tie_weights": true,
  "top_k_tokens_per_head": [
    4,
    3,
    2,
    2
  ],
  "torch_dtype": "bfloat16",
  "transformers_version": "4.41.2",
  "vocab_size": 49155
}
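
For orientation, the speculator fields in this config can be inspected directly. The interpretations in the comments (`n_predict` as the number of draft tokens proposed per base-model step, `top_k_tokens_per_head` as the per-stage candidate fan-out) follow the general MLP speculator design and are assumptions rather than official documentation; the sketch also assumes config.json is in the current directory.

```python
import json

with open("config.json") as f:
    cfg = json.load(f)

print(cfg["n_predict"])              # 4 -> draft tokens proposed per verification step (assumed meaning)
print(cfg["n_candidates"])           # 4 -> candidate sequences kept for verification (assumed meaning)
print(cfg["top_k_tokens_per_head"])  # [4, 3, 2, 2] -> top-k expansion at each stage (assumed meaning)
print(cfg["emb_dim"], cfg["inner_dim"], cfg["vocab_size"])  # 4096 4096 49155
```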
merges.txt ADDED
The diff for this file is too large to render.
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:be9dd56e96508c2592ac857ceea862ea29f674a6cba6910bb23ba4462bb2ee13
size 3355712106
special_tokens_map.json ADDED
@@ -0,0 +1,35 @@
{
  "additional_special_tokens": [
    "<|start_of_role|>",
    "<|end_of_role|>",
    "<|tool_call|>"
  ],
  "bos_token": {
    "content": "<|end_of_text|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "<|end_of_text|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "<|end_of_text|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "<|end_of_text|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
tokenizer.json ADDED
The diff for this file is too large to render.
tokenizer_config.json ADDED
@@ -0,0 +1,198 @@
{
  "add_bos_token": false,
  "add_prefix_space": false,
  "added_tokens_decoder": {
    "0": {"content": "<|end_of_text|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
    "1": {"content": "<fim_prefix>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
    "2": {"content": "<fim_middle>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
    "3": {"content": "<fim_suffix>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
    "4": {"content": "<fim_pad>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
    "5": {"content": "<filename>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
    "6": {"content": "<gh_stars>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
    "7": {"content": "<issue_start>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
    "8": {"content": "<issue_comment>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
    "9": {"content": "<issue_closed>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
    "10": {"content": "<jupyter_start>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
    "11": {"content": "<jupyter_text>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
    "12": {"content": "<jupyter_code>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
    "13": {"content": "<jupyter_output>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
    "14": {"content": "<empty_output>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
    "15": {"content": "<commit_before>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
    "16": {"content": "<commit_msg>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
    "17": {"content": "<commit_after>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
    "18": {"content": "<reponame>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
    "49152": {"content": "<|start_of_role|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
    "49153": {"content": "<|end_of_role|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
    "49154": {"content": "<|tool_call|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true}
  },
  "additional_special_tokens": [
    "<|start_of_role|>",
    "<|end_of_role|>",
    "<|tool_call|>"
  ],
  "bos_token": "<|end_of_text|>",
  "chat_template": "{%- if tools %}\n {{- '<|start_of_role|>available_tools<|end_of_role|>\n' }}\n {%- for tool in tools %}\n {{- tool | tojson(indent=4) }}\n {%- if not loop.last %}\n {{- '\n\n' }}\n {%- endif %}\n {%- endfor %}\n {{- '<|end_of_text|>\n' }}\n{%- endif %}\n{%- for message in messages %}\n {%- if message['role'] == 'system' %}\n {{- '<|start_of_role|>system<|end_of_role|>' + message['content'] + '<|end_of_text|>\n' }}\n {%- elif message['role'] == 'user' %}\n {{- '<|start_of_role|>user<|end_of_role|>' + message['content'] + '<|end_of_text|>\n' }}\n {%- elif message['role'] == 'assistant' %}\n {{- '<|start_of_role|>assistant<|end_of_role|>' + message['content'] + '<|end_of_text|>\n' }}\n {%- elif message['role'] == 'assistant_tool_call' %}\n {{- '<|start_of_role|>assistant<|end_of_role|><|tool_call|>' + message['content'] + '<|end_of_text|>\n' }}\n {%- elif message['role'] == 'tool_response' %}\n {{- '<|start_of_role|>tool_response<|end_of_role|>' + message['content'] + '<|end_of_text|>\n' }}\n {%- endif %}\n {%- if loop.last and add_generation_prompt %}\n {{- '<|start_of_role|>assistant<|end_of_role|>' }}\n {%- endif %}\n{%- endfor %}",
  "clean_up_tokenization_spaces": true,
  "eos_token": "<|end_of_text|>",
  "errors": "replace",
  "model_max_length": 9223372036854775807,
  "pad_token": "<|end_of_text|>",
  "padding_side": "left",
  "tokenizer_class": "GPT2Tokenizer",
  "unk_token": "<|end_of_text|>",
  "vocab_size": 49152
}
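
Because the commit also ships the tokenizer with the chat template above, prompts can be rendered with the standard transformers API. A minimal sketch, assuming the template is loaded from this repository (the base Granite model's tokenizer would work the same way):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("ibm-granite/granite-3.0-8b-instruct-accelerator")
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is speculative decoding?"},
]
# Renders the conversation using the chat_template defined in tokenizer_config.json
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```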
vocab.json ADDED
The diff for this file is too large to render.