sahilsuneja committed
Commit 057ee41 • 1 Parent(s): 859b3b2

adding model weights

Browse files:
- README.md +140 -0
- added_tokens.json +5 -0
- config.json +21 -0
- merges.txt +0 -0
- pytorch_model.bin +3 -0
- special_tokens_map.json +35 -0
- tokenizer.json +0 -0
- tokenizer_config.json +198 -0
- vocab.json +0 -0
README.md
CHANGED
@@ -1,3 +1,143 @@
---
license: apache-2.0
---

## Description

This model is intended to be used as an accelerator for [Granite-3.0-8b-instruct](https://huggingface.co/ibm-granite/granite-3.0-8b-instruct) and takes inspiration from the Medusa speculative decoding architecture. This accelerator modifies the MLP into a multi-stage MLP, where each stage predicts a single token in the draft based on both a state vector and the sampled token from the prior stage (the base model can be considered stage 0). The state vector from the base model provides contextual information to the accelerator, while conditioning on prior sampled tokens allows it to produce higher-quality draft n-grams.

Note: The underlying MLP speculator is a generic architecture that can be trained with any generative model to accelerate inference. Training is lightweight and can be completed in only a few days, depending on base model size and speed.
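The stage structure described above can be sketched in a few lines of PyTorch. This is purely an illustrative toy, not the released implementation (which lives in the fms-extras repository linked below); the class name, activation, and normalization choices here are assumptions.

```python
# Illustrative sketch of a multi-stage MLP speculator; the real
# implementation is in fms-extras. Names and layer choices are assumptions.
import torch
import torch.nn as nn

class MLPSpeculatorSketch(nn.Module):
    def __init__(self, emb_dim=4096, inner_dim=4096, vocab_size=49155, n_predict=4):
        super().__init__()
        self.emb = nn.ModuleList(nn.Embedding(vocab_size, inner_dim) for _ in range(n_predict))
        self.proj = nn.ModuleList(
            nn.Linear(emb_dim if i == 0 else inner_dim, inner_dim, bias=False)
            for i in range(n_predict)
        )
        self.ln = nn.ModuleList(nn.LayerNorm(inner_dim) for _ in range(n_predict))
        self.head = nn.ModuleList(nn.Linear(inner_dim, vocab_size, bias=False) for _ in range(n_predict))

    def forward(self, state, last_token):
        # state: [batch, emb_dim] hidden state from the base model (stage 0)
        # last_token: [batch] token id sampled at the previous stage
        draft_logits = []
        for emb, proj, ln, head in zip(self.emb, self.proj, self.ln, self.head):
            # condition on both the prior state vector and the prior sampled token
            state = torch.relu(ln(proj(state) + emb(last_token)))
            logits = head(state)                # distribution for this draft position
            last_token = logits.argmax(dim=-1)  # greedy draft; sampling also works
            draft_logits.append(logits)
        return draft_logits
```

At decode time, the base model then verifies the drafted n-gram in a single forward pass and accepts the longest matching prefix, which is where the speedup comes from.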
## Repository Links

1. [Paged Attention KV-Cache / Speculator](https://github.com/foundation-model-stack/fms-extras)
2. [Production Server with speculative decoding](https://github.com/IBM/text-generation-inference.git)
3. [Speculator training](https://github.com/foundation-model-stack/fms-fsdp.git)
## Samples

_Note: For all samples, your environment must have access to CUDA._

### Use in IBM Production TGIS

*To try this out running in a production-like environment, please use the pre-built docker image:*

#### Setup

```bash
HF_HUB_CACHE=/hf_hub_cache
chmod a+w $HF_HUB_CACHE
HF_HUB_TOKEN="your huggingface hub token"
TGIS_IMAGE=quay.io/wxpe/text-gen-server:main.ddc56ee

docker pull $TGIS_IMAGE

# optionally download granite-3.0-8b-instruct if the weights do not already exist
docker run --rm \
    -v $HF_HUB_CACHE:/models \
    -e HF_HUB_CACHE=/models \
    -e TRANSFORMERS_CACHE=/models \
    $TGIS_IMAGE \
    text-generation-server download-weights \
    ibm-granite/granite-3.0-8b-instruct \
    --token $HF_HUB_TOKEN

# optionally download the speculator model if the weights do not already exist
docker run --rm \
    -v $HF_HUB_CACHE:/models \
    -e HF_HUB_CACHE=/models \
    -e TRANSFORMERS_CACHE=/models \
    $TGIS_IMAGE \
    text-generation-server download-weights \
    ibm-granite/granite-3.0-8b-instruct-accelerator \
    --token $HF_HUB_TOKEN

# note: if the weights were downloaded separately (not with the above commands), please place them in the HF_HUB_CACHE directory and refer to them with /models/<model_name>
docker run -d --rm --gpus all \
    --name my-tgis-server \
    -p 8033:8033 \
    -v $HF_HUB_CACHE:/models \
    -e HF_HUB_CACHE=/models \
    -e TRANSFORMERS_CACHE=/models \
    -e MODEL_NAME=ibm-granite/granite-3.0-8b-instruct \
    -e SPECULATOR_NAME=ibm-granite/granite-3.0-8b-instruct-accelerator \
    -e FLASH_ATTENTION=true \
    -e PAGED_ATTENTION=true \
    -e DTYPE=float16 \
    $TGIS_IMAGE

# check logs and wait for "gRPC server started on port 8033" and "HTTP server started on port 3000"
docker logs my-tgis-server -f

# get the client sample (Note: The first prompt will take longer as there is a warmup time)
conda create -n tgis-client-env python=3.11
conda activate tgis-client-env
git clone --branch main --single-branch https://github.com/IBM/text-generation-inference.git
cd text-generation-inference/integration_tests
make gen-client
pip install . --no-cache-dir
```
#### Run Sample

```bash
python sample_client.py
```

_Note: the first prompt may be slower as there is a slight warmup time._
### Use in Hugging Face TGI

#### Start the server

```bash
model=ibm-granite/granite-3.0-8b-instruct-accelerator
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:latest --model-id $model
```

_Note: for tensor parallelism, add `--num-shard`._

#### Make a request

```bash
curl 127.0.0.1:8080/generate_stream \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'
```
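The same request can also be issued from Python. Below is a minimal sketch assuming the server above is up on localhost and the `requests` package is installed; `/generate` is TGI's non-streaming counterpart to `/generate_stream`.

```python
# Sketch: query the TGI server started above from Python.
# Assumes `pip install requests`.
import requests

resp = requests.post(
    "http://127.0.0.1:8080/generate",  # non-streaming variant of /generate_stream
    json={"inputs": "What is Deep Learning?", "parameters": {"max_new_tokens": 20}},
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["generated_text"])
```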
### Use in vLLM

```python
from vllm import LLM, SamplingParams

# Sample prompts.
prompts = [
    "The president of the United States is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.0)

# Create an LLM.
llm = LLM(
    model="/path/to/granite-3.0-8b-instruct",
    tensor_parallel_size=4,
    speculative_model="/path/to/granite-3.0-8b-instruct-accelerator",
    speculative_draft_tensor_parallel_size=1,
    use_v2_block_manager=True,
)
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
added_tokens.json
ADDED
@@ -0,0 +1,5 @@
{
  "<|end_of_role|>": 49153,
  "<|start_of_role|>": 49152,
  "<|tool_call|>": 49154
}
config.json
ADDED
@@ -0,0 +1,21 @@
{
  "architectures": [
    "MLPSpeculatorPreTrainedModel"
  ],
  "emb_dim": 4096,
  "inner_dim": 4096,
  "model_type": "mlp_speculator",
  "n_candidates": 4,
  "n_predict": 4,
  "scale_input": true,
  "tie_weights": true,
  "top_k_tokens_per_head": [
    4,
    3,
    2,
    2
  ],
  "torch_dtype": "bfloat16",
  "transformers_version": "4.41.2",
  "vocab_size": 49155
}
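For reference, the speculation hyperparameters can be pulled straight out of this file; a minimal sketch using only the standard library and assuming a local copy of config.json:

```python
# Sketch: surface the speculation hyperparameters from the config above.
import json

with open("config.json") as f:
    cfg = json.load(f)

print(cfg["n_predict"])              # 4 stages, i.e. 4 draft tokens per step
print(cfg["top_k_tokens_per_head"])  # [4, 3, 2, 2]; the field name suggests a
                                     # per-stage top-k when building candidates
print(cfg["emb_dim"], cfg["vocab_size"])  # 4096-dim state vectors, 49155 tokens
```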
merges.txt
ADDED
The diff for this file is too large to render. See raw diff.
pytorch_model.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:be9dd56e96508c2592ac857ceea862ea29f674a6cba6910bb23ba4462bb2ee13
size 3355712106
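This is a Git LFS pointer: the actual ~3.4 GB weight file is stored out-of-band. Once downloaded, it can be checked against the pointer's oid and size with a short sketch like the one below (the local file path is an assumption about where you saved it).

```python
# Sketch: verify a downloaded pytorch_model.bin against the LFS pointer above.
import hashlib
import os

path = "pytorch_model.bin"  # assumed local download location
expected_oid = "be9dd56e96508c2592ac857ceea862ea29f674a6cba6910bb23ba4462bb2ee13"
expected_size = 3355712106

h = hashlib.sha256()
with open(path, "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):  # hash in 1 MiB chunks
        h.update(chunk)

print("size ok:", os.path.getsize(path) == expected_size)
print("hash ok:", h.hexdigest() == expected_oid)
```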
special_tokens_map.json
ADDED
@@ -0,0 +1,35 @@
{
  "additional_special_tokens": [
    "<|start_of_role|>",
    "<|end_of_role|>",
    "<|tool_call|>"
  ],
  "bos_token": {
    "content": "<|end_of_text|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "<|end_of_text|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "<|end_of_text|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "<|end_of_text|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
tokenizer.json
ADDED
The diff for this file is too large to render. See raw diff.
tokenizer_config.json
ADDED
@@ -0,0 +1,198 @@
{
  "add_bos_token": false,
  "add_prefix_space": false,
  "added_tokens_decoder": {
    "0": {
      "content": "<|end_of_text|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "1": {
      "content": "<fim_prefix>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "2": {
      "content": "<fim_middle>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "3": {
      "content": "<fim_suffix>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "4": {
      "content": "<fim_pad>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "5": {
      "content": "<filename>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "6": {
      "content": "<gh_stars>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "7": {
      "content": "<issue_start>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "8": {
      "content": "<issue_comment>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "9": {
      "content": "<issue_closed>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "10": {
      "content": "<jupyter_start>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "11": {
      "content": "<jupyter_text>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "12": {
      "content": "<jupyter_code>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "13": {
      "content": "<jupyter_output>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "14": {
      "content": "<empty_output>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "15": {
      "content": "<commit_before>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "16": {
      "content": "<commit_msg>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "17": {
      "content": "<commit_after>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "18": {
      "content": "<reponame>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "49152": {
      "content": "<|start_of_role|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "49153": {
      "content": "<|end_of_role|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "49154": {
      "content": "<|tool_call|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "additional_special_tokens": [
    "<|start_of_role|>",
    "<|end_of_role|>",
    "<|tool_call|>"
  ],
  "bos_token": "<|end_of_text|>",
  "chat_template": "{%- if tools %}\n {{- '<|start_of_role|>available_tools<|end_of_role|>\n' }}\n {%- for tool in tools %}\n {{- tool | tojson(indent=4) }}\n {%- if not loop.last %}\n {{- '\n\n' }}\n {%- endif %}\n {%- endfor %}\n {{- '<|end_of_text|>\n' }}\n{%- endif %}\n{%- for message in messages %}\n {%- if message['role'] == 'system' %}\n {{- '<|start_of_role|>system<|end_of_role|>' + message['content'] + '<|end_of_text|>\n' }}\n {%- elif message['role'] == 'user' %}\n {{- '<|start_of_role|>user<|end_of_role|>' + message['content'] + '<|end_of_text|>\n' }}\n {%- elif message['role'] == 'assistant' %}\n {{- '<|start_of_role|>assistant<|end_of_role|>' + message['content'] + '<|end_of_text|>\n' }}\n {%- elif message['role'] == 'assistant_tool_call' %}\n {{- '<|start_of_role|>assistant<|end_of_role|><|tool_call|>' + message['content'] + '<|end_of_text|>\n' }}\n {%- elif message['role'] == 'tool_response' %}\n {{- '<|start_of_role|>tool_response<|end_of_role|>' + message['content'] + '<|end_of_text|>\n' }}\n {%- endif %}\n {%- if loop.last and add_generation_prompt %}\n {{- '<|start_of_role|>assistant<|end_of_role|>' }}\n {%- endif %}\n{%- endfor %}",
  "clean_up_tokenization_spaces": true,
  "eos_token": "<|end_of_text|>",
  "errors": "replace",
  "model_max_length": 9223372036854775807,
  "pad_token": "<|end_of_text|>",
  "padding_side": "left",
  "tokenizer_class": "GPT2Tokenizer",
  "unk_token": "<|end_of_text|>",
  "vocab_size": 49152
}
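As a quick illustration of how the chat_template above formats a conversation, the sketch below loads the tokenizer with transformers and renders a single user turn; the hub id works if you have network access, or point `from_pretrained` at a local copy of this repo's tokenizer files.

```python
# Sketch: render a conversation with the chat_template defined above.
# Assumes `pip install transformers`.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("ibm-granite/granite-3.0-8b-instruct-accelerator")
messages = [{"role": "user", "content": "What is Deep Learning?"}]
text = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(text)
# Expected shape, per the template above:
# <|start_of_role|>user<|end_of_role|>What is Deep Learning?<|end_of_text|>
# <|start_of_role|>assistant<|end_of_role|>
```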
vocab.json
ADDED
The diff for this file is too large to render. See raw diff.