michaelfeil committed
Commit c0b1804
1 Parent(s): 1fd93c6

Upload bigcode/gpt_bigcode-santacoder ctranslate fp16 weights

README.md ADDED
---
tags:
- ctranslate2
- int8
- float16
license: openrail
datasets:
- bigcode/the-stack
language:
- code
programming_language:
- Java
- JavaScript
- Python
pipeline_tag: text-generation
inference: false
model-index:
- name: SantaCoder
  results:
  - task:
      type: text-generation
    dataset:
      type: nuprl/MultiPL-E
      name: MultiPL HumanEval (Python)
    metrics:
    - name: pass@1
      type: pass@1
      value: 0.18
      verified: false
    - name: pass@10
      type: pass@10
      value: 0.29
      verified: false
    - name: pass@100
      type: pass@100
      value: 0.49
      verified: false
  - task:
      type: text-generation
    dataset:
      type: nuprl/MultiPL-E
      name: MultiPL MBPP (Python)
    metrics:
    - name: pass@1
      type: pass@1
      value: 0.35
      verified: false
    - name: pass@10
      type: pass@10
      value: 0.58
      verified: false
    - name: pass@100
      type: pass@100
      value: 0.77
      verified: false
  - task:
      type: text-generation
    dataset:
      type: nuprl/MultiPL-E
      name: MultiPL HumanEval (JavaScript)
    metrics:
    - name: pass@1
      type: pass@1
      value: 0.16
      verified: false
    - name: pass@10
      type: pass@10
      value: 0.27
      verified: false
    - name: pass@100
      type: pass@100
      value: 0.47
      verified: false
  - task:
      type: text-generation
    dataset:
      type: nuprl/MultiPL-E
      name: MultiPL MBPP (Javascript)
    metrics:
    - name: pass@1
      type: pass@1
      value: 0.28
      verified: false
    - name: pass@10
      type: pass@10
      value: 0.51
      verified: false
    - name: pass@100
      type: pass@100
      value: 0.70
      verified: false
  - task:
      type: text-generation
    dataset:
      type: nuprl/MultiPL-E
      name: MultiPL HumanEval (Java)
    metrics:
    - name: pass@1
      type: pass@1
      value: 0.15
      verified: false
    - name: pass@10
      type: pass@10
      value: 0.26
      verified: false
    - name: pass@100
      type: pass@100
      value: 0.41
      verified: false
  - task:
      type: text-generation
    dataset:
      type: nuprl/MultiPL-E
      name: MultiPL MBPP (Java)
    metrics:
    - name: pass@1
      type: pass@1
      value: 0.28
      verified: false
    - name: pass@10
      type: pass@10
      value: 0.44
      verified: false
    - name: pass@100
      type: pass@100
      value: 0.59
      verified: false
  - task:
      type: text-generation
    dataset:
      type: loubnabnl/humaneval_infilling
      name: HumanEval FIM (Python)
    metrics:
    - name: single_line
      type: exact_match
      value: 0.44
      verified: false
  - task:
      type: text-generation
    dataset:
      type: nuprl/MultiPL-E
      name: MultiPL HumanEval FIM (Java)
    metrics:
    - name: single_line
      type: exact_match
      value: 0.62
      verified: false
  - task:
      type: text-generation
    dataset:
      type: nuprl/MultiPL-E
      name: MultiPL HumanEval FIM (JavaScript)
    metrics:
    - name: single_line
      type: exact_match
      value: 0.60
      verified: false
  - task:
      type: text-generation
    dataset:
      type: code_x_glue_ct_code_to_text
      name: CodeXGLUE code-to-text (Python)
    metrics:
    - name: BLEU
      type: bleu
      value: 18.13
      verified: false
---
# Fast-Inference with CTranslate2
Speed up inference while reducing memory by 2x-4x using int8 inference in C++ on CPU or GPU.

Quantized version of [bigcode/gpt_bigcode-santacoder](https://huggingface.co/bigcode/gpt_bigcode-santacoder).
```bash
pip install "hf-hub-ctranslate2>=2.0.8"
```
Converted on 2023-05-23 using:
```bash
ct2-transformers-converter --model bigcode/gpt_bigcode-santacoder --output_dir /home/michael/tmp-ct2fast-gpt_bigcode-santacoder --force --copy_files tokenizer.json README.md tokenizer_config.json special_tokens_map.json .gitattributes --quantization float16
```

Checkpoint compatible with [ctranslate2>=3.13.0](https://github.com/OpenNMT/CTranslate2) and [hf-hub-ctranslate2>=2.0.6](https://github.com/michaelfeil/hf-hub-ctranslate2):
- `compute_type=int8_float16` for `device="cuda"`
- `compute_type=int8` for `device="cpu"`

```python
from hf_hub_ctranslate2 import TranslatorCT2fromHfHub, GeneratorCT2fromHfHub
from transformers import AutoTokenizer

model_name = "michaelfeil/ct2fast-gpt_bigcode-santacoder"
# use either TranslatorCT2fromHfHub or GeneratorCT2fromHfHub here, depending on the model
model = GeneratorCT2fromHfHub(
    # load in int8_float16 on CUDA
    model_name_or_path=model_name,
    device="cuda",
    compute_type="int8_float16",
    # tokenizer=AutoTokenizer.from_pretrained("bigcode/gpt_bigcode-santacoder")
)
outputs = model.generate(
    text=["How do you call a fast Flan-ingo?", "User: How are you doing? Bot:"],
    max_length=64,
)
print(outputs)
```
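
The same loader also runs without a GPU. A minimal sketch for CPU-only inference, using the `compute_type="int8"` setting recommended above (the prompt is illustrative):

```python
from hf_hub_ctranslate2 import GeneratorCT2fromHfHub

# Load the quantized checkpoint on CPU; int8 is the recommended
# compute type for device="cpu" per the compatibility notes above.
model_cpu = GeneratorCT2fromHfHub(
    model_name_or_path="michaelfeil/ct2fast-gpt_bigcode-santacoder",
    device="cpu",
    compute_type="int8",
)
outputs = model_cpu.generate(
    text=["def fibonacci(n):"],  # illustrative code-completion prompt
    max_length=64,
)
print(outputs)
```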

# License and other remarks:
This is just a quantized version. License conditions are intended to be identical to those of the original Hugging Face repo.

# Original description


# SantaCoder

![banner](https://huggingface.co/datasets/bigcode/admin/resolve/main/banner.png)

Play with the model on the [SantaCoder Space Demo](https://huggingface.co/spaces/bigcode/santacoder-demo).

# Table of Contents

1. [Model Summary](#model-summary)
2. [Use](#use)
3. [Limitations](#limitations)
4. [Training](#training)
5. [License](#license)

# Model Summary

This is the Megatron-version of [SantaCoder](https://huggingface.co/bigcode/santacoder).
We refer the reader to the [SantaCoder model page](https://huggingface.co/bigcode/santacoder) for full documentation about this model.

- **Repository:** [bigcode/Megatron-LM](https://github.com/bigcode-project/Megatron-LM)
- **Project Website:** [bigcode-project.org](https://www.bigcode-project.org)
- **Paper:** [🎅SantaCoder: Don't reach for the stars!🌟](https://t.co/YV3pzUbYOr)
- **Point of Contact:** [contact@bigcode-project.org](mailto:contact@bigcode-project.org)
- **Languages:** Python, Java, and JavaScript

There are two versions (branches) of the model:
* `main`: Uses the `gpt_bigcode` model. [Requires the bigcode fork of transformers](https://github.com/bigcode-project/transformers).
* `main_custom`: Packaged with its own modeling code. Requires `transformers>=4.27`.
  Alternatively, it can run on older versions by setting the configuration parameter `activation_function = "gelu_pytorch_tanh"`, as in the sketch below.
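
A hedged sketch of that fallback (`revision` and `trust_remote_code` are standard `transformers` options; treat the exact combination as an assumption, not a tested recipe):

```python
from transformers import AutoConfig, AutoModelForCausalLM

# Load the config of the self-contained `main_custom` branch and
# select the tanh-approximated GELU manually for older transformers.
config = AutoConfig.from_pretrained(
    "bigcode/santacoder", revision="main_custom", trust_remote_code=True
)
config.activation_function = "gelu_pytorch_tanh"  # parameter named in the note above

model = AutoModelForCausalLM.from_pretrained(
    "bigcode/santacoder",
    revision="main_custom",
    config=config,
    trust_remote_code=True,
)
```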

# Use

## Intended use

The model was trained on GitHub code. As such it is _not_ an instruction model, and commands like "Write a function that computes the square root." do not work well.
You should phrase prompts as they occur in source code, such as comments (e.g. `# the following function computes the sqrt`), or write a function signature and docstring and let the model complete the function body, as in the sketch below.
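
A sketch of such a prompt, reusing the quantized `model` from the CTranslate2 example above (the function name and docstring are illustrative):

```python
# A signature plus docstring; the model is left to complete the body.
prompt = (
    "def sqrt_newton(x: float) -> float:\n"
    "    \"\"\"Compute the square root of x using Newton's method.\"\"\"\n"
)
outputs = model.generate(text=[prompt], max_length=128)
print(outputs)
```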

### Attribution & Other Requirements

The pretraining dataset of the model was filtered for permissive licenses only. Nevertheless, the model can generate source code verbatim from the dataset. The code's license might require attribution and/or other specific requirements that must be respected. We provide a [search index](https://huggingface.co/spaces/bigcode/santacoder-search) that lets you search through the pretraining data to identify where generated code came from and apply the proper attribution to your code.

# Limitations

The model has been trained on source code in Python, Java, and JavaScript. The predominant natural language in the source code is English, although other languages are also present. The model can generate code snippets given some context, but the generated code is not guaranteed to work as intended: it can be inefficient and may contain bugs or exploits.

# Training

## Model

- **Architecture:** GPT-2 model with multi-query attention and a Fill-in-the-Middle objective (see the FIM sketch below)
- **Pretraining steps:** 600K
- **Pretraining tokens:** 236 billion
- **Precision:** float16
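
Since the checkpoint ships fill-in-the-middle special tokens (see `special_tokens_map.json` below), a FIM prompt can be assembled by hand. A sketch of the commonly documented SantaCoder FIM layout; running it through the quantized generator above is an assumption:

```python
# FIM layout: <fim-prefix>{code before gap}<fim-suffix>{code after gap}<fim-middle>
# The model then generates the missing middle after <fim-middle>.
prefix = "def print_hello_world():\n    "
suffix = "\n    print('Hello world!')"
fim_prompt = f"<fim-prefix>{prefix}<fim-suffix>{suffix}<fim-middle>"

outputs = model.generate(text=[fim_prompt], max_length=64)
print(outputs)
```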

## Hardware

- **GPUs:** 96 Tesla V100
- **Training time:** 6.2 days
- **Total FLOPS:** 2.1 x 10^21

## Software

- **Orchestration:** [Megatron-LM](https://github.com/bigcode-project/Megatron-LM)
- **Neural networks:** [PyTorch](https://github.com/pytorch/pytorch)
- **FP16 if applicable:** [apex](https://github.com/NVIDIA/apex)

# License
The model is licensed under the CodeML OpenRAIL-M v0.1 license. You can find the full license [here](https://huggingface.co/spaces/bigcode/license).
config.json ADDED
```json
{
  "bos_token": "<|endoftext|>",
  "eos_token": "<|endoftext|>",
  "unk_token": "<|endoftext|>"
}
```
model.bin ADDED
```
version https://git-lfs.github.com/spec/v1
oid sha256:22d86b7c0f6c13408c8e06ced4c22979d55e136c26bdcb586f5c12ad03c2b7e4
size 2627461067
```
special_tokens_map.json ADDED
```json
{
  "additional_special_tokens": [
    "<|endoftext|>",
    "<fim-prefix>",
    "<fim-middle>",
    "<fim-suffix>",
    "<fim-pad>"
  ]
}
```
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
```json
{
  "errors": "replace",
  "tokenizer_class": "GPT2TokenizerFast",
  "bos_token": "<|endoftext|>",
  "eos_token": "<|endoftext|>",
  "model_max_length": 2048
}
```
vocabulary.txt ADDED
The diff for this file is too large to render. See raw diff