rpand002 committed on
Commit
7a9a4d5
1 Parent(s): 0e26cfe

Update README.md

Files changed (1)
  1. README.md +39 -111
README.md CHANGED
@@ -3,15 +3,15 @@ pipeline_tag: text-generation
  inference: false
  license: apache-2.0
  datasets:
- - codeparrot/github-code-clean
- - bigcode/starcoderdata
- # - Stackexchange
- # - CommonCrawl
- - open-web-math/open-web-math
- - math-ai/StackMathQA
- # - Arxiv
- # - Wikipedia
- # - conceptofmind/FLAN_2022 # Original link is broken, we used IBM's filtered version
  metrics:
  - code_eval
  library_name: transformers
@@ -19,7 +19,7 @@ tags:
  - code
  - granite
  model-index:
- - name: granite-8b-code-base-128k
  results:
  - task:
  type: text-generation
@@ -29,7 +29,7 @@ model-index:
  metrics:
  - name: pass@1
  type: pass@1
- value: 43.1
  verified: false
  - task:
  type: text-generation
@@ -39,7 +39,7 @@ model-index:
  metrics:
  - name: pass@1
  type: pass@1
- value: 40.2
  verified: false
  - task:
  type: text-generation
@@ -49,7 +49,7 @@ model-index:
  metrics:
  - name: pass@1
  type: pass@1
- value: 28.2
  verified: false
  - task:
  type: text-generation
@@ -59,7 +59,7 @@ model-index:
  metrics:
  - name: pass@1
  type: pass@1
- value: 25.2
  verified: false
  - task:
  type: text-generation
@@ -69,7 +69,7 @@ model-index:
  metrics:
  - name: pass@1 (thresh=0.5)
  type: pass@1 (thresh=0.5)
- value: 65.0
  verified: false
  - task:
  type: text-generation
@@ -79,7 +79,7 @@ model-index:
  metrics:
  - name: pass@1 (thresh=0.5)
  type: pass@1 (thresh=0.5)
- value: 35.0
  verified: false
  - task:
  type: text-generation
@@ -89,7 +89,7 @@ model-index:
  metrics:
  - name: pass@1 (thresh=0.5)
  type: pass@1 (thresh=0.5)
- value: 39.0
  verified: false
  - task:
  type: text-generation
@@ -99,7 +99,7 @@ model-index:
  metrics:
  - name: pass@1 (thresh=0.5)
  type: pass@1 (thresh=0.5)
- value: 40.0
  verified: false
  - task:
  type: text-generation
@@ -109,97 +109,17 @@ model-index:
  metrics:
  - name: pass@1 (thresh=0.5)
  type: pass@1 (thresh=0.5)
- value: 54.0
- verified: false
- - task:
- type: text-generation
- dataset:
- type: lcc
- name: LCC (Balanced)
- metrics:
- - name: Exact Match@4K
- type: Exact Match@4K
- value: 56.5
- verified: false
- - task:
- type: text-generation
- dataset:
- type: lcc
- name: LCC (Balanced)
- metrics:
- - name: Exact Match@8K
- type: Exact Match@8K
- value: 60.1
- verified: false
- - task:
- type: text-generation
- dataset:
- type: lcc
- name: LCC (Balanced)
- metrics:
- - name: Exact Match@16K
- type: Exact Match@16K
- value: 51.8
- verified: false
- - task:
- type: text-generation
- dataset:
- type: lcc
- name: LCC (Balanced)
- metrics:
- - name: Exact Match@32K
- type: Exact Match@32K
- value: 57.4
- verified: false
- - task:
- type: text-generation
- dataset:
- type: repobench
- name: RepoBench-P (Balanced)
- metrics:
- - name: Exact Match@4K
- type: Exact Match@4K
- value: 42.7
- verified: false
- - task:
- type: text-generation
- dataset:
- type: repobench
- name: RepoBench-P (Balanced)
- metrics:
- - name: Exact Match@8K
- type: Exact Match@8K
- value: 44.0
- verified: false
- - task:
- type: text-generation
- dataset:
- type: repobench
- name: RepoBench-P (Balanced)
- metrics:
- - name: Exact Match@16K
- type: Exact Match@16K
- value: 44.8
- verified: false
- - task:
- type: text-generation
- dataset:
- type: repobench
- name: RepoBench-P (Balanced)
- metrics:
- - name: Exact Match@32K
- type: Exact Match@32K
- value: 44.5
  verified: false
  ---

  ![image/png](https://cdn-uploads.huggingface.co/production/uploads/62cd5057674cdb524450093d/1hzxoPwqkBJXshKVVe6_9.png)

- # Granite-8B-Code-Base-128K

  ## Model Summary
- **Granite-8B-Code-Base-128K** extends the context length of Granite-8B-Code-Base from 4K to 128K through continual pretraining on the original training data, but with repository-level file packing and per-language length upsampling, which we found to be critical for long-context pretraining.
- We adopt a progressive training strategy where we doubled the context window until it reached the desired length of 128K by appropriately adjusting the RoPE theta. We trained on 4B tokens in total across all stages, which is only 0.1% of Granite-8B-Code-Base's original pre-training data.
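
For illustration, each stage of the progressive context extension described above amounts to reloading the previous checkpoint with a larger `max_position_embeddings` and an enlarged RoPE base frequency, then continuing pretraining. Below is a minimal sketch of one such stage using `transformers`; the checkpoint id, stage context length, and theta scaling factor are illustrative assumptions, since the card does not list the exact per-stage values.

```python
from transformers import AutoConfig, AutoModelForCausalLM

# hypothetical starting checkpoint for the first extension stage
base_checkpoint = "ibm-granite/granite-8b-code-base"

config = AutoConfig.from_pretrained(base_checkpoint)
config.max_position_embeddings = 8192      # double the 4K window (assumed stage value)
config.rope_theta = config.rope_theta * 4  # enlarge the RoPE base frequency (assumed factor)

# the resulting model would then be continually pretrained on
# repository-level, length-upsampled data at the new context length
model = AutoModelForCausalLM.from_pretrained(base_checkpoint, config=config)
```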

  - **Developers:** IBM Research
  - **GitHub Repository:** [ibm-granite/granite-code-models](https://github.com/ibm-granite/granite-code-models)
@@ -209,29 +129,34 @@ We adopt an progressive training strategy where we doubled the context window un

  ## Usage
  ### Intended use
- Prominent enterprise use cases of LLMs in software engineering productivity with 128K context length support include code generation, code explanation, code fixing, generating unit tests, generating documentation, addressing technical debt issues, vulnerability detection, code translation, and more. All Granite Code Base models, including the **3B parameter model**, are able to handle these tasks as they were trained on a large amount of code data from 116 programming languages.

  ### Generation
- This is a simple example of how to use the **Granite-8B-Code-Base-128K** model.

  ```python
  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer
  device = "cuda" # or "cpu"
- model_path = "ibm-granite/granite-8b-code-base-128K"
  tokenizer = AutoTokenizer.from_pretrained(model_path)
  # drop device_map if running on CPU
  model = AutoModelForCausalLM.from_pretrained(model_path, device_map=device)
  model.eval()
  # change input text as desired
- input_text = "def generate():"
  # tokenize the text
- input_tokens = tokenizer(input_text, return_tensors="pt")
  # transfer tokenized inputs to the device
  for i in input_tokens:
      input_tokens[i] = input_tokens[i].to(device)
  # generate output tokens
- output = model.generate(**input_tokens)
  # decode output tokens into text
  output = tokenizer.batch_decode(output)
  # loop over the batch to print, in this example the batch size is 1
@@ -239,11 +164,14 @@ for i in output:
  print(i)
  ```

  ## Training Data
- Starting from the base Granite model, this model was further pretrained on repository-level code data with per-language oversampling, allowing it to effectively utilize up to 128K tokens of context. This continued training stage focused on a curated selection of programming languages, such as Python, C, C++, Go, Java, JavaScript, and TypeScript.
-

  ## Infrastructure
  We train the Granite Code models using two of IBM's supercomputing clusters, namely Vela and Blue Vela, outfitted with NVIDIA A100 and H100 GPUs, respectively. These clusters provide a scalable and efficient infrastructure for training our models over thousands of GPUs.

  ## Ethical Considerations and Limitations
- The use of Large Language Models involves risks and ethical considerations that people must be aware of. Regarding code generation, caution is urged against complete reliance on specific code models for crucial decisions or impactful information, as the generated code is not guaranteed to work as intended. The **Granite-8B-Code-Base-128K** model is no exception in this regard. Even though this model is suited for multiple code-related tasks, it has not undergone any safety alignment; therefore, it may produce problematic outputs. Additionally, it remains uncertain whether smaller models might exhibit increased susceptibility to hallucination in generation scenarios by copying source code verbatim from the training dataset due to their reduced sizes and memorization capacities. This aspect is currently an active area of research, and we anticipate more rigorous exploration, comprehension, and mitigations in this domain. Regarding ethics, a latent risk associated with all Large Language Models is their malicious utilization. We urge the community to use the **Granite-8B-Code-Base-128K** model with ethical intentions and in a responsible way.
 
  inference: false
  license: apache-2.0
  datasets:
+ - bigcode/commitpackft
+ - TIGER-Lab/MathInstruct
+ - meta-math/MetaMathQA
+ - glaiveai/glaive-code-assistant-v3
+ - glaive-function-calling-v2
+ - bugdaryan/sql-create-context-instruction
+ - garage-bAInd/Open-Platypus
+ - nvidia/HelpSteer
+ - bigcode/self-oss-instruct-sc2-exec-filter-50k
  metrics:
  - code_eval
  library_name: transformers

  - code
  - granite
  model-index:
+ - name: granite-8B-Code-instruct-128k
  results:
  - task:
  type: text-generation

  metrics:
  - name: pass@1
  type: pass@1
+ value: 62.2
  verified: false
  - task:
  type: text-generation

  metrics:
  - name: pass@1
  type: pass@1
+ value: 51.4
  verified: false
  - task:
  type: text-generation

  metrics:
  - name: pass@1
  type: pass@1
+ value: 38.9
  verified: false
  - task:
  type: text-generation

  metrics:
  - name: pass@1
  type: pass@1
+ value: 38.3
  verified: false
  - task:
  type: text-generation

  metrics:
  - name: pass@1 (thresh=0.5)
  type: pass@1 (thresh=0.5)
+ value: 73.0
  verified: false
  - task:
  type: text-generation

  metrics:
  - name: pass@1 (thresh=0.5)
  type: pass@1 (thresh=0.5)
+ value: 37.0
  verified: false
  - task:
  type: text-generation

  metrics:
  - name: pass@1 (thresh=0.5)
  type: pass@1 (thresh=0.5)
+ value: 73.0
  verified: false
  - task:
  type: text-generation

  metrics:
  - name: pass@1 (thresh=0.5)
  type: pass@1 (thresh=0.5)
+ value: 62.0
  verified: false
  - task:
  type: text-generation

  metrics:
  - name: pass@1 (thresh=0.5)
  type: pass@1 (thresh=0.5)
+ value: 63.0
  verified: false
  ---

+
  ![image/png](https://cdn-uploads.huggingface.co/production/uploads/62cd5057674cdb524450093d/1hzxoPwqkBJXshKVVe6_9.png)

+ # Granite-8B-Code-Instruct-128K

  ## Model Summary
+ **Granite-8B-Code-Instruct-128K** is an 8B parameter long-context instruct model fine-tuned from *Granite-8B-Code-Base-128K* on a combination of **permissively licensed** data used in training the original Granite code instruct models, in addition to synthetically generated code instruction datasets tailored for solving long-context problems. By exposing the model to both short and long context data, we aim to enhance its long-context capability without sacrificing code generation performance at short input context.

  - **Developers:** IBM Research
  - **GitHub Repository:** [ibm-granite/granite-code-models](https://github.com/ibm-granite/granite-code-models)

  ## Usage
  ### Intended use
+ The model is designed to respond to coding-related instructions over long-context input and can be used to build coding assistants.
+
+ <!-- TO DO: Check starcoder2 instruct code example that includes the template https://huggingface.co/bigcode/starcoder2-15b-instruct-v0.1 -->

  ### Generation
+ This is a simple example of how to use the **Granite-8B-Code-Instruct-128K** model.

  ```python
  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer
  device = "cuda" # or "cpu"
+ model_path = "ibm-granite/granite-8B-Code-instruct-128k"
  tokenizer = AutoTokenizer.from_pretrained(model_path)
  # drop device_map if running on CPU
  model = AutoModelForCausalLM.from_pretrained(model_path, device_map=device)
  model.eval()
  # change input text as desired
+ chat = [
+     { "role": "user", "content": "Write a code to find the maximum value in a list of numbers." },
+ ]
+ chat = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
  # tokenize the text
+ input_tokens = tokenizer(chat, return_tensors="pt")
  # transfer tokenized inputs to the device
  for i in input_tokens:
      input_tokens[i] = input_tokens[i].to(device)
  # generate output tokens
+ output = model.generate(**input_tokens, max_new_tokens=100)
  # decode output tokens into text
  output = tokenizer.batch_decode(output)
  # loop over the batch to print, in this example the batch size is 1
  for i in output:
      print(i)
  ```
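
Because the model supports a 128K-token context, prompts can include entire files or even whole repositories. The snippet below is a minimal, illustrative sketch of that pattern; the directory path, file glob, and question are placeholders rather than part of the official example, and it reuses the `model`, `tokenizer`, and `device` objects loaded above.

```python
from pathlib import Path

# gather source files into a single long context (illustrative path and glob)
repo_dir = Path("path/to/your/repository")
context = ""
for file in sorted(repo_dir.glob("**/*.py")):
    context += f"# File: {file}\n{file.read_text()}\n\n"

# ask a question that requires reading the whole repository
chat = [
    { "role": "user", "content": f"{context}\nExplain what this repository does and point out any bugs." },
]
chat = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
input_tokens = tokenizer(chat, return_tensors="pt").to(device)
output = model.generate(**input_tokens, max_new_tokens=300)
print(tokenizer.batch_decode(output, skip_special_tokens=True)[0])
```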

+ <!-- TO DO: Check this part -->
  ## Training Data
+ Granite Code Instruct models are trained on a mix of short- and long-context data, as follows.
+ * Short-Context Instruction Data: [CommitPackFT](https://huggingface.co/datasets/bigcode/commitpackft), [BigCode-SC2-Instruct](https://huggingface.co/datasets/bigcode/self-oss-instruct-sc2-exec-filter-50k), [MathInstruct](https://huggingface.co/datasets/TIGER-Lab/MathInstruct), [MetaMathQA](https://huggingface.co/datasets/meta-math/MetaMathQA), [Glaive-Code-Assistant-v3](https://huggingface.co/datasets/glaiveai/glaive-code-assistant-v3), [Glaive-Function-Calling-v2](https://huggingface.co/datasets/glaiveai/glaive-function-calling-v2), [NL2SQL11](https://huggingface.co/datasets/bugdaryan/sql-create-context-instruction), [HelpSteer](https://huggingface.co/datasets/nvidia/HelpSteer), and [OpenPlatypus](https://huggingface.co/datasets/garage-bAInd/Open-Platypus), along with a synthetically generated dataset for API calling and multi-turn code interaction with execution feedback. We also include a collection of hardcoded prompts to ensure our model generates correct outputs given inquiries about its name or developers.
+ * Long-Context Instruction Data: a synthetically generated dataset created by bootstrapping repository-level file-packed documents through Granite-8b-Code-Instruct to improve the long-context capability of the model (a minimal sketch of the file-packing step follows this list).
+
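The exact file-packing format used to build these long-context documents is not specified in this card; the sketch below only illustrates the general idea of concatenating a repository's source files into one long document. The `pack_repository` helper, the file extensions, and the separator format are hypothetical.

```python
from pathlib import Path

def pack_repository(repo_dir: str, extensions=(".py", ".java", ".js", ".ts", ".go")) -> str:
    """Concatenate a repository's source files into one long document.

    Hypothetical helper: the real packing format used during training
    is not described in this model card.
    """
    parts = []
    for path in sorted(Path(repo_dir).rglob("*")):
        if path.is_file() and path.suffix in extensions:
            parts.append(f"### File: {path}\n{path.read_text(errors='ignore')}")
    return "\n\n".join(parts)

# a packed document like this could then be used to prompt a model
# when bootstrapping long-context instruction data
packed = pack_repository("path/to/some/repository")
print(f"packed {len(packed)} characters")
```
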
  ## Infrastructure
  We train the Granite Code models using two of IBM's supercomputing clusters, namely Vela and Blue Vela, outfitted with NVIDIA A100 and H100 GPUs, respectively. These clusters provide a scalable and efficient infrastructure for training our models over thousands of GPUs.

  ## Ethical Considerations and Limitations
+ Granite code instruct models are primarily finetuned using instruction-response pairs across a specific set of programming languages. Thus, their performance may be limited with out-of-domain programming languages. In this situation, it is beneficial to provide few-shot examples to steer the model's output. Moreover, developers should perform safety testing and target-specific tuning before deploying these models in critical applications. The model also inherits ethical considerations and limitations from its base model. For more information, please refer to the *[Granite-8B-Code-Base-128K](https://huggingface.co/ibm-granite/granite-8B-Code-base-128k)* model card.
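
As a concrete illustration of the few-shot suggestion above, worked examples can be prepended as prior conversation turns before the real request. This is only a sketch: the target language, prompts, and example completion are made up, and it reuses the `model`, `tokenizer`, and `device` objects from the Generation section.

```python
# few-shot steering toward an out-of-domain language (illustrative examples)
chat = [
    { "role": "user", "content": "Write a COBOL paragraph that adds two numbers." },
    { "role": "assistant", "content": "ADD-NUMBERS.\n    ADD NUM-A TO NUM-B GIVING NUM-SUM." },
    { "role": "user", "content": "Write a COBOL paragraph that multiplies two numbers." },
]
chat = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
input_tokens = tokenizer(chat, return_tensors="pt").to(device)
output = model.generate(**input_tokens, max_new_tokens=100)
print(tokenizer.batch_decode(output, skip_special_tokens=True)[0])
```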