TomPei committed on
Commit 1bd3944 · verified · 1 Parent(s): 3cc12ec

Update README.md

Files changed (1)
  1. README.md +103 -0
README.md CHANGED
@@ -18,3 +18,106 @@ OpenCSG stands for Converged resources, Software refined, and Generative LM. The

The vision of OpenCSG is to empower every industry, every company, and every individual to own their models. We adhere to the principles of openness and open source, making the large model software stack of OpenCSG available to the community. We welcome everyone to use it, provide feedback, and contribute collaboratively.

## Model Description

The [StarCoder](https://huggingface.co/bigcode/starcoder) models are 15.5B-parameter models trained on 80+ programming languages from [The Stack (v1.2)](https://huggingface.co/datasets/bigcode/the-stack), with opt-out requests excluded.

This is the repository for the 7B version, fine-tuned from [CodeLlama-7b-hf](https://huggingface.co/codellama/CodeLlama-7b-hf).

| Model Size | Fine-tuned Model |
| --- | ----------------------------------------------------------------------------- |
| 7B  | [opencsg/Opencsg-CodeLlama-7b-v0.1](https://huggingface.co/opencsg/opencsg-CodeLlama-7b-v0.1) |
| 13B | [opencsg/Opencsg-CodeLlama-13b-v0.1](https://huggingface.co/opencsg/opencsg-CodeLlama-13b-v0.1) |
| 34B | [opencsg/Opencsg-CodeLlama-34b-v0.1](https://huggingface.co/opencsg/opencsg-CodeLlama-34b-v0.1) |

## Model Eval

HumanEval is the most common benchmark for evaluating code-generation models, especially on the completion of code exercises.
To some extent, model evaluation is a kind of metaphysics: different models are sensitive to different decoding methods, parameters, and instructions.
It is impractical for us to hand-tune a specific configuration for each fine-tuned model, because a truly capable LLM should retain its general ability regardless of the parameters users choose.

Thus, OpenCSG strove to provide a relatively fair method for comparing the fine-tuned models on the HumanEval benchmark.
To simplify the comparison, we chose the Pass@1 metric for Python, although our fine-tuning dataset includes samples in multiple languages.

**For fairness, we evaluated the fine-tuned and original CodeLlama models with only the original problems' prompts, adding no extra instructions.**

**In addition, we used greedy decoding for every model during the evaluation.**

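As a concrete illustration of this protocol, the sketch below produces one greedy completion per problem from the raw prompt and scores Pass@1 as the fraction of problems whose completion passes the unit tests. The `problems` iterable and the `passes_tests` harness are placeholder assumptions, not part of this repository; in practice an execution-based checker such as OpenAI's human-eval package would supply both the problems and the tests.

```python
# Illustrative sketch of the Pass@1 protocol described above:
# greedy decoding, with the raw HumanEval prompt as the only input.
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "opencsg/opencsg-CodeLlama-7b-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

def complete(prompt: str, max_new_tokens: int = 256) -> str:
    """Return a single greedy completion for a raw HumanEval prompt."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(
        **inputs,
        do_sample=False,                  # greedy decoding
        max_new_tokens=max_new_tokens,
        eos_token_id=tokenizer.eos_token_id,
    )
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)

def pass_at_1(problems, passes_tests) -> float:
    """problems: iterable of dicts with a 'prompt' key.
    passes_tests(problem, completion) -> bool is an external test harness."""
    results = [passes_tests(p, complete(p["prompt"])) for p in problems]
    return sum(results) / len(results)
```
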
| Model | HumanEval Python Pass@1 |
| --- | ----------------------------------------------------------------------------- |
| CodeLlama-7b-hf | 30.5% |
| opencsg-CodeLlama-7b-v0.1 (4k) | **42.7%** |
| CodeLlama-13b-hf | 36.0% |
| opencsg-CodeLlama-13b-v0.1 (4k) | **45.1%** |
| CodeLlama-34b-hf | 48.2% |
| opencsg-CodeLlama-34b-v0.1 (4k) | **48.8%** |

**TODO**
- We will provide more benchmark scores for fine-tuned models in the future.
- We will provide practical problems to evaluate the performance of fine-tuned models in the field of software engineering.


# Model Usage

```python
from transformers import AutoTokenizer
import transformers
import torch

model = "opencsg/opencsg-CodeLlama-7b-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model, trust_remote_code=True)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Code-infilling prompt: the model fills in the <FILL_ME> placeholder.
input_text = """def quick_sort(arr):
    if len(arr) <= 1:
        return arr
    pivot = arr[0]
    left = []
    right = []
    <FILL_ME>
        if arr[i] < pivot:
            left.append(arr[i])
        else:
            right.append(arr[i])
    return quick_sort(left) + [pivot] + quick_sort(right)
"""

sequences = pipeline(
    input_text,
    do_sample=False,            # greedy decoding; the sampling arguments below are ignored
    top_k=10,
    temperature=0.1,
    top_p=0.95,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=256,
)
for seq in sequences:
    print(seq['generated_text'][len(input_text):])
```
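
In this example, the `<FILL_ME>` placeholder triggers CodeLlama's code-infilling mode: the tokenizer splits the prompt into a prefix and a suffix, and the model generates the missing middle (here, the loop header over the remaining elements, with the `if`/`else` lines already indented as that loop's body). Note also that with `do_sample=False` the pipeline decodes greedily, so `top_k`, `top_p`, and `temperature` have no effect.
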
# Training

## Base Model

[CodeLlama-7b-hf](https://huggingface.co/codellama/CodeLlama-7b-hf)

## Hardware

- **GPUs:** 8 Tesla A800
- **Training time:** 4 hours

## Software

- **Orchestration:** [DeepSpeed](https://github.com/OpenCSGs)
- **Neural networks:** [PyTorch](https://github.com/pytorch/pytorch)
- **BF16 if applicable:** [apex](https://github.com/NVIDIA/apex)
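
The exact training recipe and hyperparameters are not published in this README. Purely as an illustration of the stack listed above (DeepSpeed orchestration, PyTorch, BF16 mixed precision), the sketch below shows how such a fine-tuning run could be configured with the Hugging Face `Trainer`; every path, value, and hyperparameter in it is a placeholder assumption, not the actual OpenCSG configuration.

```python
# Illustrative only: a minimal DeepSpeed + BF16 fine-tuning configuration,
# not the actual OpenCSG training setup.
from transformers import TrainingArguments

# Placeholder DeepSpeed config: ZeRO stage 2 with bf16 enabled.
ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 2},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

args = TrainingArguments(
    output_dir="codellama-7b-finetuned",   # placeholder path
    per_device_train_batch_size=4,         # placeholder hyperparameters
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    learning_rate=2e-5,
    bf16=True,                             # mixed-precision training
    deepspeed=ds_config,                   # Trainer accepts a dict or a path to a JSON file
)

# A Trainer would then be built as usual, e.g.:
# trainer = Trainer(model=model, args=args, train_dataset=your_dataset)
# trainer.train()
```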