---
license: bsd-3-clause
---

# CodeT5+ 2B

## Model description

[CodeT5+](https://github.com/salesforce/CodeT5/tree/main/CodeT5+) is a new family of open code large language models with an encoder-decoder architecture that can flexibly operate in different modes (i.e. _encoder-only_, _decoder-only_, and _encoder-decoder_) to support a wide range of code understanding and generation tasks.
It is introduced in the paper:
[CodeT5+: Open Code Large Language Models for Code Understanding and Generation](https://arxiv.org/pdf/2305.07922.pdf)
by [Yue Wang](https://yuewang-cuhk.github.io/)\*, [Hung Le](https://sites.google.com/view/henryle2018/home?pli=1)\*, [Akhilesh Deepak Gotmare](https://akhileshgotmare.github.io/), [Nghi D.Q. Bui](https://bdqnghi.github.io/), [Junnan Li](https://sites.google.com/site/junnanlics), [Steven C.H. Hoi](https://sites.google.com/view/stevenhoi/home) (* indicates equal contribution).
Compared to the original CodeT5 family (base: `220M`, large: `770M`), CodeT5+ is pretrained with a diverse set of pretraining tasks including _span denoising_, _causal language modeling_, _contrastive learning_, and _text-code matching_ to learn rich representations from both unimodal code data and bimodal code-text data.
Additionally, it employs a simple yet effective _compute-efficient pretraining_ method to initialize the model components with frozen off-the-shelf LLMs such as [CodeGen](https://github.com/salesforce/CodeGen) to efficiently scale up the model (i.e. `2B`, `6B`, `16B`), and adopts a "shallow encoder and deep decoder" architecture.
Furthermore, it is instruction-tuned to align with natural language instructions (see our InstructCodeT5+ 16B) following [Code Alpaca](https://github.com/sahil280114/codealpaca).

## How to use

This model can be easily loaded using the `AutoModelForSeq2SeqLM` functionality and employs the same tokenizer as [CodeGen](https://github.com/salesforce/CodeGen).
```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
checkpoint = "Salesforce/codet5p-2b"
device = "cuda" # for GPU usage or "cpu" for CPU usage
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint,
                                              torch_dtype=torch.float16,
                                              trust_remote_code=True).to(device)
inputs = tokenizer.encode("def print_hello():", return_tensors="pt").to(device)
outputs = model.generate(inputs, max_length=12)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
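
The snippet above decodes greedily with a short `max_length`. Continuing from it (`model`, `tokenizer`, and `device` are assumed to already be defined), longer sampled completions can be requested through the standard `generate` arguments; the sampling values below are illustrative rather than tuned for this checkpoint:

```python
# Continues the snippet above: `model`, `tokenizer`, and `device` are assumed to exist.
inputs = tokenizer.encode("def fibonacci(n):", return_tensors="pt").to(device)

# Sampled decoding via standard `generate` arguments; values are illustrative, not tuned.
outputs = model.generate(
    inputs,
    do_sample=True,
    temperature=0.8,
    top_p=0.95,
    max_length=64,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```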

## Pretraining data

This checkpoint is trained on the stricter permissive subset of the deduplicated version of the [github-code dataset](https://huggingface.co/datasets/codeparrot/github-code).
The data is preprocessed by retaining only permissively licensed code ("mit", "apache-2", "bsd-3-clause", "bsd-2-clause", "cc0-1.0", "unlicense", "isc"); a minimal filtering sketch is shown after the language list below.
Supported languages (9 in total) are as follows:
`c`, `c++`, `c-sharp`, `go`, `java`, `javascript`, `php`, `python`, `ruby`.
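
As a rough illustration of this preprocessing (not the exact CodeT5+ pipeline), the permissive subset can be approximated with the `datasets` library. The field names (`license`, `language`) and license tags below follow the `github-code` dataset card; whether they match the deduplicated variant actually used for training is an assumption:

```python
from datasets import load_dataset

# Illustrative sketch of the license/language filter described above; the exact
# CodeT5+ preprocessing (and the deduplicated dataset variant it used) may differ.
PERMISSIVE_LICENSES = {"mit", "apache-2.0", "bsd-3-clause", "bsd-2-clause",
                       "cc0-1.0", "unlicense", "isc"}
LANGUAGES = {"C", "C++", "C#", "GO", "Java", "JavaScript", "PHP", "Python", "Ruby"}

# Depending on your `datasets` version, `trust_remote_code=True` may be required here.
ds = load_dataset("codeparrot/github-code", split="train", streaming=True)
permissive_code = ds.filter(
    lambda ex: ex["license"] in PERMISSIVE_LICENSES and ex["language"] in LANGUAGES
)

for example in permissive_code.take(3):
    print(example["repo_name"], example["path"], example["license"])
```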

## Training procedure

This checkpoint is initialized from off-the-shelf LLMs, i.e. its encoder is initialized from CodeGen-350M-mono and its decoder is initialized from CodeGen-2B-mono.
In the first stage of pretraining, it is trained on unimodal code data with a diverse set of objectives, including _span denoising_ and two variants of _causal language modeling_ (a schematic sketch of span denoising is given at the end of this section).
After that, it is further trained on the Python subset with the causal language modeling objective for another epoch to better adapt it to Python code generation.
Finally, we apply instruction tuning to align it with natural language instructions following [Code Alpaca](https://github.com/sahil280114/codealpaca).
Please refer to the paper for more details.
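
The following is a purely schematic sketch of the _span denoising_ objective in the T5-style sentinel format (assumed here for illustration; the actual CodeT5+ span sampling, sentinel vocabulary, and tokenization may differ):

```python
# Schematic only: build an (encoder input, decoder target) pair by corrupting spans
# of a toy code snippet with sentinel tokens, in the style of T5 span denoising.
tokens = ["def", "add", "(", "a", ",", "b", ")", ":", "return", "a", "+", "b"]
spans_to_mask = [(1, 2), (8, 11)]  # (start, end) token indices, chosen arbitrarily

source, target, cursor = [], [], 0
for sentinel, (start, end) in enumerate(spans_to_mask):
    source += tokens[cursor:start] + [f"<extra_id_{sentinel}>"]
    target += [f"<extra_id_{sentinel}>"] + tokens[start:end]
    cursor = end
source += tokens[cursor:]

print("encoder input :", " ".join(source))   # def <extra_id_0> ( a , b ) : <extra_id_1> b
print("decoder target:", " ".join(target))   # <extra_id_0> add <extra_id_1> return a +
```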

## Evaluation results

CodeT5+ models have been comprehensively evaluated on a wide range of code understanding and generation tasks in various settings: _zero-shot_, _finetuning_, and _instruction-tuning_.
Specifically, CodeT5+ yields substantial performance gains over SoTA baselines on many downstream tasks, e.g.,
8 text-to-code retrieval tasks (+3.2 avg. MRR), 2 line-level code completion tasks (+2.1 avg. Exact Match), and 2 retrieval-augmented code generation tasks (+5.8 avg. BLEU-4).
On 2 math programming tasks, MathQA-Python and GSM8K-Python, CodeT5+ models with fewer than a billion parameters significantly outperform many LLMs of up to 137B parameters.
Particularly, in the zero-shot text-to-code generation task on the HumanEval benchmark, InstructCodeT5+ 16B sets new SoTA results of 35.0% pass@1 and 54.5% pass@10 against other open code LLMs, even surpassing the closed-source OpenAI code-cushman-001 model.
Please refer to the [paper](https://arxiv.org/pdf/2305.07922.pdf) for more details.

## BibTeX entry and citation info

```bibtex
@article{wang2023codet5plus,
  title={CodeT5+: Open Code Large Language Models for Code Understanding and Generation},
  author={Wang, Yue and Le, Hung and Gotmare, Akhilesh Deepak and Bui, Nghi D.Q. and Li, Junnan and Hoi, Steven C. H.},
  journal={arXiv preprint},
  year={2023}
}
```