---
license: bsd-3-clause
---

# CodeT5+ 110M Embedding Models

## Model description

[CodeT5+](https://github.com/salesforce/CodeT5/tree/main/CodeT5+) is a new family of open code large language models
with an encoder-decoder architecture that can flexibly operate in different modes (i.e. _encoder-only_, _decoder-only_,
and _encoder-decoder_) to support a wide range of code understanding and generation tasks.
It is introduced in the paper:

[CodeT5+: Open Code Large Language Models for Code Understanding and Generation](https://arxiv.org/pdf/2305.07922.pdf)
by [Yue Wang](https://yuewang-cuhk.github.io/)\*, [Hung Le](https://sites.google.com/view/henryle2018/home?pli=1)\*, [Akhilesh Deepak Gotmare](https://akhileshgotmare.github.io/), [Nghi D.Q. Bui](https://bdqnghi.github.io/), [Junnan Li](https://sites.google.com/site/junnanlics), [Steven C.H. Hoi](https://sites.google.com/view/stevenhoi/home) (\* indicates equal contribution).

Compared to the original CodeT5 family (base: `220M`, large: `770M`), CodeT5+ is pretrained with a diverse set of
pretraining tasks including _span denoising_, _causal language modeling_, _contrastive learning_, and _text-code
matching_ to learn rich representations from both unimodal code data and bimodal code-text data.
Additionally, it employs a simple yet effective _compute-efficient pretraining_ method to initialize the model
components with frozen off-the-shelf LLMs such as [CodeGen](https://github.com/salesforce/CodeGen) to efficiently scale
up the model (i.e. `2B`, `6B`, `16B`), and adopts a "shallow encoder and deep decoder" architecture.
Furthermore, it is instruction-tuned to align with natural language instructions (see our InstructCodeT5+ 16B)
following [Code Alpaca](https://github.com/sahil280114/codealpaca).

## How to use

This checkpoint consists of the encoder of the CodeT5+ 220M model (pretrained in two stages on both unimodal code data
and bimodal code-text data) and a projection layer, which together can be used to extract 256-dimensional code
embeddings. It can be easily loaded using the `AutoModel` functionality and employs the
same [CodeT5](https://github.com/salesforce/CodeT5) tokenizer.

```python
from transformers import AutoModel, AutoTokenizer

checkpoint = "Salesforce/codet5p-110m-embedding"
device = "cuda"  # for GPU usage or "cpu" for CPU usage

tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
model = AutoModel.from_pretrained(checkpoint, trust_remote_code=True).to(device)

inputs = tokenizer.encode("def print_hello_world():\tprint('Hello World!')", return_tensors="pt").to(device)
embedding = model(inputs)[0]
print(f'Dimension of the embedding: {embedding.size()[0]}, with norm={embedding.norm().item()}')
# Dimension of the embedding: 256, with norm=1.0
print(embedding)
# tensor([ 0.0185, 0.0229, -0.0315, -0.0307, -0.1421, -0.0575, -0.0275, 0.0501,
#          0.0203, 0.0337, -0.0067, -0.0075, -0.0222, -0.0107, -0.0250, -0.0657,
#          0.1571, -0.0994, -0.0370, 0.0164, -0.0948, 0.0490, -0.0352, 0.0907,
#          -0.0198, 0.0130, -0.0921, 0.0209, 0.0651, 0.0319, 0.0299, -0.0173,
#          -0.0693, -0.0798, -0.0066, -0.0417, 0.1076, 0.0597, -0.0316, 0.0940,
#          -0.0313, 0.0993, 0.0931, -0.0427, 0.0256, 0.0297, -0.0561, -0.0155,
#          -0.0496, -0.0697, -0.1011, 0.1178, 0.0283, -0.0571, -0.0635, -0.0222,
#          0.0710, -0.0617, 0.0423, -0.0057, 0.0620, -0.0262, 0.0441, 0.0425,
#          -0.0413, -0.0245, 0.0043, 0.0185, 0.0060, -0.1727, -0.1152, 0.0655,
#          -0.0235, -0.1465, -0.1359, 0.0022, 0.0177, -0.0176, -0.0361, -0.0750,
#          -0.0464, -0.0846, -0.0088, 0.0136, -0.0221, 0.0591, 0.0876, -0.0903,
#          0.0271, -0.1165, -0.0169, -0.0566, 0.1173, -0.0801, 0.0430, 0.0236,
#          0.0060, -0.0778, -0.0570, 0.0102, -0.0172, -0.0051, -0.0891, -0.0620,
#          -0.0536, 0.0190, -0.0039, -0.0189, -0.0267, -0.0389, -0.0208, 0.0076,
#          -0.0676, 0.0630, -0.0962, 0.0418, -0.0172, -0.0229, -0.0452, 0.0401,
#          0.0270, 0.0677, -0.0111, -0.0089, 0.0175, 0.0703, 0.0714, -0.0068,
#          0.1214, -0.0004, 0.0020, 0.0255, 0.0424, -0.0030, 0.0318, 0.1227,
#          0.0676, -0.0723, 0.0970, 0.0637, -0.0140, -0.0283, -0.0120, 0.0343,
#          -0.0890, 0.0680, 0.0514, 0.0513, 0.0627, -0.0284, -0.0479, 0.0068,
#          -0.0794, 0.0202, 0.0208, -0.0113, -0.0747, 0.0045, -0.0854, -0.0609,
#          -0.0078, 0.1168, 0.0618, -0.0223, -0.0755, 0.0182, -0.0128, 0.1116,
#          0.0240, 0.0342, 0.0119, -0.0235, -0.0150, -0.0228, -0.0568, -0.1528,
#          0.0164, -0.0268, 0.0727, -0.0569, 0.1306, 0.0643, -0.0158, -0.1070,
#          -0.0107, -0.0139, -0.0363, 0.0366, -0.0986, -0.0628, -0.0277, 0.0316,
#          0.0363, 0.0038, -0.1092, -0.0679, -0.1398, -0.0648, 0.1711, -0.0666,
#          0.0563, 0.0581, 0.0226, 0.0347, -0.0672, -0.0229, -0.0565, 0.0623,
#          0.1089, -0.0687, -0.0901, -0.0073, 0.0426, 0.0870, -0.0390, -0.0144,
#          -0.0166, 0.0262, -0.0310, 0.0467, -0.0164, -0.0700, -0.0602, -0.0720,
#          -0.0386, 0.0067, -0.0337, -0.0053, 0.0829, 0.1004, 0.0427, 0.0026,
#          -0.0537, 0.0951, 0.0584, -0.0583, -0.0208, 0.0124, 0.0067, 0.0403,
#          0.0091, -0.0044, -0.0036, 0.0524, 0.1103, -0.1511, -0.0479, 0.1709,
#          0.0772, 0.0721, -0.0332, 0.0866, 0.0799, -0.0581, 0.0713, 0.0218],
#        device='cuda:0', grad_fn=<SelectBackward0>)
```
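
As the `norm=1.0` output above shows, the returned embeddings are L2-normalized, so the dot product of two embeddings equals their cosine similarity, which is the natural score for code search. The snippet below is only an illustrative sketch, not part of the official API: the `embed` helper and the candidate snippets are invented for the example, and it reuses the `model`, `tokenizer`, and `device` defined above.

```python
import torch

def embed(code: str) -> torch.Tensor:
    """Return a 256-dimensional, unit-norm embedding for one code snippet."""
    # Reuses `model`, `tokenizer`, and `device` from the snippet above.
    input_ids = tokenizer.encode(code, return_tensors="pt").to(device)
    with torch.no_grad():
        return model(input_ids)[0]

query = "def print_hello_world():\tprint('Hello World!')"
candidates = [
    "def add(a, b):\n    return a + b",
    "def greet():\n    print('Hello World!')",
    "def read_file(path):\n    with open(path) as f:\n        return f.read()",
]

query_emb = embed(query)
cand_embs = torch.stack([embed(c) for c in candidates])

# Embeddings are unit-norm, so a dot product equals cosine similarity.
scores = cand_embs @ query_emb
best = int(scores.argmax())
print(f"Best match (cosine similarity {scores[best].item():.4f}):\n{candidates[best]}")
```

Because the vectors are unit-norm, ranking candidates by dot product gives the same order as ranking by Euclidean distance.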

## Pretraining data

This checkpoint is trained on the stricter permissive subset of the deduplicated version of
the [github-code dataset](https://huggingface.co/datasets/codeparrot/github-code).
The data is preprocessed by retaining only permissively licensed code ("mit", "apache-2", "bsd-3-clause", "bsd-2-clause",
"cc0-1.0", "unlicense", "isc").
Supported languages (9 in total) are as follows:
`c`, `c++`, `c-sharp`, `go`, `java`, `javascript`, `php`, `python`, `ruby`.
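
For reference, the linked dataset's loading script exposes language and license filters (as documented on its dataset card), so a similarly filtered slice can be streamed as in the rough sketch below. This is only an illustration under the assumption that your `datasets` version still supports the dataset's loading script; it is not the exact preprocessing pipeline used for pretraining.

```python
from datasets import load_dataset

# Illustrative only: stream a permissively licensed, single-language slice of github-code.
# `languages`/`licenses` are the dataset script's own filters (see its dataset card);
# the actual CodeT5+ preprocessing additionally deduplicates and uses the stricter permissive subset.
ds = load_dataset(
    "codeparrot/github-code",
    split="train",
    streaming=True,
    languages=["Python"],
    licenses=["mit", "bsd-3-clause", "isc"],
    trust_remote_code=True,
)

for example in ds.take(3):
    print(example["repo_name"], example["license"], len(example["code"]))
```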

## Training procedure

This checkpoint is first trained on unimodal code data in the first-stage pretraining and then on bimodal text-code
pair data using the proposed mixture of pretraining tasks.
Please refer to the paper for more details.

## Evaluation results

We show the zero-shot results of this checkpoint on 6 downstream code retrieval tasks from CodeXGLUE in the following table.

| Ruby  | JavaScript | Go    | Python | Java  | PHP   | Overall |
| ----- | ---------- | ----- | ------ | ----- | ----- | ------- |
| 74.51 | 69.07      | 90.69 | 71.55  | 71.82 | 67.72 | 74.23   |
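
These retrieval tasks rank, for each natural-language query, the code snippets in a candidate pool by embedding similarity; mean reciprocal rank (MRR) of the ground-truth snippet is the usual metric for this setup. The following is an illustrative sketch of that scoring step (not the official evaluation script), assuming queries and code snippets have already been embedded into unit-norm vectors as shown above:

```python
import torch

def mean_reciprocal_rank(query_embs: torch.Tensor, code_embs: torch.Tensor) -> float:
    """MRR when query i's ground-truth code is code i (both inputs are (N, 256) and unit-norm)."""
    sims = query_embs @ code_embs.T                 # (N, N) cosine similarities
    order = sims.argsort(dim=1, descending=True)    # candidate indices per query, best first
    gold = torch.arange(sims.size(0)).unsqueeze(1)  # ground-truth index for each query
    rank = (order == gold).nonzero()[:, 1] + 1      # 1-based rank of the ground-truth code
    return (1.0 / rank.float()).mean().item()
```

Higher is better; a retriever that always ranks the ground-truth snippet first scores 1.0.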

## BibTeX entry and citation info

```bibtex
@article{wang2023codet5plus,
  title={CodeT5+: Open Code Large Language Models for Code Understanding and Generation},
  author={Wang, Yue and Le, Hung and Gotmare, Akhilesh Deepak and Bui, Nghi D.Q. and Li, Junnan and Hoi, Steven C. H.},
  journal={arXiv preprint},
  year={2023}
}
```