CodeT5+: Open Code Large Language Models for Code Understanding and Generation
Abstract
Large language models (LLMs) pretrained on vast source code have achieved prominent progress in code intelligence. However, existing code LLMs have two main limitations in terms of architecture and pretraining tasks. First, they often adopt a specific architecture (encoder-only or decoder-only) or rely on a unified encoder-decoder network for different downstream tasks. The former paradigm is limited by inflexibility in applications, while in the latter, the model is treated as a single system for all tasks, leading to suboptimal performance on a subset of tasks. Second, they often employ a limited set of pretraining objectives which might not be relevant to some downstream tasks and hence result in substantial performance degradation. To address these limitations, we propose "CodeT5+", a family of encoder-decoder LLMs for code in which component modules can be flexibly combined to suit a wide range of downstream code tasks. Such flexibility is enabled by our proposed mixture of pretraining objectives to mitigate the pretrain-finetune discrepancy. These objectives cover span denoising, contrastive learning, text-code matching, and causal LM pretraining tasks, on both unimodal and bimodal multilingual code corpora. Furthermore, we propose to initialize CodeT5+ with frozen off-the-shelf LLMs without training from scratch to efficiently scale up our models, and explore instruction tuning to align with natural language instructions. We extensively evaluate CodeT5+ on over 20 code-related benchmarks in different settings, including zero-shot, finetuning, and instruction-tuning. We observe state-of-the-art (SoTA) model performance on various code-related tasks, such as code generation and completion, math programming, and text-to-code retrieval tasks. In particular, our instruction-tuned CodeT5+ 16B achieves new SoTA results on the HumanEval code generation task against other open code LLMs.
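To make the contrastive objective concrete, here is a minimal InfoNCE-style loss sketch for text-code contrastive learning. It assumes a precomputed similarity matrix whose diagonal holds the matched text-code pairs; in the actual model the code-side embeddings would come from a slowly updated momentum encoder (as in MoCo), and the temperature value below is illustrative, not the paper's setting.

```python
import math

def info_nce_loss(sim, temperature=0.07):
    """InfoNCE-style text-code contrastive loss sketch.

    sim[i][j] is the similarity between text embedding i and code
    embedding j; diagonal entries are the matched (positive) pairs.
    Returns the mean negative log-softmax of the positive per row.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        logits = [s / temperature for s in sim[i]]
        m = max(logits)  # subtract max for numerical stability
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]  # -log p(positive pair | row i)
    return total / n
```

Training pushes diagonal similarities up relative to the off-diagonal negatives, so for example `info_nce_loss([[5.0, 0.0], [0.0, 5.0]], 1.0)` is smaller than `info_nce_loss([[1.0, 0.0], [0.0, 1.0]], 1.0)`.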
Community
Proposes CodeT5+: a family of encoder-decoder (shallow encoder, deep decoder) large language models (LLMs) for downstream code tasks, pretrained on multilingual code corpora with a mixture of objectives (span denoising, contrastive learning, text-code matching, etc.) and optionally initialized from off-the-shelf code LLMs (CodeGen). The encoder has bidirectional self-attention and feed-forward layers; the decoder has causal self-attention, cross-attention, and feed-forward layers. Stage 1: unimodal pretraining on code data, with random token spans (including sub-words) replaced by indexed sentinel tokens (as in T5), plus causal language modeling where the model predicts the code following a pivot position (a special CLM token at the first position). Stage 2: bimodal pretraining on text-code pairs, comprising text-code contrastive learning for the encoder with a momentum encoder (as in MoCo, momentum contrastive learning); text-code matching (binary prediction) for the decoder; and text-code causal LM for encoder and decoder in both text-to-code and code-to-text directions, capturing the cross-modal relationship. Pretrained on a GitHub code dataset, then fine-tuned with instructions. Comparisons with StarCoder, GPT-4, LLaMA, etc.: not as strong on HumanEval, but strong on MathQA, code completion, and retrieval. Losses and downstream fine-tuning details are in the appendix. From Salesforce.
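The Stage 1 span denoising objective can be sketched as T5-style span corruption: masked spans are replaced in the input by indexed sentinel tokens, and the target enumerates each sentinel followed by the tokens it hid. The `<extra_id_k>` naming follows the T5 convention; the paper samples spans randomly, while this sketch takes them as an explicit argument so the example is deterministic.

```python
def span_corrupt(tokens, spans):
    """T5-style span corruption sketch (Stage 1 span denoising).

    `spans` is a sorted list of non-overlapping (start, length) pairs
    to mask. Each masked span becomes one indexed sentinel token in
    the input; the target lists each sentinel followed by the hidden
    tokens, ending with a closing sentinel (T5 convention).
    """
    masked, target = [], []
    cursor = 0
    for sid, (start, length) in enumerate(spans):
        masked.extend(tokens[cursor:start])          # keep tokens before the span
        sentinel = f"<extra_id_{sid}>"
        masked.append(sentinel)                      # span replaced by sentinel
        target.append(sentinel)
        target.extend(tokens[start:start + length])  # target reconstructs the span
        cursor = start + length
    masked.extend(tokens[cursor:])                   # keep the tail
    target.append(f"<extra_id_{len(spans)}>")        # closing sentinel
    return masked, target

code = "def add ( a , b ) : return a + b".split()
inp, tgt = span_corrupt(code, [(1, 1), (8, 3)])
# inp: def <extra_id_0> ( a , b ) : <extra_id_1> b
# tgt: <extra_id_0> add <extra_id_1> return a + <extra_id_2>
```

The causal LM variant of Stage 1 differs only in what the target is: instead of scattered spans, the model predicts the entire suffix after a pivot position marked by a special CLM token.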