metadata

license: apache-2.0
tags: null
datasets:
  - code_search_net

CodeT5 (base-sized model)

Pre-trained CodeT5 model. It was introduced in the paper "CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation" by Yue Wang, Weishi Wang, Shafiq Joty, and Steven C.H. Hoi, and was first released in this repository.

Disclaimer: The team releasing CodeT5 did not write a model card for this model, so this model card was written by the Hugging Face team (more specifically, nielsr).

Model description

From the abstract:

"We present CodeT5, a unified pre-trained encoder-decoder Transformer model that better leverages the code semantics conveyed from the developer-assigned identifiers. Our model employs a unified framework to seamlessly support both code understanding and generation tasks and allows for multi-task learning. Besides, we propose a novel identifier-aware pre-training task that enables the model to distinguish which code tokens are identifiers and to recover them when they are masked. Furthermore, we propose to exploit the user-written code comments with a bimodal dual generation task for better NL-PL alignment. Comprehensive experiments show that CodeT5 significantly outperforms prior methods on understanding tasks such as code defect detection and clone detection, and generation tasks across various directions including PL-NL, NL-PL, and PL-PL. Further analysis reveals that our model can better capture semantic information from code."

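To make the identifier-aware objective concrete, the following is a toy sketch, not the authors' preprocessing pipeline. It assumes Python source, uses the standard-library `tokenize` module to find identifiers, and replaces each distinct identifier with a T5-style sentinel such as `<extra_id_0>`. The function name `mask_identifiers` and the choice to reuse one sentinel for every occurrence of a name are assumptions made for illustration.

```python
import io
import keyword
import tokenize


def mask_identifiers(source):
    """Replace every non-keyword NAME token with a sentinel (toy illustration)."""
    lines = source.splitlines(keepends=True)
    sentinels = {}  # identifier -> sentinel, in order of first appearance
    spans = []      # (row, start_col, end_col, sentinel) for each occurrence

    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            sentinel = sentinels.setdefault(
                tok.string, f"<extra_id_{len(sentinels)}>"
            )
            spans.append((tok.start[0] - 1, tok.start[1], tok.end[1], sentinel))

    # Rewrite from the last occurrence backwards so earlier column offsets stay valid.
    for row, start, end, sentinel in reversed(spans):
        lines[row] = lines[row][:start] + sentinel + lines[row][end:]

    return "".join(lines), sentinels


masked, mapping = mask_identifiers("def add(a, b):\n    return a + b\n")
print(masked)
print(mapping)
```

A model trained on such pairs would see the masked text as input and be asked to recover the names in `mapping`; how the real training targets are laid out is described in the paper, not here.
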
Intended uses & limitations

You can fine-tune the model on code understanding tasks, such as the code defect detection and clone detection tasks named in the abstract above. See the model hub for versions fine-tuned on a task that interests you.

How to use

Here is how to use this model:

from transformers import RobertaTokenizer, T5ForConditionalGeneration

tokenizer = RobertaTokenizer.from_pretrained('Salesforce/codet5-base')
model = T5ForConditionalGeneration.from_pretrained('Salesforce/codet5-base')

text = "def greet(user): print(f'hello <extra_id_0>!') </s>"
input_ids = tokenizer(text, return_tensors="pt").input_ids

# simply generate a single sequence
generated_ids = model.generate(input_ids, max_length=8)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
# this prints {user.name}

# or, generating 20 sequences with maximum length set to 10
outputs = model.generate(input_ids=input_ids, 
                          num_beams=200, num_return_sequences=20,
                          max_length=10)

_0_index = text.index('<extra_id_0>')
_result_prefix = text[:_0_index]
_result_suffix = text[_0_index+12:]  # 12 is the length of <extra_id_0>

def _filter(output, end_token='<extra_id_1>'):
    # The first token is <pad> (indexed at 0), the second token is <s> (indexed at 1)
    # and the third token is <extra_id_0> (indexed at 32099)
    # So we only decode from the fourth generated id
    _txt = tokenizer.decode(output[3:], skip_special_tokens=False, clean_up_tokenization_spaces=False)
    if end_token in _txt:
        _end_token_index = _txt.index(end_token)
        return _result_prefix + _txt[:_end_token_index] + _result_suffix
    else:
        return _result_prefix + _txt + _result_suffix

results = list(map(_filter, outputs))
print(results)
# this prints:
#["def greet(user): print(f'hello {user.name} {user!') </s>",
# "def greet(user): print(f'hello {user.username} {user!') </s>",
# "def greet(user): print(f'hello {user.name}: {user!') </s>",
# "def greet(user): print(f'hello {user}') print(f!') </s>",
# "def greet(user): print(f'hello {user.name} �!') </s>",
# "def greet(user): print(f'hello {user}') print ( f!') </s>",
# "def greet(user): print(f'hello {user.username}: {user!') </s>",
# "def greet(user): print(f'hello {user}' ) print(f!') </s>",
# "def greet(user): print(f'hello {user.username} �!') </s>",
# "def greet(user): print(f'hello {user.name}, {user!') </s>",
# "def greet(user): print(f'hello {user.login} {user!') </s>",
# "def greet(user): print(f'hello {user} →!') </s>",
# "def greet(user): print(f'hello {user}!') print(!') </s>",
# "def greet(user): print(f'hello {user.name} ({user!') </s>",
# "def greet(user): print(f'hello {user.email} {user!') </s>",
# "def greet(user): print(f'hello {user}!') print (!') </s>",
# "def greet(user): print(f'hello {user.username}, {user!') </s>",
# "def greet(user): print(f'hello {user}' ) print ( f!') </s>",
# "def greet(user): print(f'hello {user.nickname} {!') </s>",
# "def greet(user): print(f'hello {user} {user.name!') </s>"]
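
The post-processing in `_filter` is plain string manipulation, so it can be checked without loading the model. The sketch below is one possible variant: `splice_prediction` is a hypothetical helper name, and it uses `str.partition` in place of the hard-coded offset of 12. The generated string in the usage lines is made up for illustration and is not real model output.

```python
SENTINEL = "<extra_id_0>"


def splice_prediction(masked_text, generated, end_token="<extra_id_1>"):
    """Insert the predicted span into the masked text at the sentinel position."""
    # Split the input around the first sentinel; no manual length arithmetic needed.
    prefix, _, suffix = masked_text.partition(SENTINEL)
    # Keep only what the model produced before the next sentinel, if one is present.
    span = generated.split(end_token, 1)[0]
    return prefix + span + suffix


masked_text = "def greet(user): print(f'hello <extra_id_0>!')"
hypothetical_generation = "{user.name}<extra_id_1> trailing tokens"
print(splice_prediction(masked_text, hypothetical_generation))
```

Cutting the generation at `<extra_id_1>` mirrors the `end_token` check in `_filter`; when the end token is absent, the whole generated string is used.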

Training data

The CodeT5 model was pretrained on CodeSearchNet (Husain et al., 2019). Additionally, the authors collected two datasets of C/CSharp from BigQuery so that every downstream task shares its programming languages with the pre-training data. In total, around 8.35 million instances were used for pretraining.

Training procedure

Preprocessing

This model uses a code-specific BPE (Byte-Pair Encoding) tokenizer. One can prepare text (or code) for the model using RobertaTokenizer, with the files from this repository.
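
For readers unfamiliar with BPE, here is a toy, character-level sketch of how merge rules are learned: count adjacent symbol pairs, merge the most frequent pair, and repeat. It is not the CodeT5 tokenizer, whose vocabulary ships with the repository files. The function `learn_bpe_merges` and the tiny word list are assumptions made for illustration.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy BPE)."""
    # Each word starts as a tuple of single characters, weighted by its frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs across the corpus.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair into one symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges


merges = learn_bpe_merges(["def", "def", "def", "del"], num_merges=2)
print(merges)  # [('d', 'e'), ('de', 'f')]
```

Frequent fragments such as `de` and then `def` become single symbols first, which is why a tokenizer trained on code tends to keep common keywords and identifier pieces intact.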

Evaluation results

For evaluation results on several downstream benchmarks, we refer to the paper.

BibTeX entry and citation info

@misc{wang2021codet5,
      title={CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation}, 
      author={Yue Wang and Weishi Wang and Shafiq Joty and Steven C. H. Hoi},
      year={2021},
      eprint={2109.00859},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}