---
license: bsd-3-clause
---

# CodeT5 (large-size model 770M)

## Model description

CodeT5 is a family of encoder-decoder language models for code from the paper: [CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation](https://arxiv.org/pdf/2109.00859.pdf) by Yue Wang, Weishi Wang, Shafiq Joty, and Steven C.H. Hoi.

The checkpoint included in this repository is denoted as **CodeT5-large** (770M), which was introduced in the paper: [CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning](https://arxiv.org/pdf/2207.01780.pdf) by Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, and Steven C.H. Hoi.

## Training data

CodeT5-large was pretrained on [CodeSearchNet](https://arxiv.org/abs/1909.09436) data in six programming languages (Ruby/JavaScript/Go/Python/Java/PHP). See Section 4.1 of the [paper](https://arxiv.org/pdf/2207.01780.pdf) for more details.

## Training procedure

CodeT5-large was pretrained using the masked span prediction objective for 150 epochs (a hedged sketch of this objective appears at the end of this card). See Section 4.1 of the [paper](https://arxiv.org/pdf/2207.01780.pdf) for more details.

## Evaluation results

We validate the effectiveness of this checkpoint, pretrained with simplified strategies, on the [CodeXGLUE](https://github.com/microsoft/CodeXGLUE) benchmark. See Appendix A.1 of the [paper](https://arxiv.org/pdf/2207.01780.pdf) for more details.

## How to use

This model can be easily loaded using the `T5ForConditionalGeneration` class:

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-large")
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-large")

text = "def greet(user): print(f'hello <extra_id_0>!')"
input_ids = tokenizer(text, return_tensors="pt").input_ids

# simply generate a single sequence
generated_ids = model.generate(input_ids, max_length=8)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
```

A hedged sketch of generating several candidate completions for the masked span appears at the end of this card.

## Ethical Considerations

This release is for research purposes only in support of an academic paper. Our models, datasets, and code are not specifically designed or evaluated for all downstream purposes. We strongly recommend users evaluate and address potential concerns related to accuracy, safety, and fairness before deploying this model. We encourage users to consider the common limitations of AI, comply with applicable laws, and leverage best practices when selecting use cases, particularly for high-risk scenarios where errors or misuse could significantly impact people's lives, rights, or safety. For further guidance on use cases, refer to our AUP and AI AUP.

## BibTeX entry and citation info

```bibtex
@inproceedings{CodeT52021,
  author    = {Yue Wang and Weishi Wang and Shafiq R. Joty and Steven C. H. Hoi},
  title     = {CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation},
  booktitle = {EMNLP},
  pages     = {8696--8708},
  publisher = {Association for Computational Linguistics},
  year      = {2021}
}

@article{CodeRL2022,
  author  = {Hung Le and Yue Wang and Akhilesh Deepak Gotmare and Silvio Savarese and Steven C. H. Hoi},
  title   = {CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning},
  journal = {arXiv preprint},
  volume  = {abs/2207.01780},
  year    = {2022}
}
```
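
## Appendix: masked span prediction sketch

As referenced in "Training procedure" above, CodeT5-large was pretrained with a T5-style masked span prediction objective: spans in the input are replaced by sentinel tokens, and the decoder learns to emit each sentinel followed by the span it replaced. The snippet below is a minimal hand-built illustration of that input/target format, not the actual pretraining pipeline; the real span lengths, corruption rate, and masking strategy follow the papers.

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-large")
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-large")

# Corrupted input: each masked span is replaced by a sentinel token.
source = "def add(<extra_id_0>): return a + <extra_id_1>"
# Target: each sentinel, followed by the tokens of the span it replaced.
target = "<extra_id_0>a, b<extra_id_1>b"

input_ids = tokenizer(source, return_tensors="pt").input_ids
labels = tokenizer(target, return_tensors="pt").input_ids

# The encoder-decoder is trained to reconstruct the masked spans.
loss = model(input_ids=input_ids, labels=labels).loss
print(float(loss))
```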
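
## Appendix: generating multiple candidates

The snippet in "How to use" decodes a single greedy sequence. As a hedged follow-up, beam search can surface several candidate fillings for the masked span; the hyperparameters here (`num_beams=10`, `num_return_sequences=5`) are illustrative choices, not values from the paper.

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-large")
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-large")

text = "def greet(user): print(f'hello <extra_id_0>!')"
input_ids = tokenizer(text, return_tensors="pt").input_ids

# Beam search over the masked span; return the top 5 candidates.
generated_ids = model.generate(
    input_ids,
    max_length=8,
    num_beams=10,
    num_return_sequences=5,
    early_stopping=True,
)
for candidate in tokenizer.batch_decode(generated_ids, skip_special_tokens=True):
    print(candidate)
```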