metadata
library_name: transformers
datasets:
  - NamCyan/tesoro-code
base_model:
  - microsoft/codebert-base

Improving the detection of technical debt in Java source code with an enriched dataset

Model Details

Model Description

This model is part of the Tesoro project and is used to detect technical debt in source code. More information can be found on the Tesoro home page.

  • Developed by: Nam Hai Le
  • Model type: Encoder-based PLMs
  • Language(s): Java
  • Finetuned from model: CodeBERT

Model Sources

  • Repository: Tesoro
  • Paper: [To be updated]

How to Get Started with the Model

Use the code below to get started with the model.

from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("NamCyan/codebert-base-technical-debt-code-tesoro")
model = AutoModelForSequenceClassification.from_pretrained("NamCyan/codebert-base-technical-debt-code-tesoro")
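Once the tokenizer and model are loaded, a prediction can be obtained roughly as sketched below. This is a minimal illustration, not the authors' evaluation pipeline. It assumes that the input is a raw Java snippet, that the classification head is decoded by taking the highest-scoring class after a softmax, and that the label names stored in `model.config.id2label` are meaningful. The helper names `softmax`, `top_label`, and `classify` are hypothetical. If the checkpoint is configured for multi-label output, a per-class sigmoid threshold would be needed instead.

```python
import math

MODEL_ID = "NamCyan/codebert-base-technical-debt-code-tesoro"


def softmax(logits):
    """Convert a list of raw scores into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_label(logits, id2label):
    """Return (label, probability) for the highest-scoring class.

    Assumption: single-label decoding; this is not confirmed by the model card.
    """
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label.get(best, str(best)), probs[best]


def classify(java_code, model_id=MODEL_ID):
    """Hypothetical helper: load the checkpoint and score one Java snippet."""
    # Heavy imports are local so the helpers above work without them installed.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id)
    model.eval()

    # Truncate long inputs to 512 tokens, the usual limit for CodeBERT.
    inputs = tokenizer(java_code, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        logits = model(**inputs).logits[0].tolist()
    return top_label(logits, model.config.id2label)


# Example call (downloads the checkpoint on first use):
# label, prob = classify("int total = 0; // TODO: replace with a proper cache")
```

Reading the label names from `model.config.id2label` avoids hard-coding class names that the model card does not list.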

Training Details

  • Training Data: The model is fine-tuned on the tesoro-code dataset.

  • Infrastructure: Training was conducted on two NVIDIA A100 GPUs with 80 GB of VRAM.

Leaderboard

| Model | Model size | EM | F1 |
| --- | --- | --- | --- |
| Encoder-based PLMs | | | |
| CodeBERT | 125M | 38.28 | 43.47 |
| UniXCoder | 125M | 38.12 | 42.58 |
| GraphCodeBERT | 125M | 39.38 | 44.21 |
| RoBERTa | 125M | 35.37 | 38.22 |
| ALBERT | 11.8M | 39.32 | 41.99 |
| Encoder-Decoder-based PLMs | | | |
| PLBART | 140M | 36.85 | 39.90 |
| Codet5 | 220M | 32.66 | 35.41 |
| CodeT5+ | 220M | 37.91 | 41.96 |
| Decoder-based PLMs (LLMs) | | | |
| TinyLlama | 1.03B | 37.05 | 40.05 |
| DeepSeek-Coder | 1.28B | 42.52 | 46.19 |
| OpenCodeInterpreter | 1.35B | 38.16 | 41.76 |
| phi-2 | 2.78B | 37.92 | 41.57 |
| starcoder2 | 3.03B | 35.37 | 41.77 |
| CodeLlama | 6.74B | 34.14 | 38.16 |
| Magicoder | 6.74B | 39.14 | 42.49 |

Citing us

@article{nam2024tesoro,
  title={Improving the detection of technical debt in Java source code with an enriched dataset},
  author={Hai, Nam Le and Bui, Anh M. T. and Nguyen, Phuong T. and Ruscio, Davide Di and Kazman, Rick},
  journal={},
  year={2024}
}