Identifier-Renaming

Generates higher-quality variable names for code by predicting replacements for masked variable names.

Model Details

Model Description

Model type: Masked Language model
Language(s) (NLP): Coded in Python to handle Java code
Finetuned from model: GraphCodeBERT

Model Sources

Repository: https://anonymous.4open.science/r/Identifier-Renaming-653F

Uses

Input a Java code snippet in which every instance of one particular variable name has been replaced by "[MASK]".
Input the number of tokens desired in the variable name (that is, how long it should be). Alternatively, input "0" to have the number of tokens sampled at random from the training data distribution.
For best results, each code snippet should be an entire class. The output is a prediction for the masked variable name.
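The input preparation described above can be sketched in Python as follows. This is a minimal illustration, not the repository's own preprocessing: the helper names `mask_identifier` and `resolve_token_count` are hypothetical, the token-count weights are made-up placeholders rather than the real training data distribution, and the regex-based masking is naive (it would also mask matching text inside comments and string literals).

```python
import random
import re

MASK = "[MASK]"

def mask_identifier(java_source: str, name: str) -> str:
    """Replace every whole-identifier occurrence of `name` with [MASK].

    Lookarounds keep longer identifiers such as `cntTotal` intact when
    masking `cnt`. Java identifiers may contain letters, digits, `_`, `$`.
    """
    pattern = re.compile(r"(?<![\w$])" + re.escape(name) + r"(?![\w$])")
    return pattern.sub(MASK, java_source)

# ASSUMPTION: placeholder weights, NOT the model's real training distribution.
TOKEN_COUNT_WEIGHTS = {1: 0.5, 2: 0.3, 3: 0.2}

def resolve_token_count(requested: int, rng: random.Random) -> int:
    """Return the requested token count, or sample one when `requested` is 0."""
    if requested > 0:
        return requested
    counts = list(TOKEN_COUNT_WEIGHTS)
    weights = list(TOKEN_COUNT_WEIGHTS.values())
    return rng.choices(counts, weights=weights, k=1)[0]

snippet = """class Counter {
    private int cnt = 0;
    void inc() { cnt++; }
    int cntTotal() { return cnt; }
}"""

masked = mask_identifier(snippet, "cnt")
print(masked)
print(resolve_token_count(0, random.Random(0)))
```

In this sketch the method name `cntTotal` is left untouched while all three uses of `cnt` are masked, which mirrors the requirement that every instance of the target variable, and only that variable, be replaced.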

Out-of-Scope Use

The non-fine-tuned version of the model is designed for generic code completion tasks. The fine-tuned model is designed to focus solely on identifier names.
Ensure that all instances of the variable name being renamed are masked.

Bias, Risks, and Limitations

Training used only a relatively small dataset and few epochs, so the model might be under-trained.
Even when the output is correct, its syntax can occasionally be dubious.
The model is not perfect, and identifier renamings must be reviewed until its performance in test settings has been evaluated.

Recommendations

Use the model as described and verify outputs before using them.

How to Get Started with the Model

Clone the repository and load the model state dict using 'model_26_2'.

Training Details

Trained for 3 epochs with a learning rate of 2e-5 on a subset of a dataset of 1000 classes averaging 612 lines of code each.

Evaluation

227 Java classes used for evaluation

Perplexity of Base Model: 37580
Perplexity of Fine-tuned Model: 23

Metrics

Perplexity is used to evaluate the performance of the model. It measures how surprised the model is by the given text: the lower the perplexity, the higher the probability the model assigns to the correct tokens.
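For reference, perplexity is conventionally computed as the exponential of the average negative log-likelihood of the target tokens. The sketch below shows that standard definition in plain Python; the function name `perplexity` and the example probabilities are illustrative assumptions, and the repository's evaluation script may compute the value differently (for example, from a batched cross-entropy loss).

```python
import math

def perplexity(token_probs):
    """Perplexity = exp(mean negative log-probability of the target tokens).

    `token_probs` holds the probability the model assigned to each correct
    token; every value must be in (0, 1].
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)

# Illustrative values only, not measurements from this model.
uniform_guess = perplexity([0.25, 0.25, 0.25, 0.25])  # approximately 4
confident = perplexity([0.9, 0.8, 0.95])
print(uniform_guess, confident)
```

A model that spreads its probability evenly over four candidates has a perplexity of about 4, so a drop from 37580 to 23 means the fine-tuned model is far less uncertain about the masked identifier tokens than the base model.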