README.md · Fsoft-AIC/Codebert-docstring-inconsistency at 34445e4d34dedd927295c636a75282db3535eec3

metadata

language:
  - code
  - en
task_categories:
  - text-classification
tags:
  - arxiv:2305.06156
metrics:
  - accuracy
widget:
  - text: |-
      Sum two integers</s></s>def sum(a, b):
          return a + b
    example_title: Simple toy
  - text: >-
      Look for methods that might be dynamically defined and define them for
      lookup.</s></s>def respond_to_missing?(name, include_private = false)
        if name == :to_ary || name == :empty?
          false
        else
          return true if mapping(name).present?
          mounting = all_mountings.find{ |mount| mount.respond_to?(name) }
          return false if mounting.nil?
        end
      end
    example_title: Ruby example
  - text: >-
      Method that adds a candidate to the party @param c the candidate that will
      be added to the party</s></s>public void addCandidate(Candidate c)

      {
          this.votes += c.getVotes(); 
          candidates.add(c); 
      }
    example_title: Java example
  - text: |-
      we do not need Buffer pollyfill for now</s></s>function(str){
        var ret = new Array(str.length), len = str.length;
        while(len--) ret[len] = str.charCodeAt(len);
        return Uint8Array.from(ret);
      }
    example_title: JavaScript example
pipeline_tag: text-classification

Table of Contents
- Model Description
- Model Details
- Usage
- Limitations
- Additional Information
  - Licensing Information
  - Citation Information

Model Description

This model is CodeBERT fine-tuned on a 5M-sample subset of The Vault to detect inconsistency between a docstring or comment and the function it describes. It is used to filter noisy examples out of The Vault dataset.

More information:

Repository: FSoft-AI4Code/TheVault
Paper: The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation
Contact: support.ailab@fpt.com

Model Details

Developed by: Fsoft AI Center
License: Updating
Model type: Transformer-Encoder based Language Model
Architecture: BERT-base
Data set: The Vault
Tokenizer: Byte Pair Encoding
Vocabulary Size: 50265
Sequence Length: 512
Language: English and 10 programming languages (Python, Java, JavaScript, PHP, C#, C, C++, Go, Rust, Ruby)
Training details:
- Self-supervised learning, binary classification
- Positive class: Original code-docstring pair
- Negative class: Randomly paired code and docstring
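
The pairing scheme above can be sketched in a few lines. This is a minimal illustration, not the authors' training pipeline: the function name `make_training_pairs`, the label ids (1 for an original pair, 0 for a random pair), and the one-negative-per-positive ratio are all assumptions made for the example.

```python
import random


def make_training_pairs(samples, seed=0):
    """Build (docstring, code, label) triples from (docstring, code) samples.

    Label 1 marks an original pair and label 0 a randomly re-paired one.
    These label ids are an assumption for illustration only.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to build a negative pair")
    rng = random.Random(seed)
    pairs = []
    for i, (docstring, code) in enumerate(samples):
        # Positive example: the docstring that was written for this code.
        pairs.append((docstring, code, 1))
        # Negative example: a docstring taken from a different sample.
        j = rng.randrange(len(samples) - 1)
        if j >= i:
            j += 1  # skip index i so the code never gets its own docstring
        pairs.append((samples[j][0], code, 0))
    return pairs


# Hypothetical toy data
samples = [
    ("Add two numbers", "def add(a, b):\n    return a + b"),
    ("Reverse a string", "def rev(s):\n    return s[::-1]"),
    ("Return the list length", "def size(xs):\n    return len(xs)"),
]
for docstring, code, label in make_training_pairs(samples):
    print(label, "|", docstring, "|", code.splitlines()[0])
```

Random pairing only guarantees that the docstring came from a different sample. If two samples happen to share the same docstring, a negative pair can still be consistent, which is one source of label noise in this kind of setup.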

Usage

Input to the model follows the template below:

"""
Template:
<s>{docstring}</s></s>{code}</s>
"""
from transformers import AutoTokenizer

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("Fsoft-AIC/Codebert-docstring-inconsistency")

# The special tokens are already written into the string,
# so the tokenizer must not add them a second time.
text = "<s>Sum two integers</s></s>def sum(a, b):\n    return a + b</s>"
tokenized_input = tokenizer(text, add_special_tokens=False)
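
To avoid writing the special tokens by hand for every example, the template can be wrapped in a small helper. The name `build_input` is hypothetical and not part of the model repository; the sketch only performs the string formatting described by the template.

```python
def build_input(docstring: str, code: str) -> str:
    """Format a docstring/code pair as <s>{docstring}</s></s>{code}</s>."""
    return f"<s>{docstring}</s></s>{code}</s>"


text = build_input("Sum two integers", "def sum(a, b):\n    return a + b")
print(text)
```

Because the helper already inserts `<s>` and `</s>`, the resulting string should be tokenized with `add_special_tokens=False`, as in the example above.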

Using the model with JAX

from transformers import AutoTokenizer, FlaxAutoModelForSequenceClassification

# Load the Flax model
model = FlaxAutoModelForSequenceClassification.from_pretrained("Fsoft-AIC/Codebert-docstring-inconsistency")

Using the model with PyTorch

from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the PyTorch model
model = AutoModelForSequenceClassification.from_pretrained("Fsoft-AIC/Codebert-docstring-inconsistency")
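
Putting the tokenizer and the PyTorch model together, a scoring function might look like the sketch below. The helper names `softmax` and `score_pair` are hypothetical. The model card does not state which output index means "consistent", so the sketch returns probabilities keyed by whatever names are stored in `model.config.id2label`; inspect that mapping before interpreting the scores. The heavy imports sit inside the function so that the pure-Python helper can be used without `torch` installed.

```python
import math

MODEL_ID = "Fsoft-AIC/Codebert-docstring-inconsistency"


def softmax(logits):
    """Turn a list of raw logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def score_pair(docstring, code):
    """Return a {label_name: probability} dict for one docstring/code pair."""
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
    model.eval()

    # Special tokens are written by hand, following the template in this README.
    text = f"<s>{docstring}</s></s>{code}</s>"
    encoded = tokenizer(
        text,
        add_special_tokens=False,
        truncation=True,  # note: truncating a long input drops the final </s>
        max_length=512,
        return_tensors="pt",
    )
    with torch.no_grad():
        logits = model(**encoded).logits[0].tolist()

    # The meaning of each index is an open question here; read it from the config.
    return {model.config.id2label[i]: p for i, p in enumerate(softmax(logits))}


if __name__ == "__main__":
    # Downloads the model weights on first run.
    print(score_pair("Sum two integers", "def sum(a, b):\n    return a + b"))
```

For batch filtering, load the tokenizer and model once outside the function instead of on every call.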

Limitations

This model is trained on a 5M-sample subset of The Vault in a self-supervised manner. Because the negative samples are generated artificially, the model may be limited in identifying inconsistencies that require a deep semantic understanding of both the code and the docstring.

The model is hard to evaluate because no labeled dataset is available. As a substitute, ChatGPT is used as a reference, and the correlation between the model's scores and ChatGPT's scores is measured. However, this result could be influenced by ChatGPT's potential biases and by ambiguous cases. We therefore recommend building a human-labeled dataset and fine-tuning this model on it to achieve the best results.
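
As an illustration of this kind of comparison, the sketch below computes a Pearson correlation between two lists of scores. The model card does not say which correlation measure was used, so Pearson is an assumption, and the score values are hypothetical placeholders rather than real results.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical placeholder scores for the same five docstring/code pairs
model_scores = [0.91, 0.12, 0.78, 0.35, 0.66]
reference_scores = [0.85, 0.20, 0.70, 0.50, 0.60]
print(f"Pearson r = {pearson(model_scores, reference_scores):.3f}")
```

The same function can be reused with human labels in place of the reference scores once a labeled set exists.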

Additional Information

Licensing Information

[More information needed]

Citation Information

@article{thevault,
  title={The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation},
  author={},
  journal={arXiv preprint arXiv:2305.06156},
  pages={},
  year={2023}
}
