claudios/VulBERTa-MLP-Devign

@@ -41,17 +41,17 @@ tags:
 ---
 # VulBERTa MLP Devign
-## VulBERTa: Simplified Source Code Pre-Training for Vulnerability Detection
 ![VulBERTa architecture](https://raw.githubusercontent.com/ICL-ml4csec/VulBERTa/main/VB.png)
 ## Overview
-This model is the unofficial HuggingFace version of "VulBERTa" with an MLP classification head, trained on CodeXGLUE Devign, by Hazim Hanif & Sergio Maffeis (Imperial College London).
 > This paper presents VulBERTa, a deep learning approach to detect security vulnerabilities in source code. Our approach pre-trains a RoBERTa model with a custom tokenisation pipeline on real-world code from open-source C/C++ projects. The model learns a deep knowledge representation of the code syntax and semantics, which we leverage to train vulnerability detection classifiers. We evaluate our approach on binary and multi-class vulnerability detection tasks across several datasets (Vuldeepecker, Draper, REVEAL and muVuldeepecker) and benchmarks (CodeXGLUE and D2A). The evaluation results show that VulBERTa achieves state-of-the-art performance and outperforms existing approaches across different datasets, despite its conceptual simplicity, and limited cost in terms of size of training data and number of model parameters.
 ## Usage
-*You must install libclang for tokenization.*
 ```bash
 pip install libclang
@@ -67,6 +67,8 @@ pipe("static void filter_mirror_setup(NetFilterState *nf, Error **errp)\n{\n
   {'label': 'LABEL_1', 'score': 0.985314130783081}]]
 ```
 ## Data
 We provide all data required by VulBERTa.
 This includes:
@@ -85,18 +87,6 @@ This includes:
 Please refer to the [models](https://github.com/ICL-ml4csec/VulBERTa/tree/main/models "models") directory for further instructions and details.
-## Pre-requisites and requirements
-In general, we used these package versions when running the experiments:
- - Python 3.8.5
- - Pytorch 1.7.0
- - Transformers 4.4.1
- - Tokenizers 0.10.1
- - Libclang (any version > 12.0 should work; see https://pypi.org/project/libclang/)
-For an exhaustive list of all the packages, please refer to the [requirements.txt](https://github.com/ICL-ml4csec/VulBERTa/blob/main/requirements.txt "requirements.txt") file.
 ## How to use
 In our project, we use JupyterLab notebooks to run experiments.
@@ -107,9 +97,6 @@ Therefore, we separate each task into different notebook:
  - [Evaluation_VulBERTa-MLP.ipynb](https://github.com/ICL-ml4csec/VulBERTa/blob/main/Evaluation_VulBERTa-MLP.ipynb "Evaluation_VulBERTa-MLP.ipynb") - Evaluates the fine-tuned VulBERTa-MLP models on the testing set of a specific vulnerability detection dataset.
  - [Finetuning+evaluation_VulBERTa-CNN](https://github.com/ICL-ml4csec/VulBERTa/blob/main/Finetuning%2Bevaluation_VulBERTa-CNN.ipynb "Finetuning+evaluation_VulBERTa-CNN.ipynb") - Fine-tunes VulBERTa-CNN models and evaluates them on the testing set of a specific vulnerability detection dataset.
-## Running VulBERTa-CNN or VulBERTa-MLP on arbitrary code
-Coming soon!
 ## Citation

 ---
 # VulBERTa MLP Devign
+## [VulBERTa: Simplified Source Code Pre-Training for Vulnerability Detection](https://github.com/ICL-ml4csec/VulBERTa/tree/main)
 ![VulBERTa architecture](https://raw.githubusercontent.com/ICL-ml4csec/VulBERTa/main/VB.png)
 ## Overview
+This model is the unofficial HuggingFace version of "[VulBERTa](https://github.com/ICL-ml4csec/VulBERTa/tree/main)" with an MLP classification head, trained on CodeXGLUE Devign (C code), by Hazim Hanif & Sergio Maffeis (Imperial College London). I simplified the tokenization process by adding the cleaning (comment removal) step to the tokenizer and added the simplified tokenizer to this model repo as an AutoClass.
 > This paper presents VulBERTa, a deep learning approach to detect security vulnerabilities in source code. Our approach pre-trains a RoBERTa model with a custom tokenisation pipeline on real-world code from open-source C/C++ projects. The model learns a deep knowledge representation of the code syntax and semantics, which we leverage to train vulnerability detection classifiers. We evaluate our approach on binary and multi-class vulnerability detection tasks across several datasets (Vuldeepecker, Draper, REVEAL and muVuldeepecker) and benchmarks (CodeXGLUE and D2A). The evaluation results show that VulBERTa achieves state-of-the-art performance and outperforms existing approaches across different datasets, despite its conceptual simplicity, and limited cost in terms of size of training data and number of model parameters.
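To make the cleaning (comment removal) step mentioned in the Overview concrete, here is a minimal sketch of stripping comments from C code with a regular expression. It is an illustration only and is not taken from this repo's tokenizer: the function name `strip_c_comments` and the regex approach are assumptions, and it does not handle line comments continued with a trailing backslash.

```python
import re

# Hypothetical helper, not the tokenizer shipped with this model.
# String and character literals are matched too, so that comment markers
# inside them are left untouched.
_TOKEN = re.compile(
    r"//.*?$"                # line comment, up to the end of the line
    r"|/\*.*?\*/"            # block comment, possibly spanning lines
    r"|'(?:\\.|[^\\'])*'"    # character literal
    r'|"(?:\\.|[^\\"])*"',   # string literal
    re.DOTALL | re.MULTILINE,
)


def strip_c_comments(code: str) -> str:
    """Replace each C comment with a single space and keep literals as-is."""

    def _replace(match: re.Match) -> str:
        text = match.group(0)
        # Comments start with '/'; literals start with a quote character.
        return " " if text.startswith("/") else text

    return _TOKEN.sub(_replace, code)


snippet = 'int x = 1; /* init */ char *s = "// kept";  // trailing note\n'
print(strip_c_comments(snippet))
```

Replacing a comment with a space rather than an empty string keeps neighbouring tokens from being glued together, for example in `a/**/b`.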
 ## Usage
+**You must install libclang for tokenization.**
 ```bash
 pip install libclang
   {'label': 'LABEL_1', 'score': 0.985314130783081}]]
 ```
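The pipeline output shown above is a nested list with one list of label/score dictionaries per input. A minimal sketch of turning that into a single verdict per input might look like the following. The helper name `summarize` is hypothetical, the `LABEL_0` score in the sample is illustrative rather than taken from a real run, and treating `LABEL_1` as "vulnerable" is an assumption based on the Devign convention (target 1 marks vulnerable functions); check the model's `id2label` config before relying on it.

```python
def summarize(pipeline_output, positive_label="LABEL_1"):
    """Pick the highest-scoring label for each input.

    `positive_label` is assumed to mean "vulnerable"; verify this
    against the model config before using it in practice.
    """
    summaries = []
    for scores in pipeline_output:
        best = max(scores, key=lambda item: item["score"])
        summaries.append(
            {
                "label": best["label"],
                "score": best["score"],
                "flagged": best["label"] == positive_label,
            }
        )
    return summaries


# Sample shaped like the output above; the LABEL_0 entry is illustrative.
sample_output = [[
    {"label": "LABEL_0", "score": 0.014685869216918945},
    {"label": "LABEL_1", "score": 0.985314130783081},
]]
print(summarize(sample_output))
```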
+***
 ## Data
 We provide all data required by VulBERTa.
 This includes:
 Please refer to the [models](https://github.com/ICL-ml4csec/VulBERTa/tree/main/models "models") directory for further instructions and details.
 ## How to use
 In our project, we use JupyterLab notebooks to run experiments.
  - [Evaluation_VulBERTa-MLP.ipynb](https://github.com/ICL-ml4csec/VulBERTa/blob/main/Evaluation_VulBERTa-MLP.ipynb "Evaluation_VulBERTa-MLP.ipynb") - Evaluates the fine-tuned VulBERTa-MLP models on the testing set of a specific vulnerability detection dataset.
  - [Finetuning+evaluation_VulBERTa-CNN](https://github.com/ICL-ml4csec/VulBERTa/blob/main/Finetuning%2Bevaluation_VulBERTa-CNN.ipynb "Finetuning+evaluation_VulBERTa-CNN.ipynb") - Fine-tunes VulBERTa-CNN models and evaluates them on the testing set of a specific vulnerability detection dataset.
 ## Citation