---
license: apache-2.0
datasets:
- lambdasec/cve-single-line-fixes
- lambdasec/gh-top-1000-projects-vulns
language:
- code
tags:
- code
programming_language:
- Java
- JavaScript
- Python
inference: false
model-index:
- name: SantaFixer
  results:
  - task:
      type: text-generation
    dataset:
      type: openai/human-eval-infilling
      name: HumanEval
    metrics:
    - name: single-line infilling pass@1
      type: pass@1
      value: 0.47
      verified: false
    - name: single-line infilling pass@10
      type: pass@10
      value: 0.74
      verified: false
  - task:
      type: text-generation
    dataset:
      type: lambdasec/gh-top-1000-projects-vulns
      name: GH Top 1000 Projects Vulnerabilities
    metrics:
    - name: pass@1 (Java)
      type: pass@1
      value: 0.26
      verified: false
    - name: pass@10 (Java)
      type: pass@10
      value: 0.48
      verified: false
    - name: pass@1 (Python)
      type: pass@1
      value: 0.31
      verified: false
    - name: pass@10 (Python)
      type: pass@10
      value: 0.56
      verified: false
    - name: pass@1 (JavaScript)
      type: pass@1
      value: 0.36
      verified: false
    - name: pass@10 (JavaScript)
      type: pass@10
      value: 0.62
      verified: false
---

# Model Card for SantaFixer

SantaFixer is an LLM for code that focuses on generating bug fixes using infilling.

## Model Details

### Model Description

- **Developed by:** [codelion](https://huggingface.co/codelion)
- **Model type:** GPT-2
- **Finetuned from model:** [bigcode/santacoder](https://huggingface.co/bigcode/santacoder)


## How to Get Started with the Model

Use the code below to get started with the model.

```python
# pip install -q transformers torch
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "lambdasec/santafixer"
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint,
              trust_remote_code=True).to(device)

# Fill-in-the-middle prompt: the model generates the code that belongs
# between the prefix and the suffix.
input_text = ("<fim-prefix>def print_hello_world():\n"
              "<fim-suffix>\n print('Hello world!')"
              "<fim-middle>")
inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))
```
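
To request a fix for a specific line, the prompt can be built by cutting that line out of the source and placing the surrounding code on either side of the infilling tokens. The helper below is a minimal sketch of that idea, not part of the model's API: `build_fim_prompt` and its `buggy_line` parameter are hypothetical names, and it assumes the location of the bug is already known (for example, from a static analyzer).

```python
FIM_PREFIX = "<fim-prefix>"
FIM_SUFFIX = "<fim-suffix>"
FIM_MIDDLE = "<fim-middle>"


def build_fim_prompt(source: str, buggy_line: int) -> str:
    """Build an infilling prompt that asks the model to rewrite one line.

    `buggy_line` is a 1-based line number (hypothetical parameter). That line
    is dropped from the prompt; the code before it becomes the prefix and the
    code after it becomes the suffix.
    """
    lines = source.splitlines(keepends=True)
    if not 1 <= buggy_line <= len(lines):
        raise ValueError("buggy_line is outside the source")
    prefix = "".join(lines[:buggy_line - 1])
    suffix = "".join(lines[buggy_line:])
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


# Illustrative input: line 3 is the line we want the model to regenerate.
code = (
    "import hashlib\n"
    "def digest(data):\n"
    "    return hashlib.md5(data).hexdigest()\n"
)
prompt = build_fim_prompt(code, buggy_line=3)
print(prompt)
```

The resulting string can be passed to `tokenizer.encode` in place of `input_text` above; the text generated after `<fim-middle>` is the candidate replacement line.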

## Training Details

- **GPU:** Tesla P100
- **Time:** ~5 hrs

### Training Data

The model was fine-tuned on the [CVE single line fixes dataset](https://huggingface.co/datasets/lambdasec/cve-single-line-fixes).

### Training Procedure 

Supervised fine-tuning (SFT).

#### Training Hyperparameters

- **optim:** adafactor
- **gradient_accumulation_steps:** 4
- **gradient_checkpointing:** true
- **fp16:** false
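
If the run used the Hugging Face `Trainer` (an assumption; the card does not name the training framework), the settings above would map onto `TrainingArguments` roughly as follows. Only the four listed values come from this card; `build_training_arguments` and `output_dir` are hypothetical, and every other argument is left at its library default.

```python
# The four hyperparameters listed in this card, as keyword arguments.
training_kwargs = {
    "optim": "adafactor",
    "gradient_accumulation_steps": 4,
    "gradient_checkpointing": True,
    "fp16": False,
}


def build_training_arguments(output_dir: str):
    """Create TrainingArguments from the listed settings (hypothetical helper)."""
    # Imported lazily so the dictionary above can be inspected without
    # having transformers installed.
    from transformers import TrainingArguments

    return TrainingArguments(output_dir=output_dir, **training_kwargs)
```

Adafactor and gradient checkpointing both trade some speed for lower memory use, which is consistent with training on a single Tesla P100.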

## Evaluation

The model was evaluated on the [GitHub top 1000 projects vulnerabilities dataset](https://huggingface.co/datasets/lambdasec/gh-top-1000-projects-vulns).
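
The reported metrics are pass@1 and pass@10. The card does not state how they were computed; the sketch below assumes the commonly used unbiased estimator, where `n` samples are generated per problem, `c` of them are correct, and pass@k is the probability that at least one of `k` randomly drawn samples is correct. The function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: samples generated for a problem
    c: samples that passed
    k: number of samples the metric allows
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        # Fewer than k failing samples: any draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative numbers only, not results from this model.
print(pass_at_k(n=10, c=3, k=1))
print(pass_at_k(n=10, c=3, k=10))
```

The per-problem values are then averaged over the dataset to give the overall score.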