File size: 2,028 Bytes
bcbf898
 
 
 
 
 
12c19cf
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
---
datasets:
- nuprl/EditPackFT-Multi
tags:
- code
---
# What is this
This is a deepseek coder 7b model trained to predict commit messages for a diff.

# Languages trained on:
```py
LANGS = [
    "Python",
    "Rust",
    "JavaScript",
    "Java",
    "Go",
    "C++",
    "C#",
    "Ruby",
    "PHP",
    "TypeScript",
    "C",
    "Scala",
    "Swift",
    "Kotlin",
    "Objective-C",
    "Perl",
    "Haskell",
    "Bash",
    "Sh",
    "Lua",
    "R",
    "Julia",
]
```

# How to prompt:
```python
import difflib
class NDiff:
    def __init__(self, s1, s2):
        self.s1 = s1
        self.s2 = s2
        self.diff = difflib.ndiff(s1.split("\n"), s2.split("\n"))

    def __str__(self):
        return "\n".join([l for l in self.diff if l[0] != "?"])

    def str_colored(self):
        import colored

        buf = ""
        for l in self.diff:
            if l[0] == "?":
                continue
            if l[0] == "-":
                buf += colored.stylize(l, colored.fg("red"))
            elif l[0] == "+":
                buf += colored.stylize(l, colored.fg("green"))
            else:
                buf += l
            buf += "\n"
        return buf

    def num_removed(self):
        return len([l for l in self.diff if l[0] == "-"])

    def num_added(self):
        return len([l for l in self.diff if l[0] == "+"])

    def __repr__(self):
        return self.__str__()

def format_prompt(old, new):
    diff_header = "<diff>"
    instr_header = "<commit_message>"
    diff = str(NDiff(old, new))
    return f"{diff_header}\n{diff}\n{instr_header}\n"

def gen(old, new, max_new_tokens=200, temperature=0.45, top_p=0.90):
    prompt = format_prompt(old, new)
    toks = tokenizer.encode(prompt, return_tensors="pt").to(model.device)
    outs = model.generate(toks, max_new_tokens=max_new_tokens, do_sample=True, temperature=temperature, top_p=top_p)
    return [tokenizer.decode(out[len(toks[0]):], skip_special_tokens=True) for out in outs]
```

use the "gen" function with the old and new code