Reverb committed on
Commit
4f4b56f
1 Parent(s): 9082184

Update README.md

Files changed (1)
  1. README.md +48 -0
README.md CHANGED
@@ -1,3 +1,51 @@
  ---
  license: apache-2.0
  ---
+
+ # GPyT Project
+ GPyT is a GPT-2 model trained from scratch (not fine-tuned) on Python code from GitHub. The training set was ~200GB of pure
+ Python code. The current GPyT model has made only 2 epochs through this data, so it may benefit greatly from continued training and/or fine-tuning.
+
+ Newlines are replaced by `<N>`.
+
+ Input to the model is code, up to the context length of 1024 tokens, with newlines replaced by `<N>`.
+
+ Here's a quick example of using this model:
+
+ ```py
+ from transformers import AutoTokenizer, AutoModelForCausalLM  # replaces the deprecated AutoModelWithLMHead
+
+ tokenizer = AutoTokenizer.from_pretrained("Sentdex/GPyT")
+ model = AutoModelForCausalLM.from_pretrained("Sentdex/GPyT")
+
+ # copy and paste some code in here
+ inp = """import"""
+
+ newlinechar = "<N>"  # the model expects newlines encoded as <N>
+ converted = inp.replace("\n", newlinechar)
+ tokenized = tokenizer.encode(converted, return_tensors='pt')
+ resp = model.generate(tokenized)  # uses the default generation settings
+
+ decoded = tokenizer.decode(resp[0])
+ reformatted = decoded.replace(newlinechar, "\n")  # restore real newlines
+
+ print(reformatted)
+ ```
+
+ This should produce output like:
+
+ ```py
+ import numpy as np
+ import pytest
+
+ import pandas as pd<N
+ ```
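
The newline convention and the 1024-token limit described above can be wrapped in a few small helpers. This is a minimal sketch, not part of the GPyT release: the names `encode_newlines`, `decode_newlines`, and `clip_to_context` are hypothetical, and keeping the most recent tokens when clipping is an assumption about what suits code completion.

```python
NEWLINE_MARKER = "<N>"   # marker the README says stands in for newlines
CONTEXT_LENGTH = 1024    # context length stated in the README


def encode_newlines(code: str) -> str:
    """Replace real newlines with the marker before tokenizing."""
    return code.replace("\n", NEWLINE_MARKER)


def decode_newlines(text: str) -> str:
    """Turn the marker back into real newlines after decoding."""
    return text.replace(NEWLINE_MARKER, "\n")


def clip_to_context(token_ids: list, max_len: int = CONTEXT_LENGTH) -> list:
    """Keep only the last max_len token ids (assumption: recent code matters most)."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    return token_ids[-max_len:]


if __name__ == "__main__":
    snippet = "import os\nprint(os.getcwd())"
    encoded = encode_newlines(snippet)
    print(encoded)
    print(decode_newlines(encoded) == snippet)
```

Two caveats: source code that already contains the literal text `<N>` will not survive the round trip, and a prompt clipped to the full context length leaves no room for generated tokens, so a smaller `max_len` is usually what you want in practice.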
+
+ This model does much more than just imports, however. For more examples and a better understanding of the model's capabilities, see: https://pythonprogramming.net/GPT-python-code-transformer-model-GPyT/
+
+ ## Considerations:
+
+ > - This model is intended for educational and research use only. Do not trust model outputs.
+ > - The model is highly likely to regurgitate code almost exactly as it saw it during training. It's up to you to determine licensing if you intend to actually use the generated code.
+ > - All Python code was blindly pulled from GitHub. This means the training data mixes Python 2 and 3, along with subtler inconsistencies, such as indentation being 2 spaces in some cases and 4 in others.
+ > - Along with the above, this means the generated code could wind up doing or suggesting just about anything. Run generated code at your own risk.
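
Because the training data mixes Python 2 and 3, one cheap precaution is to check whether generated text even parses under the interpreter you are running, without executing it. The sketch below is an illustration only: the helper name `parses_as_python3` is hypothetical, and a successful parse says nothing about whether the code is safe or correct.

```python
import ast


def parses_as_python3(source: str) -> bool:
    """Return True if `source` parses with the running interpreter's grammar.

    ast.parse only builds a syntax tree; it never executes the code.
    """
    try:
        ast.parse(source)
    except (SyntaxError, ValueError):
        # SyntaxError: invalid for this Python version (e.g. Python 2 print statements).
        # ValueError: raised by some Python versions for input such as null bytes.
        return False
    return True


if __name__ == "__main__":
    print(parses_as_python3("import numpy as np\n"))
    print(parses_as_python3("print 'hello'\n"))
```

A check like this can filter out Python 2 leftovers and truncated generations before you read the output more closely.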