---
language:
- en
license: apache-2.0
tags:
- solidity
- web3
- code generation
- smart contract
widget:
- text: "pragma solidity ^0.5.7;\n// Context: ParentA | Functions: helloA helloB | Constants: constantA \ncontract HelloWorld is ParentA {"
---
# A code generation T5 model for Solidity (web3 smart contracts)
- See https://github.com/hululuzhu/solidity-t5 for more context
## How to use this trained model
- A hello-world example follows. Note that the input `text` consists of:
  - A Solidity version header, e.g. `pragma solidity ^0.5.7`
  - Ancestor class/library info, e.g. public functions and constants from `ParentA`
  - A contract/library/interface declaration header, e.g. `HelloWorld`, ending with `{`
- Alternatively, use the test widget on the right side of the page; note that the quality is known to be worse there because no decoding params are applied
```python
# !pip install transformers -q
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

# Use the GPU when available, otherwise fall back to CPU
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
tokenizer = AutoTokenizer.from_pretrained("hululuzhu/solidity-t5")
model = T5ForConditionalGeneration.from_pretrained("hululuzhu/solidity-t5").to(DEVICE)

# Input: pragma line, ancestor context comment, then the declaration ending with `{`
text = """pragma solidity ^0.5.7;
// Context: ParentA | Functions: helloA helloB | Constants: constantA
contract HelloWorld is ParentA {"""
input_ids = tokenizer(text, return_tensors="pt", truncation=True).input_ids.to(DEVICE)

# Tune the beam/top_k/top_p params to get a good outcome.
# Note: top_p and top_k only take effect when do_sample=True is also passed.
generated_ids = model.generate(input_ids, max_length=256, num_beams=5, top_p=0.95, top_k=50)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))

# Expected outcome (abridged)
"""
string public constant name = "Hello World";
...
uint256 public constant override returns (uint256) {
return initialSupply;
}
function initialSupply() public view returns (uint256) {
...
"""
```
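The three input parts listed above can be assembled programmatically. The following is a minimal sketch, not part of the model's own tooling: `build_prompt` and its parameter names are hypothetical, and the exact whitespace and the choice to omit the context line when there is no parent are assumptions based on the samples in this card.

```python
def build_prompt(pragma, declaration, parent=None, functions=(), constants=()):
    """Assemble the model input text (hypothetical helper, format assumed).

    pragma:      version spec, e.g. "^0.5.7"
    declaration: declaration header without the brace,
                 e.g. "contract HelloWorld is ParentA"
    parent:      ancestor contract/library name, or None if there is none
    functions:   public function names of the ancestor
    constants:   constant names of the ancestor
    """
    lines = [f"pragma solidity {pragma};"]
    if parent is not None:
        context = (
            f"// Context: {parent}"
            f" | Functions: {' '.join(functions)}"
            f" | Constants: {' '.join(constants)}"
        )
        # rstrip avoids a trailing space when `constants` is empty
        lines.append(context.rstrip())
    # The declaration header must end with the opening brace
    lines.append(f"{declaration} {{")
    return "\n".join(lines)


text = build_prompt(
    pragma="^0.5.7",
    declaration="contract HelloWorld is ParentA",
    parent="ParentA",
    functions=["helloA", "helloB"],
    constants=["constantA"],
)
print(text)
```

The resulting `text` can be passed to the tokenizer exactly like the hand-written string in the example above.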
## Background
- Base T5 code model: https://huggingface.co/Salesforce/codet5-large
- Source data: https://huggingface.co/datasets/mwritescode/slither-audited-smart-contracts
- Processing steps: clean the source, segment it at the contract level, and split each contract into an input part and an output part
- Input sample after processing
```
pragma solidity 0.5.7;
// Context: PauserRole | Functions: isPauser addPauser renouncePauser | Constants:
contract Pausable is PauserRole {
```
- Output sample after processing (**note that the indentation is stripped; this is intentional to reduce the token count**)
```
event Paused(address account);
event Unpaused(address account);
bool private _pausableActive;
bool private _paused;
constructor () internal {
_paused = false;
}
function paused() public view returns (bool) {
return _paused;
}
modifier whenNotPaused() {
require(!_paused);
_;
}
modifier whenPaused() {
require(_paused);
_;
}
function pause() public onlyPauser whenNotPaused whenPausableActive {
_paused = true;
emit Paused(msg.sender);
}
function unpause() public onlyPauser whenPaused whenPausableActive {
_paused = false;
emit Unpaused(msg.sender);
}
function _setPausableActive(bool _active) internal {
_pausableActive = _active;
}
modifier whenPausableActive() {
require(_pausableActive);
_;
}
}
```
- Source training code: see the [end-to-end notebook](https://github.com/hululuzhu/solidity-t5/blob/main/code/Solidity_T5_Data_Processing_and_Training.ipynb) in the repository's `code` directory
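The split into input and output and the indentation stripping described above can be illustrated with a small sketch. This is not the notebook's code: `split_contract`, `strip_indentation`, and `reindent` are hypothetical helpers, and the sketch assumes the first `{` in the segment is the contract's opening brace (no braces in preceding comments).

```python
def split_contract(source):
    """Split one contract's source at its first opening brace.

    Returns (header, body): header ends with "{", body is everything after it.
    """
    brace = source.index("{")
    header = source[: brace + 1].strip()
    body = source[brace + 1 :]
    return header, body


def strip_indentation(body):
    """Strip whitespace around each line and drop blank lines to save tokens."""
    stripped = (line.strip() for line in body.splitlines())
    return "\n".join(line for line in stripped if line)


def reindent(flat, depth=1, width=4):
    """Restore indentation by brace depth (for reading flattened output).

    Only handles lines that start with "}" or end with "{".
    """
    out = []
    for line in flat.splitlines():
        if line.startswith("}"):
            depth = max(depth - 1, 0)
        out.append(" " * (width * depth) + line)
        if line.endswith("{"):
            depth += 1
    return "\n".join(out)


source = """contract Pausable is PauserRole {
    bool private _paused;

    function paused() public view returns (bool) {
        return _paused;
    }
}
"""

header, body = split_contract(source)
target = strip_indentation(body)
print(header)            # model input side (declaration header)
print(target)            # model output side (flattened body)
print(reindent(target))  # body with indentation restored
```

`reindent` is only a convenience for reading generated code; a real Solidity formatter would be a more robust choice.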
## Future TODO
- The model is significantly under-trained because of a limited GPU budget; a full training run needs roughly 10x the Colab resources (~$100)
- The current setup limits how the model can be used. We could switch to a GPT-2 style decoder-only model for comparison, although CodeT5 has the advantage of being strongly optimized for code
- More classifiers (T5- or BERT-like) are needed to detect potential defects