---
language: 
  - en
license: apache-2.0
tags:
  - solidity
  - web3
  - code generation
  - smart contract
widget:
- text: "pragma solidity ^0.5.7;\n// Context: ParentA | Functions: helloA helloB | Constants: constantA \ncontract HelloWorld is ParentA {"
---

# A code generation T5 model for Solidity (web3 smart contracts)
- See https://github.com/hululuzhu/solidity-t5 for more context

## How to use this trained model
- The hello-world example below shows how to use the model. Note that the input `text` includes:
  - The Solidity version header, e.g. `pragma solidity ^0.5.7`
  - Ancestor class/library info, e.g. public functions and constants from `ParentA`
  - The contract/library/interface declaration header, e.g. `HelloWorld`, ending with `{`
- Alternatively, try the test widget on the right side of the page. Its output quality
  is known to be worse because it does not apply the decoding params

```python
# !pip install transformers -q

import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

# Use the GPU when available, otherwise fall back to CPU
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
tokenizer = AutoTokenizer.from_pretrained("hululuzhu/solidity-t5")
model = T5ForConditionalGeneration.from_pretrained("hululuzhu/solidity-t5").to(DEVICE)

text = """pragma solidity ^0.5.7;
// Context: ParentA | Functions: helloA helloB | Constants: constantA 
contract HelloWorld is ParentA {"""
input_ids = tokenizer(text, return_tensors="pt", truncation=True).input_ids.to(DEVICE)

# Tune the beam/top_k/top_p params to get good results.
# Note: in transformers, top_p and top_k only take effect when do_sample=True.
generated_ids = model.generate(input_ids, max_length=256, num_beams=5, top_p=0.95, top_k=50)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))

# Expected output
"""
string public constant name = "Hello World";
...
uint256 public constant override returns (uint256) {
return initialSupply;
}
function initialSupply() public view returns (uint256) {
...
"""
```
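
The three-part input described above can also be assembled programmatically. The following is a minimal sketch, not part of the original repository: `build_prompt` is a hypothetical helper, and its exact spacing (each name followed by one space, including the trailing space after the constants list) is an assumption inferred from the samples in this card. It only handles a single parent.

```python
from typing import Sequence


def build_prompt(
    pragma: str,
    declaration: str,
    parent: str,
    functions: Sequence[str] = (),
    constants: Sequence[str] = (),
) -> str:
    """Assemble the model input text (hypothetical helper).

    The layout mirrors the samples in this card; the whitespace
    handling is an assumption, not a documented specification.
    """
    # Each name is followed by a single space, as in the samples.
    funcs = "".join(f"{name} " for name in functions)
    consts = "".join(f"{name} " for name in constants)
    context = f"// Context: {parent} | Functions: {funcs}| Constants: {consts}"
    # The declaration line ends with the opening brace of the contract body.
    return f"pragma solidity {pragma};\n{context}\n{declaration} {{"


prompt = build_prompt(
    pragma="^0.5.7",
    declaration="contract HelloWorld is ParentA",
    parent="ParentA",
    functions=["helloA", "helloB"],
    constants=["constantA"],
)
print(prompt)
```

The resulting string can be passed to the tokenizer in place of the hand-written `text` in the example above.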

## Background
- Base T5 code model: https://huggingface.co/Salesforce/codet5-large
- Source data: https://huggingface.co/datasets/mwritescode/slither-audited-smart-contracts
  - Processing steps: clean, contract-level segmentation, split into input and output
  - Input sample after processing

    ```
    pragma solidity 0.5.7;
    // Context: PauserRole | Functions: isPauser addPauser renouncePauser | Constants: 
    contract Pausable is PauserRole {
    ```

  - Output sample after processing (**note that the indentation is removed; this is intentional to reduce token count**)

    ```
    event Paused(address account);
    event Unpaused(address account);
    bool private _pausableActive;
    bool private _paused;
    constructor () internal {
    _paused = false;
    }
    function paused() public view returns (bool) {
    return _paused;
    }
    modifier whenNotPaused() {
    require(!_paused);
    _;
    }
    modifier whenPaused() {
    require(_paused);
    _;
    }
    function pause() public onlyPauser whenNotPaused whenPausableActive {
    _paused = true;
    emit Paused(msg.sender);
    }
    function unpause() public onlyPauser whenPaused whenPausableActive {
    _paused = false;
    emit Unpaused(msg.sender);
    }
    function _setPausableActive(bool _active) internal {
    _pausableActive = _active;
    }
    modifier whenPausableActive() {
    require(_pausableActive);
    _;
    }
    }
    ```
- Source training code: see the [end-to-end notebook](https://github.com/hululuzhu/solidity-t5/blob/main/code/Solidity_T5_Data_Processing_and_Training.ipynb) in the repo's `code` directory
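
Because the training targets have their indentation stripped, generated code comes out flat as well. A minimal sketch for restoring readable indentation is shown below. `reindent` is a hypothetical helper, not part of the original repository, and it assumes braces only appear as block delimiters (braces inside string literals or comments are not handled).

```python
def reindent(flat_code: str, indent: str = "    ") -> str:
    """Re-indent flat Solidity-like code by tracking brace depth.

    Hypothetical post-processing helper; a naive sketch that counts
    "{" and "}" per line and ignores strings and comments.
    """
    depth = 0
    lines = []
    for raw in flat_code.splitlines():
        line = raw.strip()
        if not line:
            lines.append("")
            continue
        starts_with_close = line.startswith("}")
        # A leading "}" closes a block before the line itself is emitted.
        if starts_with_close:
            depth = max(depth - 1, 0)
        lines.append(indent * depth + line)
        # Remaining braces on the line set the depth for following lines.
        opens = line.count("{")
        closes = line.count("}") - (1 if starts_with_close else 0)
        depth = max(depth + opens - closes, 0)
    return "\n".join(lines)


flat = """function paused() public view returns (bool) {
return _paused;
}
}"""
pretty = reindent(flat)
print(pretty)
```

The final `}` in the sample closes the contract whose opening brace is part of the input text, so it stays at depth zero.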

## Future TODO
- The model is significantly under-trained because of a limited GPU budget; a full training run needs about 10x the Colab resources (~$100)
- The current setup limits how the model can be used. We could switch to a GPT-2-style decoder-only model for comparison, although CodeT5 is strongly optimized for code
- More classifiers (T5- or BERT-like) are needed to detect potential defects