|
CodeGen is a model for conversational program synthesis, in which each problem is solved interactively over multiple steps. Each step consists of a natural language specification from the user and a synthesized subprogram from the system. |
|
|
|
It was trained sequentially on three datasets: |
|
- The Pile |
|
- A 341GB subset of Google’s [BigQuery dataset](https://cloud.google.com/blog/topics/public-datasets/github-on-bigquery-analyze-all-the-open-source-code) of code files, restricted to six programming languages: C, C++, Go, Java, JavaScript, and Python |
|
- 217GB of Python data from GitHub repositories |
|
|
|
The second and third datasets used the following preprocessing: |
|
- Exact match deduplication |
|
- Filtering, which kept only files meeting all of the following: |
|
  - Average line length < 100 characters |
|
  - Maximum line length < 1000 characters |
|
  - No more than 90% of the characters being decimal or hexadecimal digits |
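
The preprocessing steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the function names `passes_filters` and `preprocess` are hypothetical, line lengths are assumed to be measured in characters, and a file is assumed to be dropped when more than 90% of its characters (whitespace included) are decimal or hexadecimal digits.

```python
import hashlib
import string

# Characters counted as decimal or hexadecimal digits (0-9, a-f, A-F).
HEX_DIGITS = set(string.hexdigits)


def passes_filters(text, max_avg_len=100, max_line_len=1000, max_digit_frac=0.9):
    """Return True if a file satisfies all three filters (thresholds are assumptions)."""
    lines = text.splitlines()
    if not lines:
        return False  # empty files are dropped in this sketch
    lengths = [len(line) for line in lines]
    if sum(lengths) / len(lengths) >= max_avg_len:
        return False  # average line length too large
    if max(lengths) >= max_line_len:
        return False  # at least one very long line
    digit_frac = sum(c in HEX_DIGITS for c in text) / len(text)
    return digit_frac <= max_digit_frac  # drop files that are mostly digits


def preprocess(files):
    """Exact-match deduplication followed by filtering, preserving input order."""
    seen = set()
    kept = []
    for text in files:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate of an earlier file
        seen.add(digest)
        if passes_filters(text):
            kept.append(text)
    return kept
```

Hashing each file's contents keeps the deduplication step memory-friendly on large corpora, since only fixed-size digests are stored rather than the files themselves.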
|
|
|
**Remark**: |
|
The reported data sizes are after preprocessing. |