[CodeGen](https://huggingface.co/Salesforce/codegen-16B-mono) is a model for conversational program synthesis, in which each problem is solved interactively over multiple steps. Each step consists of a natural language specification from the user and a synthesized subprogram from the system.

It was trained sequentially on three datasets:
- [The Pile](https://huggingface.co/datasets/the_pile)
- A 341GB subset of Google’s [BigQuery dataset](https://cloud.google.com/blog/topics/public-datasets/github-on-bigquery-analyze-all-the-open-source-code) of code files from multiple programming languages, keeping only six: C, C++, Go, Java, JavaScript, and Python
- 217GB of Python data from GitHub repositories

The second and third datasets were preprocessed as follows:
- Exact match deduplication
- Filtering:
    - Average line length < 100 tokens
    - Maximum line length < 1000
    - Files with more than 90% of characters being decimal or hexadecimal digits removed

**Remark**: The reported data sizes are after preprocessing.
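
The preprocessing described above can be sketched roughly as follows. This is a minimal illustration, not the pipeline actually used for CodeGen: the function names (`keep_file`, `deduplicate_exact`, `preprocess`) are hypothetical, line lengths are measured in characters as an assumption, whitespace is ignored when computing the digit share, and SHA-256 hashing is just one possible way to detect exact duplicates.

```python
import hashlib
import string

# Characters counted as "decimal or hexadecimal digits": 0-9, a-f, A-F.
HEX_DIGITS = set(string.hexdigits)


def keep_file(text, max_avg_line_len=100, max_line_len=1000, max_digit_fraction=0.9):
    """Return True if a source file passes all three filters.

    Assumption: lengths are counted in characters, and the thresholds
    are taken from the list above.
    """
    lines = text.splitlines()
    if not lines:
        return False  # assumption: empty files are dropped

    lengths = [len(line) for line in lines]
    if sum(lengths) / len(lengths) >= max_avg_line_len:
        return False
    if max(lengths) >= max_line_len:
        return False

    # Share of non-whitespace characters that are decimal/hex digits.
    chars = [c for c in text if not c.isspace()]
    if chars:
        digit_fraction = sum(c in HEX_DIGITS for c in chars) / len(chars)
        if digit_fraction > max_digit_fraction:
            return False
    return True


def deduplicate_exact(files):
    """Yield each distinct file content once, keeping the first occurrence."""
    seen = set()
    for text in files:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield text


def preprocess(files):
    """Apply exact-match deduplication, then the three filters."""
    return [text for text in deduplicate_exact(files) if keep_file(text)]


if __name__ == "__main__":
    sample = "def add(a, b):\n    return a + b\n"
    hex_blob = "deadbeef0123456789\n" * 50
    kept = preprocess([sample, sample, hex_blob])
    print(len(kept))
```

Hashing the content rather than storing whole files keeps memory use bounded when the corpus is large; a real pipeline at this scale would presumably also shard the work across machines.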