[InCoder](https://huggingface.co/facebook/incoder-6B) is a code generation model that also supports code editing via [infilling](https://arxiv.org/pdf/2204.05999.pdf). It was trained on **216 GB** of preprocessed data from GitHub and Stack Overflow, covering 28 programming languages: 52 GB of Python code, 107 GB of code in other programming languages, and 57 GB of Stack Overflow content that isn't code.

The GitHub data was cleaned with the following steps:
- Average line length < 100 tokens
- Maximum line length < 3000 tokens
- Alphanumeric characters fraction > 0.4
- Remove auto-generated files (keyword search)
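
The cleaning rules above can be sketched as a single predicate over a file's text. This is a minimal illustration, not the pipeline actually used for InCoder: the function name `keep_file`, the whitespace-based token count, and the keyword list are all assumptions made for the example, and the maximum line length is treated here as a token count.

```python
# Hypothetical marker phrases; the source only says "keyword search".
AUTOGEN_KEYWORDS = (
    "auto-generated",
    "autogenerated",
    "automatically generated",
    "do not edit",
)


def count_tokens(line: str) -> int:
    # Assumption: whitespace-separated words stand in for real tokenizer tokens.
    return len(line.split())


def keep_file(
    text: str,
    max_avg_line_tokens: int = 100,
    max_line_tokens: int = 3000,
    min_alnum_fraction: float = 0.4,
) -> bool:
    """Return True if the file passes all four cleaning rules."""
    lines = text.splitlines()
    if not lines or not text.strip():
        return False  # nothing to keep in an empty file

    token_counts = [count_tokens(line) for line in lines]

    # Rule 1: average line length must stay below the threshold.
    if sum(token_counts) / len(token_counts) >= max_avg_line_tokens:
        return False

    # Rule 2: no single line may reach the maximum length.
    if max(token_counts) >= max_line_tokens:
        return False

    # Rule 3: more than 40% of all characters must be alphanumeric.
    alnum = sum(ch.isalnum() for ch in text)
    if alnum / len(text) <= min_alnum_fraction:
        return False

    # Rule 4: drop files that look auto-generated (simple keyword search).
    lowered = text.lower()
    if any(keyword in lowered for keyword in AUTOGEN_KEYWORDS):
        return False

    return True
```

A corpus would then be filtered with something like `[f for f in files if keep_file(f)]`. The thresholds are exposed as parameters so they can be adjusted without touching the rules themselves.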

The second component of the data consists of questions, answers, and comments from Stack Overflow. It includes:
- all questions that have at least one answer
- up to ten answers with a non-negative score (sorted by score) per question
- up to five comments per question/answer
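
As a rough illustration of these selection rules, the sketch below trims one question thread. The dictionary layout (`body`, `score`, `answers`, `comments`) and the function name `select_thread` are hypothetical, not the real Stack Overflow dump schema. Two further assumptions: a question is kept as long as it has any answer, even if none has a non-negative score, and when there are more than five comments the first five in stored order are kept.

```python
from typing import Optional


def select_thread(
    question: dict, max_answers: int = 10, max_comments: int = 5
) -> Optional[dict]:
    """Apply the selection rules to one question; return None if it is dropped."""
    answers = question.get("answers", [])
    if not answers:
        return None  # only questions with at least one answer are kept

    # Keep answers with a non-negative score, highest score first.
    ranked = sorted(
        (a for a in answers if a["score"] >= 0),
        key=lambda a: a["score"],
        reverse=True,
    )

    return {
        "body": question["body"],
        # Assumption: keep the first comments in stored order.
        "comments": question.get("comments", [])[:max_comments],
        "answers": [
            {
                "body": a["body"],
                "score": a["score"],
                "comments": a.get("comments", [])[:max_comments],
            }
            for a in ranked[:max_answers]
        ],
    }
```

Returning `None` for dropped questions lets a caller filter a whole dump with a single comprehension over `select_thread(q)`.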

Exact-match deduplication was performed on code files. For more details, please refer to the [InCoder paper](https://arxiv.org/pdf/2204.05999.pdf).
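
Exact-match deduplication can be sketched by hashing each file's full contents and keeping only the first file seen for each hash. The choice of SHA-256 and the function name `deduplicate_exact` are assumptions for illustration; the source does not say how the matching was implemented.

```python
import hashlib


def deduplicate_exact(files: list[str]) -> list[str]:
    """Keep the first occurrence of each file whose contents match exactly."""
    seen = set()
    unique = []
    for text in files:
        # Assumption: SHA-256 of the UTF-8 bytes serves as the content key.
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique
```

Because the comparison is exact, two files that differ only in whitespace or comments are both kept; catching those would require near-duplicate detection, which is a different technique.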