[CodeParrot](https://huggingface.co/lvwerra/codeparrot) was trained on **50GB** of preprocessed Python data from GitHub repositories: [CodeParrot dataset](https://huggingface.co/datasets/lvwerra/codeparrot-clean). The original dataset contains a lot of duplicated and noisy data. Therefore, the dataset was cleaned with the following steps:

- Exact match deduplication
- Filtering:
    - Average line length < 100 tokens
    - Maximum line length < 1000 tokens
    - Alphanumeric character fraction > 0.25
    - Remove auto-generated files (keyword search)

For more details see the preprocessing script in the transformers repository [here](https://github.com/huggingface/transformers/tree/master/examples/research_projects/codeparrot).
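The cleaning steps described above can be sketched roughly as follows. This is a minimal illustration, not the actual preprocessing script: the function names, the keyword list, and the choice to check only the first few lines for auto-generation markers are all assumptions made for the example, and line length is measured in characters here as a simplification.

```python
import hashlib

# Thresholds from the filtering steps; line length is counted in
# characters in this sketch (an assumption, chosen for simplicity).
MAX_MEAN_LINE_LENGTH = 100
MAX_LINE_LENGTH = 1000
MIN_ALNUM_FRACTION = 0.25
# Hypothetical keywords for spotting auto-generated files.
AUTOGEN_KEYWORDS = ("auto-generated", "autogenerated", "automatically generated")


def content_hash(content):
    """Return a digest of the file content for exact-match deduplication."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def passes_filters(content):
    """Apply the line-length, alphanumeric-fraction, and keyword filters."""
    lines = content.splitlines()
    if not lines:
        return False
    lengths = [len(line) for line in lines]
    if sum(lengths) / len(lengths) >= MAX_MEAN_LINE_LENGTH:
        return False
    if max(lengths) >= MAX_LINE_LENGTH:
        return False
    alnum_fraction = sum(ch.isalnum() for ch in content) / len(content)
    if alnum_fraction <= MIN_ALNUM_FRACTION:
        return False
    # Only the first few lines are searched here (an assumption).
    header = "\n".join(lines[:5]).lower()
    if any(keyword in header for keyword in AUTOGEN_KEYWORDS):
        return False
    return True


def clean(files):
    """Drop exact duplicates, then keep only files that pass the filters."""
    seen = set()
    kept = []
    for content in files:
        digest = content_hash(content)
        if digest in seen:
            continue
        seen.add(digest)
        if passes_filters(content):
            kept.append(content)
    return kept
```

Calling `clean(list_of_file_contents)` returns the surviving files in their original order. Deduplicating before filtering means each distinct file is checked only once.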