update
datasets/codeparrot.txt +1 -1
datasets/codeparrot.txt
CHANGED
@@ -1,4 +1,4 @@
-[CodeParrot](https://huggingface.co/lvwerra/codeparrot)
+[CodeParrot](https://huggingface.co/lvwerra/codeparrot) is a code generation model trained on **50GB** of Python data, after preprocessing, from Github repositories: [CodeParrot dataset](https://huggingface.co/datasets/lvwerra/codeparrot-clean). The original dataset contains a lot of duplicated and noisy data. Therefore, the dataset was cleaned with the following steps:
 - Exact match deduplication
 - Filtering:
 - Average line length < 100 tokens
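
The two cleaning steps named in the changed file can be illustrated with a minimal sketch. This is not the CodeParrot preprocessing code. The function names, the SHA-256 hashing, and the choice to measure line length in whitespace-separated tokens are all assumptions made for illustration; a character-based measure can be passed in through `measure` instead.

```python
import hashlib


def exact_dedup(files):
    """Keep the first occurrence of each file whose content is byte-identical."""
    seen = set()
    unique = []
    for text in files:
        # Assumption: "exact match" is modeled as equality of content hashes.
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


def avg_line_length(text, measure=lambda line: len(line.split())):
    """Average per-line length; the default unit (whitespace tokens) is an assumption."""
    lines = text.splitlines()
    if not lines:
        return 0.0
    return sum(measure(line) for line in lines) / len(lines)


def clean(files, max_avg=100):
    """Deduplicate, then keep files whose average line length is below max_avg."""
    return [t for t in exact_dedup(files) if avg_line_length(t) < max_avg]


if __name__ == "__main__":
    sample = ["a = 1\nb = 2\n", "a = 1\nb = 2\n", "x " * 500]
    print(len(clean(sample)))
```

In the sample, the second file is dropped as an exact duplicate and the third is dropped because its single line holds 500 tokens.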