update datasets
datasets/codeparrot.txt +4 -4
datasets/codeparrot.txt
CHANGED
@@ -1,8 +1,8 @@
-[CodeParrot](https://huggingface.co/lvwerra/codeparrot) was trained on **50GB** of Python data from Github repositories: [CodeParrot dataset](https://huggingface.co/datasets/lvwerra/codeparrot-clean). The original dataset contains a lot of duplicated and noisy data. Therefore, the dataset was cleaned with the following steps:
+[CodeParrot](https://huggingface.co/lvwerra/codeparrot) was trained on **50GB** of Python data, after preprocessing, from Github repositories: [CodeParrot dataset](https://huggingface.co/datasets/lvwerra/codeparrot-clean). The original dataset contains a lot of duplicated and noisy data. Therefore, the dataset was cleaned with the following steps:
 - Exact match deduplication
-- Filtering
-- Average line length < 100
-- Maximum line length < 1000
+- Filtering:
+- Average line length < 100 tokens
+- Maximum line length < 1000 MB
 - Alpha numeric characters fraction > 0.25
 - Remove auto-generated files (keyword search)
 
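
The cleaning steps listed in the diff can be sketched in Python as follows. This is a minimal illustration, not the pipeline actually used for CodeParrot. It assumes that both line-length thresholds are measured in characters per line (the diff labels them "tokens" and "MB"). The function names, the keyword list, and the choice to scan only the first few lines for auto-generation markers are all hypothetical.

```python
import hashlib

# Thresholds from the list above; measuring them in characters is an assumption.
MAX_MEAN_LINE_LENGTH = 100
MAX_LINE_LENGTH = 1000
MIN_ALNUM_FRACTION = 0.25

# Hypothetical keyword list; the source only says "keyword search".
AUTOGEN_KEYWORDS = ("auto-generated", "autogenerated", "automatically generated")


def line_stats(text):
    """Return (mean line length, max line length) in characters."""
    lengths = [len(line) for line in text.splitlines()] or [0]
    return sum(lengths) / len(lengths), max(lengths)


def alnum_fraction(text):
    """Fraction of characters in the file that are alphanumeric."""
    if not text:
        return 0.0
    return sum(ch.isalnum() for ch in text) / len(text)


def looks_autogenerated(text, scan_lines=5):
    """Keyword search over the first few lines (scan depth is an assumption)."""
    head = "\n".join(text.splitlines()[:scan_lines]).lower()
    return any(keyword in head for keyword in AUTOGEN_KEYWORDS)


def keep_file(text):
    """Apply the four filtering rules to a single file's content."""
    mean_len, max_len = line_stats(text)
    return (
        mean_len < MAX_MEAN_LINE_LENGTH
        and max_len < MAX_LINE_LENGTH
        and alnum_fraction(text) > MIN_ALNUM_FRACTION
        and not looks_autogenerated(text)
    )


def clean(files):
    """Exact-match deduplication followed by filtering, preserving order."""
    seen = set()
    kept = []
    for text in files:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate of a file already processed
        seen.add(digest)
        if keep_file(text):
            kept.append(text)
    return kept
```

Calling `clean` on a list of file contents returns the surviving files in their original order. Hashing the full content keeps memory use bounded by the number of distinct files rather than by their total size.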