Spaces: codeparrot/code-generation-models

loubnabnl (HF Staff) committed on May 27, 2022

Commit 873252e (1 parent: 2b3c79e)

update datasets

Files changed (1)

datasets/codeparrot.txt CHANGED

@@ -1,8 +1,8 @@
-[CodeParrot](https://huggingface.co/lvwerra/codeparrot) was trained on **50GB** of Python data from Github repositories: [CodeParrot dataset](https://huggingface.co/datasets/lvwerra/codeparrot-clean). The original dataset contains a lot of duplicated and noisy data. Therefore, the dataset was cleaned with the following steps:
 - Exact match deduplication
-- Filtering
-  - Average line length < 100
-  - Maximum line length < 1000
   - Alpha numeric characters fraction > 0.25
   - Remove auto-generated files (keyword search)

+[CodeParrot](https://huggingface.co/lvwerra/codeparrot) was trained on **50GB** of Python data, after preprocessing, from Github repositories: [CodeParrot dataset](https://huggingface.co/datasets/lvwerra/codeparrot-clean). The original dataset contains a lot of duplicated and noisy data. Therefore, the dataset was cleaned with the following steps:
 - Exact match deduplication
+- Filtering:
+  - Average line length < 100 tokens
+  - Maximum line length < 1000 MB
   - Alpha numeric characters fraction > 0.25
   - Remove auto-generated files (keyword search)
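
The cleaning steps listed in the changed file can be sketched roughly as follows. This is a minimal illustration, not the actual CodeParrot preprocessing code. It assumes line lengths are counted in characters (the added lines give the units as tokens and MB, so this is a guess), that exact-match deduplication hashes the full file contents, and that the keyword search only inspects the first few lines of a file. The names `AUTOGEN_KEYWORDS`, `passes_filters`, and `clean` are hypothetical, as is the keyword list itself.

```python
import hashlib

# Hypothetical keyword list; the source only says "keyword search".
AUTOGEN_KEYWORDS = ("auto-generated", "autogenerated", "automatically generated")


def passes_filters(text, max_avg_line_len=100, max_line_len=1000, min_alnum_frac=0.25):
    """Return True if a file passes the heuristic filters from the list above."""
    lines = text.splitlines()
    if not lines:
        return False  # empty files carry no training signal

    # Assumption: line length is measured in characters.
    lengths = [len(line) for line in lines]
    if sum(lengths) / len(lengths) >= max_avg_line_len:
        return False
    if max(lengths) >= max_line_len:
        return False

    # Fraction of alphanumeric characters over the whole file.
    alnum_frac = sum(c.isalnum() for c in text) / len(text)
    if alnum_frac <= min_alnum_frac:
        return False

    # Assumption: auto-generation notices appear near the top of the file.
    head = "\n".join(lines[:5]).lower()
    if any(keyword in head for keyword in AUTOGEN_KEYWORDS):
        return False

    return True


def clean(files):
    """Drop exact duplicates, then apply the filters; input order is preserved."""
    seen = set()
    kept = []
    for text in files:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact-match duplicate of an earlier file
        seen.add(digest)
        if passes_filters(text):
            kept.append(text)
    return kept
```

Hashing the file contents keeps the set of seen files small compared with storing the full text, at the cost of catching only byte-identical duplicates; near-duplicates would need a different technique.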