Spaces: codeparrot/code-generation-models
loubnabnl (HF Staff) committed on May 27, 2022

Commit 7678306
1 Parent(s): f81388d

update

Files changed (1)

datasets/codegen.txt CHANGED

@@ -11,7 +11,7 @@ The second and third datasets used the following preprocessing:
     - Exact match deduplication
     - Average line length < 100 tokens
     - Maximum line length < 1000 MB
-    - >90% of the characters being decimal or hexadecimal digits
+    - Characters being decimal or hexadecimal digits >90%
 **Remark**:
 The reported data sizes are after preprocessing.
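
For readers who want to see what the preprocessing bullets in this hunk might look like in practice, here is a minimal Python sketch. It is not the pipeline used to build the datasets. It rests on several assumptions the file does not state: both line-length limits are measured in characters (the file says "tokens" and "MB"), the digit rule drops files in which more than 90% of the non-whitespace characters are decimal or hexadecimal digits, and exact-match deduplication compares a SHA-256 hash of the whole file. The names `keep_file`, `deduplicate_exact`, and `preprocess` are hypothetical.

```python
import hashlib
import string

# Decimal and hexadecimal digit characters: 0-9, a-f, A-F.
HEX_DIGITS = set(string.hexdigits)


def keep_file(text, max_avg_line_len=100, max_line_len=1000, max_digit_fraction=0.9):
    """Return True if a file passes the three filters (assumed units: characters)."""
    lines = text.splitlines()
    if not lines:
        return False  # assumption: empty files are dropped

    lengths = [len(line) for line in lines]
    if sum(lengths) / len(lengths) >= max_avg_line_len:
        return False  # average line length must be < 100
    if max(lengths) >= max_line_len:
        return False  # maximum line length must be < 1000

    # Assumption: the digit fraction is computed over non-whitespace characters.
    non_ws = [c for c in text if not c.isspace()]
    if non_ws:
        digit_fraction = sum(c in HEX_DIGITS for c in non_ws) / len(non_ws)
        if digit_fraction > max_digit_fraction:
            return False  # file is mostly decimal or hexadecimal digits
    return True


def deduplicate_exact(files):
    """Keep the first occurrence of each byte-identical file, preserving order."""
    seen = set()
    unique = []
    for text in files:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


def preprocess(files):
    """Deduplicate first, then apply the line-length and digit filters."""
    return [text for text in deduplicate_exact(files) if keep_file(text)]
```

Calling `preprocess` on a list of file contents returns the surviving files in their original order. The thresholds are keyword arguments of `keep_file`, so a different reading of the units can be tried without changing the logic.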