update
datasets/codegen.txt +1 -1
datasets/codegen.txt
CHANGED
@@ -11,7 +11,7 @@ The second and third datasets used the following preprocessing:
 - Exact match deduplication
 - Average line length < 100 tokens
 - Maximum line length < 1000 MB
--
+- Characters being decimal or hexadecimal digits >90%
 
 **Remark**:
 The reported data sizes are after preprocessing.
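For readers who want to see how the four preprocessing rules listed in this hunk could fit together, here is a minimal sketch. It is not taken from the repository. The function names (`digit_fraction`, `keep_file`, `preprocess`) are hypothetical, line lengths are measured in characters as a stand-in for the units given in the file, and the digit rule is read as "drop files in which more than 90% of non-whitespace characters are decimal or hexadecimal digits". All three are assumptions.

```python
import hashlib
import string

# 0-9, a-f, A-F. Treating letters a-f as hex digits is an assumption.
HEX_DIGITS = set(string.hexdigits)


def digit_fraction(text):
    """Fraction of non-whitespace characters that are decimal or hex digits."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c in HEX_DIGITS for c in chars) / len(chars)


def keep_file(text, max_avg_line_len=100, max_line_len=1000,
              max_digit_fraction=0.9):
    """Return True if a file passes the three filtering rules.

    Lengths are counted in characters here (assumption); the thresholds
    are plain parameters so a different unit can be substituted.
    """
    lines = text.splitlines()
    if not lines:
        return False
    lengths = [len(line) for line in lines]
    if sum(lengths) / len(lengths) >= max_avg_line_len:
        return False  # average line too long
    if max(lengths) >= max_line_len:
        return False  # at least one very long line
    if digit_fraction(text) > max_digit_fraction:
        return False  # mostly digits, e.g. a hex dump
    return True


def preprocess(files):
    """Exact-match deduplication followed by the filters above."""
    seen = set()
    kept = []
    for text in files:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate of an earlier file
        seen.add(digest)
        if keep_file(text):
            kept.append(text)
    return kept
```

Hashing the full file content gives exact-match deduplication without keeping every file in memory for comparison. Near-duplicate detection would need a different technique and is not covered by the rules in this hunk.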