update
datasets/polycoder.txt CHANGED (+1 -1)
@@ -1,4 +1,4 @@
-[PolyCoder paper
+[PolyCoder paper](https://arxiv.org/pdf/2202.13169v3.pdf) gives a nice comparison of existing code models. The model was trained on **254GB** of data, after preprocessing, consisting of popular repositories for 12 popular programming languages with at least 50 stars from GitHub in October 2021. The data used the following preprocessing:
 - Exact match deduplication
 - Filtering:
 - Average line length < 100 tokens
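
The two preprocessing steps listed in the changed file (exact match deduplication, then a filter on average line length) can be sketched roughly as below. This is a minimal illustration, not the PolyCoder pipeline: the function names `exact_dedup`, `average_line_length`, and `preprocess` are hypothetical, and counting whitespace-separated words as "tokens" is an assumption, since the file does not say which tokenizer the threshold refers to.

```python
import hashlib


def exact_dedup(files):
    """Keep the first (path, text) pair for each distinct file content."""
    seen = set()
    unique = []
    for path, text in files:
        # Identical contents hash to the same digest, so repeats are skipped.
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append((path, text))
    return unique


def average_line_length(text):
    """Mean number of whitespace-separated tokens per line (assumed tokenization)."""
    lines = text.splitlines()
    if not lines:
        return 0.0
    return sum(len(line.split()) for line in lines) / len(lines)


def preprocess(files, max_avg_tokens=100):
    """Deduplicate exactly, then keep files whose average line length is below the limit."""
    return [
        (path, text)
        for path, text in exact_dedup(files)
        if average_line_length(text) < max_avg_tokens
    ]


files = [
    ("a.py", "x = 1\ny = 2\n"),
    ("b.py", "x = 1\ny = 2\n"),           # exact duplicate of a.py
    ("c.py", " ".join(["tok"] * 300)),    # one very long line
]
kept = preprocess(files)
```

With this sample input, `b.py` is dropped as an exact duplicate and `c.py` is dropped by the line-length filter, so only `a.py` remains in `kept`.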