Update README.md
README.md
CHANGED
@@ -14,4 +14,6 @@ This organization contains the full datasets used to train StarCoder2:
 - `the-stack-v2-train-full`: contains the training data with 600+ programming languages used to train StarCoder2-15B with the files concatenated per repository
 - `the-stack-v2-train-full-files`: same as `the-stack-v2-train-full` but without repository concatenation which makes filtering files or licenses easier
 - `the-stack-v2-train-smol`: contains the training data with 17 programming languages used to train StarCoder2-3B and 7B with the files concatenated per repository
-- `the-stack-v2-train-smol-files`: same as `the-stack-v2-train-smol` but without repository concatenation which makes filtering files or licenses easier
+- `the-stack-v2-train-smol-files`: same as `the-stack-v2-train-smol` but without repository concatenation which makes filtering files or licenses easier
+
+See the [tech report](https://arxiv.org/pdf/2402.19173) for all the details on the dataset.