We also released the [GitHub code dataset](https://huggingface.co/datasets/lvwerra/github-code), 1TB of code data from GitHub repositories in 32 programming languages. It was created from the public GitHub dataset on Google [BigQuery](https://cloud.google.com/blog/topics/public-datasets/github-on-bigquery-analyze-all-the-open-source-code). If downloading the full dataset isn't practical because of disk or memory limitations, you can load it in streaming mode, which creates an iterable dataset:
```python
from datasets import load_dataset

# Load the dataset in streaming mode so nothing is downloaded up front
ds = load_dataset("lvwerra/github-code", streaming=True, split="train")
print(next(iter(ds)))  # print the first sample

# Output:
{
'code': "import mod189 from './mod189';\nvar value=mod189+1;\nexport default value;\n",
'repo_name': 'MirekSz/webpack-es6-ts',
'path': 'app/mods/mod190.js',
'language': 'JavaScript',
'license': 'isc',
'size': 73
}
```
You can see that in addition to the code, the samples include some metadata: repo name, path, language, license, and the size of the file. Below is the distribution of programming languages in this dataset. These metadata fields also make it easy to work with a slice of the data while streaming, as in the sketch that follows.
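For example, since each sample records its language, you can keep only files in one language while iterating over the stream. Here is a minimal sketch, assuming a recent version of the `datasets` library where `filter` and `take` are supported on streamed (iterable) datasets:
```python
from datasets import load_dataset

# Stream the dataset and keep only Python files, using the "language"
# metadata field from the samples (no full download required)
ds = load_dataset("lvwerra/github-code", streaming=True, split="train")
python_files = ds.filter(lambda example: example["language"] == "Python")

# Peek at the first few matching samples
for sample in python_files.take(3):
    print(sample["repo_name"], sample["path"], sample["size"])
```
The same pattern works for any of the other metadata fields, such as `license` or `size`.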
Below is the distribution of the pretraining data size of some code models:
<p align="center">
<img src="https://huggingface.co/datasets/loubnabnl/repo-images/resolve/main/data_distrub.png" alt="drawing" width="450"/>
</p>
For detailed model-specific information about the pretraining dataset, please select a model below: