Google Code Archive Dataset - nyuuzyou/google-code-archive
Expanding beyond the modern code series, this release presents a massive historical snapshot from the Google Code Archive. This dataset captures the open-source landscape from 2006 to 2016, offering a unique time capsule of software development patterns during the era before GitHub's dominance.
Key Stats:
- 65,825,565 files from 488,618 repositories
- 47 GB compressed Parquet storage
- 454 programming languages (heavily featuring Java, PHP, and C++)
- Extensive quality filtering (excluding vendor code and build artifacts)
- Rich historical metadata: original repo names, file paths, and era-specific licenses
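For anyone who wants to poke at the data without pulling all 47 GB, here is a minimal sketch using the `datasets` library in streaming mode. The split name `train` and the field name `language` are assumptions on my part, not confirmed by the post, so check the dataset card for the actual schema. `stream_archive` and `count_by_field` are hypothetical helper names used only for illustration.

```python
from collections import Counter
from itertools import islice
from typing import Iterable, Iterator, Mapping


def stream_archive(repo_id: str = "nyuuzyou/google-code-archive",
                   split: str = "train") -> Iterator[Mapping]:
    """Lazily iterate over records instead of downloading every Parquet shard.

    The split name "train" is an assumption; adjust it to match the dataset card.
    """
    # Imported here so the counting helper below works without `datasets` installed.
    from datasets import load_dataset
    return iter(load_dataset(repo_id, split=split, streaming=True))


def count_by_field(records: Iterable[Mapping], field: str, limit: int) -> Counter:
    """Tally the values of `field` over at most the first `limit` records."""
    return Counter(rec.get(field, "unknown") for rec in islice(records, limit))


# Example (needs network access; "language" is an assumed column name):
# counts = count_by_field(stream_archive(), "language", 10_000)
# print(counts.most_common(5))
```

Streaming keeps memory use flat because records are read one at a time, which makes a quick look at the language mix practical on a laptop.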
This is one of the releases I'm most interested in getting feedback on. Would you like to see more old code datasets?