openwebtext dataset

after running prepare.py (preprocess) we get:

train.bin is ~17GB, val.bin ~8.5MB
train has ~9B tokens (9,035,582,198)
val has ~4M tokens (4,434,897)
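
as a rough sanity check, the file sizes follow from the token counts if each token takes 2 bytes on disk. that storage width is an assumption here (this readme does not state it), and the helper names below are made up for illustration. under that assumption the sizes come out to about 16.8 GiB and 8.5 MiB, so the "GB"/"MB" figures above would be binary units.

```python
import os

# assumption: each token id is stored as a 2-byte unsigned integer
BYTES_PER_TOKEN = 2

def expected_size_bytes(num_tokens: int) -> int:
    """size on disk of a flat token file holding num_tokens tokens."""
    return num_tokens * BYTES_PER_TOKEN

def count_tokens(path: str) -> int:
    """token count of an existing .bin file, inferred from its size."""
    return os.path.getsize(path) // BYTES_PER_TOKEN

# token counts quoted in this readme
train_tokens = 9_035_582_198
val_tokens = 4_434_897

train_gib = expected_size_bytes(train_tokens) / 2**30
val_mib = expected_size_bytes(val_tokens) / 2**20

print(f"train.bin: {train_gib:.2f} GiB")
print(f"val.bin:   {val_mib:.2f} MiB")
```

running `count_tokens("train.bin")` on your own output of prepare.py should give back the train token count if the assumption holds.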

this came from 8,013,769 documents in total.
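
a small sketch of what these numbers imply per document, assuming the 8,013,769 figure covers both the train and val splits (the variable names are illustrative only):

```python
# figures quoted in this readme
train_tokens = 9_035_582_198
val_tokens = 4_434_897
num_documents = 8_013_769  # assumed to cover train and val together

total_tokens = train_tokens + val_tokens
avg_tokens_per_doc = total_tokens / num_documents
val_fraction = val_tokens / total_tokens

print(f"total tokens: {total_tokens:,}")
print(f"average tokens per document: {avg_tokens_per_doc:.0f}")
print(f"validation share of tokens: {val_fraction:.4%}")
```

so a document is a bit over a thousand tokens on average, and the val split holds roughly 0.05% of all tokens.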

references:

OpenAI's WebText dataset is discussed in the GPT-2 paper
OpenWebText dataset