Dataset for reproduction
#7, opened by ahans1
Do you plan to release the dataset needed to reproduce this training run? I know you have released the Cosmopedia dataset, which accounts for 25B of the 30B tokens, but could you also release the exact split of the remaining 5B non-synthetic tokens used to train this model?