About datasets and training settings

#6
by ldwang - opened

Thanks for sharing your great work.

  1. As your blog said, "we train for 2.3T tokens on DCLM-baseline combined with the StarCoder and ProofPile2 datasets". How did you set the ratio between the three, or did you just mix them together?
  2. Can I understand the training setup as follows: you first configure a large-scale run with a high token count, then stop at a certain point for a cooldown, so the total training volume is generally less than the preset token count?

Thanks a lot for your reply.

Toyota Research Institute org
  1. We actually trained for 4.3T tokens. This version is v0, which is old; we recommend using this one instead: https://huggingface.co/TRI-ML/DCLM-1B
    As for the ratio between the three, we just mixed them all together.

  2. The schedule was set for 4.3T tokens, so for this 2.3T run you are correct: we did a cooldown at the 2.3T checkpoint. For the full 4.3T run, no cooldown was needed because the schedule was already set for 4.3T. A rough sketch of this setup is shown below.
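To make the interaction between the preset schedule and the early cooldown concrete, here is a minimal sketch (not the authors' code): a cosine schedule configured for the full 4.3T-token horizon, plus a linear cooldown branch used when a run is stopped early at 2.3T. The peak/minimum learning rates, warmup length, and cooldown length are illustrative assumptions, not values from the actual run.

```python
import math

# All constants below are illustrative assumptions.
FULL_HORIZON_TOKENS = 4.3e12   # schedule length the run was configured for
PEAK_LR = 3e-3                 # assumed peak learning rate
MIN_LR = 3e-5                  # assumed final learning rate
WARMUP_TOKENS = 1e10           # assumed linear warmup length

def cosine_lr(tokens_seen: float) -> float:
    """Cosine decay over the full preset 4.3T horizon, with linear warmup."""
    if tokens_seen < WARMUP_TOKENS:
        return PEAK_LR * tokens_seen / WARMUP_TOKENS
    progress = (tokens_seen - WARMUP_TOKENS) / (FULL_HORIZON_TOKENS - WARMUP_TOKENS)
    progress = min(progress, 1.0)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))

def cooldown_lr(tokens_seen: float, stop_tokens: float, cooldown_tokens: float) -> float:
    """Early stop: follow the main schedule, then anneal linearly to MIN_LR
    over the last `cooldown_tokens` before `stop_tokens` (e.g. 2.3T)."""
    cooldown_start = stop_tokens - cooldown_tokens
    if tokens_seen < cooldown_start:
        return cosine_lr(tokens_seen)
    frac = min((tokens_seen - cooldown_start) / cooldown_tokens, 1.0)
    start_lr = cosine_lr(cooldown_start)
    return start_lr + frac * (MIN_LR - start_lr)

# Example: learning rate at 2.0T tokens for a run stopped at 2.3T
# with an (assumed) 0.2T-token cooldown window.
print(cooldown_lr(2.0e12, stop_tokens=2.3e12, cooldown_tokens=0.2e12))
```

The point of the sketch is only that the cosine curve itself is never rescaled: the 2.3T checkpoint sits partway through the 4.3T schedule, so a separate cooldown is applied there, while the full 4.3T run simply follows the schedule to its natural end.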

ldwang changed discussion status to closed
