About datasets and training settings

#6
by ldwang - opened

Thanks for sharing your great work.

  1. As your blog said, "we train for 2.3T tokens on DCLM-baseline combined with the StarCoder and ProofPile2 datasets". How did you set the ratio between the three, or did you just mix them together?
  2. Can I understand the training setup as follows: you first configure a large-scale run with a high token count, then stop at a certain point for a cooldown, so the total training volume is generally less than the preset token count?

Thanks a lot for your reply.

Toyota Research Institute org
  1. We actually trained for 4.3T tokens. This version is v0, which is old; we recommend using this one instead: https://huggingface.co/TRI-ML/DCLM-1B
    As for the ratio between the three, we just mixed them all together.

  2. The schedule was set for 4.3T tokens, so for this 2.3T run you are correct: we did a cooldown at the 2.3T checkpoint. For the full 4.3T run, no cooldown was needed because the schedule was already set for 4.3T. A rough sketch of this setup is shown below.
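To make the interaction between the preset schedule and the early cooldown concrete, here is a minimal sketch (not the authors' code): a cosine schedule configured for the full 4.3T-token horizon, plus a linear cooldown branch used when a run is stopped early at 2.3T. The peak/minimum learning rates, warmup length, and cooldown length are illustrative assumptions, not values from the actual run.

```python
import math

# All constants below are illustrative assumptions.
FULL_HORIZON_TOKENS = 4.3e12   # schedule length the run was configured for
PEAK_LR = 3e-3                 # assumed peak learning rate
MIN_LR = 3e-5                  # assumed final learning rate
WARMUP_TOKENS = 1e10           # assumed linear warmup length

def cosine_lr(tokens_seen: float) -> float:
    """Cosine decay over the full preset 4.3T horizon, with linear warmup."""
    if tokens_seen < WARMUP_TOKENS:
        return PEAK_LR * tokens_seen / WARMUP_TOKENS
    progress = (tokens_seen - WARMUP_TOKENS) / (FULL_HORIZON_TOKENS - WARMUP_TOKENS)
    progress = min(progress, 1.0)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))

def cooldown_lr(tokens_seen: float, stop_tokens: float, cooldown_tokens: float) -> float:
    """Early stop: follow the main schedule, then anneal linearly to MIN_LR
    over the last `cooldown_tokens` before `stop_tokens` (e.g. 2.3T)."""
    cooldown_start = stop_tokens - cooldown_tokens
    if tokens_seen < cooldown_start:
        return cosine_lr(tokens_seen)
    frac = min((tokens_seen - cooldown_start) / cooldown_tokens, 1.0)
    start_lr = cosine_lr(cooldown_start)
    return start_lr + frac * (MIN_LR - start_lr)

# Example: learning rate at 2.0T tokens for a run stopped at 2.3T
# with an (assumed) 0.2T-token cooldown window.
print(cooldown_lr(2.0e12, stop_tokens=2.3e12, cooldown_tokens=0.2e12))
```

The point of the sketch is only that the cosine curve itself is never rescaled: the 2.3T checkpoint sits partway through the 4.3T schedule, so a separate cooldown is applied there, while the full 4.3T run simply follows the schedule to its natural end.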

ldwang changed discussion status to closed
