Still following your human intuition when mixing corpora from different sources for pre-training? Everyone says the data mixture has a big impact on model performance, but how, and why? Did you know that web corpora are actually highly impactful for downstream tasks?
Check out our preprint "RegMix: Data Mixture as Regression for Language Model Pre-training"!
In this paper, we propose an automatic data mixture method, RegMix, that achieves a 6.3% improvement over human selection on the widely used HellaSwag benchmark, and it needs only 2% extra training FLOPs!
Paper: RegMix: Data Mixture as Regression for Language Model Pre-training (2407.01492)
Code: https://github.com/sail-sg/regmix
Collection: sail/regmix-data-mixture-as-regression-6682b6caab37b9442877f0ce
Demo: https://huggingface.co/spaces/sail/RegMix
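The "data mixture as regression" idea can be sketched as follows: train small proxy models on randomly sampled mixtures, fit a regression from mixture weights to a target metric, then pick the mixture the regression predicts is best. Below is a minimal, self-contained sketch on synthetic data; the three domains, the simulated losses, and the plain linear model are illustrative assumptions, not the paper's exact setup (RegMix itself uses stronger regressors than ordinary least squares).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 3 data domains (e.g. web, code, books).
# Simulate "proxy runs": small models trained on random mixtures,
# each yielding a validation loss (here a synthetic linear function).
n_runs, n_domains = 64, 3
mixtures = rng.dirichlet(np.ones(n_domains), size=n_runs)  # rows sum to 1
true_w = np.array([-1.0, -0.2, -0.5])  # synthetic per-domain effect on loss
losses = 3.0 + mixtures @ true_w + rng.normal(0, 0.01, n_runs)

# Fit a linear regression: loss ~ b + mixture . w, via least squares.
X = np.hstack([np.ones((n_runs, 1)), mixtures])
coef, *_ = np.linalg.lstsq(X, losses, rcond=None)

# Search the simplex for the mixture with the lowest predicted loss.
candidates = rng.dirichlet(np.ones(n_domains), size=100_000)
pred = coef[0] + candidates @ coef[1:]
best = candidates[np.argmin(pred)]
print("best predicted mixture:", best.round(3))
```

With the synthetic coefficients above, the search concentrates weight on the domain whose coefficient lowers loss the most; the appeal of the approach is that all proxy runs are small, so the regression-plus-search step adds only a tiny fraction of the full pre-training compute.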