singhsidhukuldeep posted an update · Jun 8, 2024
📈 One of the biggest changes in Llama 3 was the training dataset, which grew 7X over Llama 2's (2T → 15T tokens) 🚀

While Meta did not open-source the dataset, it sparked a thought: what would happen if everyone had access to a big, high-quality dataset? 🤔

To address that, in April 2024, @huggingface released FineWeb, an open 15T-token dataset 🌍
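
If you want to poke at it yourself, here's a minimal sketch of streaming a few documents with the 🤗 `datasets` library. The `"sample-10BT"` config name is an assumption taken from the dataset card (full `CC-MAIN-*` dump configs also exist); adjust if it has changed:

```python
from datasets import load_dataset

# Stream FineWeb instead of downloading the full corpus to disk.
# "sample-10BT" is a small 10B-token sample config (name assumed
# from the dataset card).
fw = load_dataset(
    "HuggingFaceFW/fineweb",
    name="sample-10BT",
    split="train",
    streaming=True,
)

# Peek at the first few documents; each row carries a "text" field.
for doc in fw.take(3):
    print(doc["text"][:200], "\n---")
```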

And now they are releasing the FineWeb technical report and FineWeb-Edu 📚

πŸ† 15T tokens in FineWeb outperforming other open datasets
πŸŽ“ 1.3T highest-quality educational dataset FineWeb-Edu
πŸ“˜ 5.4T high-quality educational tokens in FineWeb-Edu-2
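
The Edu subset loads the same way. A rough sketch, with the same assumed `"sample-10BT"` config name; the per-document quality fields `"score"` / `"int_score"` are also my assumption from the dataset card:

```python
from datasets import load_dataset

# FineWeb-Edu, same streaming pattern; rows should also carry the
# educational-quality score ("score" / "int_score" field names are
# assumptions from the dataset card).
edu = load_dataset(
    "HuggingFaceFW/fineweb-edu",
    name="sample-10BT",
    split="train",
    streaming=True,
)

doc = next(iter(edu))
print(doc.get("score"), doc["text"][:150])
```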

FineWeb-Edu outperforms other open datasets on MMLU, ARC, and OpenBookQA 📈

ODC-By 1.0 license πŸ“œ

Report: HuggingFaceFW/blogpost-fineweb-v1