Towards Best Practices for Open Datasets for LLM Training Paper • 2501.08365 • Published 3 days ago • 37
high-quality Chinese training datasets Collection A suite of high-quality Chinese datasets for pretraining, fine-tuning, or preference alignment, along with the models trained on them. • 12 items • Updated about 21 hours ago • 8
Article Synthetic Data Generation with FastData and Hugging Face By asoria • 10 days ago • 13
Reasoning Datasets Collection Reasoning datasets that are trending 🔥 • 10 items • Updated 14 days ago • 23
Article Bridging the Gap Between Physical Numerical Simulations and Machine Learning: Introducing The Well By rubenohana • Dec 2, 2024 • 17
Marqo-Ecommerce-Embeddings Collection State-of-the-art embedding models fine-tuned for the ecommerce domain. +67% increase in evaluation metrics vs ViT-B-16-SigLIP. • 10 items • Updated Nov 14, 2024 • 17
NLI Eval Datasets Collection A curated collection of NLI evaluation datasets. Each dataset is exactly as originally proposed. • 19 items • Updated Nov 12, 2024 • 3
BhasaAnuvaad Collection A Speech Translation Dataset for 13 Indian Languages • 11 items • Updated 2 days ago • 14
Article Releasing the largest multilingual open pretraining dataset By Pclanglais • Nov 13, 2024 • 98
Article Transformers.js v3: WebGPU support, new models & tasks, and more… Oct 22, 2024 • 66
DataEnvGym Collection Skills, datasets, etc. for DataEnvGym: Data Generation Agents in Teacher Environments with Student Feedback • 6 items • Updated Oct 10, 2024 • 1
Article 🥐CroissantLLM: A Truly Bilingual French-English Language Model By manu • Feb 5, 2024 • 11