Introducing Synthetic Data Workshop: Your Gateway to Easy Synthetic Dataset Creation Jun 20, 2024 • 12
Synthetic dataset generation techniques: generating custom sentence similarity data May 23, 2024 • 16
Can we create pedagogically valuable multi-turn synthetic datasets from Cosmopedia? May 7, 2024 • 8
Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models Mar 20, 2024 • 74
Introducing IDEFICS: An Open Reproduction of State-of-the-art Visual Language Model Aug 22, 2023 • 29
Huggy Lingo: Using Machine Learning to Improve Language Metadata on the Hugging Face Hub Aug 2, 2023 • 1
data-is-better-together/fineweb-c-progress Viewer • Updated about 3 hours ago • 782 • 382 • 3
ymoslem/ModernBERT-base-long-context-qe-v1 Text Classification • Updated about 12 hours ago • 3 • 4
Why choose between strong LLM reasoning and efficient models? Use DeepSeek to generate high-quality training data, then distil that knowledge into ModernBERT (answerdotai/ModernBERT-base) for fast, efficient classification. Blog post: https://danielvanstrien.xyz/posts/2025/deepseek/distil-deepseek-modernbert.html
librarian-bots/dataset_cards_with_metadata Viewer • Updated about 19 hours ago • 216k • 347 • 12