Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models Mar 20 • 66
Introducing IDEFICS: An Open Reproduction of State-of-the-art Visual Language Model Aug 22, 2023 • 27
UnifiedCrawl: Aggregated Common Crawl for Affordable Adaptation of LLMs on Low-Resource Languages Paper • 2411.14343 • Published 1 day ago • 4
Multimodal Autoregressive Pre-training of Large Vision Encoders Paper • 2411.14402 • Published 1 day ago • 26
Tulu 3 Datasets Collection All datasets released with Tulu 3 -- state of the art open post-training recipes. • 32 items • Updated 1 day ago • 17
Tulu 3 Models Collection All models released with Tulu 3 -- state of the art open post-training recipes. • 7 items • Updated 1 day ago • 15
Interactive Medical Image Segmentation: A Benchmark Dataset and Baseline Paper • 2411.12814 • Published 4 days ago • 1
Article Introducing Observers: AI Observability with Hugging Face datasets through a lightweight SDK By davidberenstein1957 • 2 days ago • 17
OpenScholar_V1 Collection The set of models, index, data associated with the paper "OpenScholar: Synthesizing Scientific Literature with Retrieval-Augmented LMs". • 8 items • Updated 1 day ago • 18
RedPajama: an Open Dataset for Training Large Language Models Paper • 2411.12372 • Published 4 days ago • 41
LLaVA-o1: Let Vision Language Models Reason Step-by-Step Paper • 2411.10440 • Published 8 days ago • 93
Multilingual Pretraining Using a Large Corpus Machine-Translated from a Single Source Language Paper • 2410.23956 • Published 23 days ago • 1
AstroMLab 3: Achieving GPT-4o Level Performance in Astronomy with a Specialized 8B-Parameter Large Language Model Paper • 2411.09012 • Published 9 days ago • 1
Are Large Language Model-based Evaluators the Solution to Scaling Up Multilingual Evaluation? Paper • 2309.07462 • Published Sep 14, 2023 • 4
Article Releasing the largest multilingual open pretraining dataset By Pclanglais • 10 days ago • 94
Can Models Help Us Create Better Models? Evaluating LLMs as Data Scientists Paper • 2410.23331 • Published 24 days ago • 7
SmolLM2 Collection State-of-the-art compact LLMs for on-device applications: 1.7B, 360M, 135M • 10 items • Updated 2 days ago • 173