Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models Mar 20 • 66
Introducing IDEFICS: An Open Reproduction of State-of-the-art Visual Language Model Aug 22, 2023 • 27
UnifiedCrawl: Aggregated Common Crawl for Affordable Adaptation of LLMs on Low-Resource Languages Paper • 2411.14343 • Published 1 day ago • 4 • 2
RedPajama: an Open Dataset for Training Large Language Models Paper • 2411.12372 • Published 4 days ago • 41 • 3
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale Paper • 2406.17557 • Published Jun 25 • 86 • 5
Are Large Language Model-based Evaluators the Solution to Scaling Up Multilingual Evaluation? Paper • 2309.07462 • Published Sep 14, 2023 • 4 • 2
OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding Paper • 2406.19389 • Published Jun 27 • 51 • 10