Introducing Synthetic Data Workshop: Your Gateway to Easy Synthetic Dataset Creation Jun 20, 2024 • 12
Synthetic dataset generation techniques: generating custom sentence similarity data May 23, 2024 • 16
Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models Mar 20, 2024 • 72
Introducing IDEFICS: An Open Reproduction of State-of-the-art Visual Language Model Aug 22, 2023 • 28
Huggy Lingo: Using Machine Learning to Improve Language Metadata on the Hugging Face Hub Aug 2, 2023 • 1
The Heap: A Contamination-Free Multilingual Code Dataset for Evaluating Large Language Models Paper • 2501.09653 • Published 1 day ago • 8
AfriHate: A Multilingual Collection of Hate Speech and Abusive Language Datasets for African Languages Paper • 2501.08284 • Published 4 days ago • 5
Article Train 400x faster Static Embedding Models with Sentence Transformers 3 days ago • 98
OpenCSG Chinese Corpus: A Series of High-quality Chinese Datasets for LLM Training Paper • 2501.08197 • Published 4 days ago • 7
high-quality Chinese training datasets Collection A suite of high-quality Chinese datasets for pretraining, fine-tuning, or preference alignment, along with the models trained on them. • 12 items • Updated about 23 hours ago • 8
BIOMEDICA: An Open Biomedical Image-Caption Archive, Dataset, and Vision-Language Models Derived from Scientific Literature Paper • 2501.07171 • Published 5 days ago • 45
HistBERTurk-Models Collection Fine-tuned BERTurk models for historical Turkish. • 3 items • Updated 13 days ago • 2
Building Foundations for Natural Language Processing of Historical Turkish: Resources and Models Paper • 2501.04828 • Published 9 days ago • 11
SWE-Fixer: Training Open-Source LLMs for Effective and Efficient GitHub Issue Resolution Paper • 2501.05040 • Published 9 days ago • 14
BoundingDocs: a Unified Dataset for Document Question Answering with Spatial Annotations Paper • 2501.03403 • Published 11 days ago • 4
Article Synthetic Data Generation with FastData and Hugging Face By asoria • 10 days ago • 13
Article Crowd-sourced Open Preference Dataset for Text-to-Image Generation By RapidataAI • 11 days ago • 17
CaseSumm: A Large-Scale Dataset for Long-Context Summarization from U.S. Supreme Court Opinions Paper • 2501.00097 • Published 18 days ago • 1
Article FineWeb2-C: Help Build Better Language Models in Your Language By davanstrien • 26 days ago • 13
Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference Paper • 2412.13663 • Published about 1 month ago • 123
Granite 3.1 Language Models Collection A series of language models with 128K context length, trained by IBM and licensed under Apache 2.0. • 8 items • Updated about 1 month ago • 48
ModernBERT Collection Bringing BERT into modernity via both architecture changes and scaling • 3 items • Updated 30 days ago • 123