FineInstructions: Scaling Synthetic Instructions to Pre-Training Scale Paper • 2601.22146 • Published 5 days ago • 8
Economies of Open Intelligence: Tracing Power & Participation in the Model Ecosystem Paper • 2512.03073 • Published Nov 27, 2025 • 5
Global PIQA: Evaluating Physical Commonsense Reasoning Across 100+ Languages and Cultures Paper • 2510.24081 • Published Oct 28, 2025 • 19
SindBERT, the Sailor: Charting the Seas of Turkish NLP Paper • 2510.21364 • Published Oct 24, 2025 • 1
The German Commons - 154 Billion Tokens of Openly Licensed Text for German Language Models Paper • 2510.13996 • Published Oct 15, 2025 • 9
Automatic Metadata Generation and Extraction datasets Collection Datasets which can help train or evaluate various approaches to automatic metadata generation and extraction. • 4 items • Updated Oct 16, 2025 • 4
MeXtract: Light-Weight Metadata Extraction from Scientific Papers Paper • 2510.06889 • Published Oct 8, 2025 • 1
Index card datasets Collection Index card datasets for training and evaluating models that convert index cards to structured data/metadata • 3 items • Updated Oct 6, 2025 • 1