Daniel van Strien PRO
davanstrien
AI & ML interests
Machine Learning Librarian
Recent Activity
updated a dataset about 1 hour ago: data-is-better-together/fineweb-c-progress
updated a dataset about 6 hours ago: librarian-bots/dataset-columns
liked a model 1 day ago: facebook/MobileLLM-Pro
Organizations
Maths reasoning
Maths reasoning datasets found using https://huggingface.co/spaces/librarian-bots/huggingface-datasets-semantic-search
- Semantic Hugging Face Hub Search
  Running • 84 • Find datasets and models using semantic search
- open-r1/OpenR1-Math-220k
  Viewer • Updated • 450k • 7.34k • 652
- simplescaling/s1K-1.1
  Viewer • Updated • 1k • 2.56k • 138
- MU-NLPC/Calc-ape210k
  Viewer • Updated • 404k • 1.12k • 25
sentence-transformers-from-synthetic-data
Example of using distilabel to generate synthetic triplet data for fine-tuning a Sentence Transformer model
- bigcode/self-oss-instruct-sc2-exec-filter-50k
  Viewer • Updated • 50.7k • 307 • 101
- davanstrien/similarity-dataset-sc2-8b
  Viewer • Updated • 2.32k • 21 • 6
- davanstrien/code-prompt-similarity-model
  Sentence Similarity • 0.1B • Updated • 1 • 6
- davanstrien/abstract-wiki
  Viewer • Updated • 5k • 13 • 2
haiku
This is a collection of synthetic datasets built to improve the ability of open language models to write haikus through the use of DPO
Probably DPO datasets
A collection of datasets that probably support DPO
- HuggingFaceH4/ultrafeedback_binarized
  Viewer • Updated • 187k • 18.8k • 310
- mlabonne/orpo-dpo-mix-40k
  Viewer • Updated • 44.2k • 597 • 291
- argilla/OpenHermesPreferences
  Viewer • Updated • 989k • 345 • 210
- argilla/distilabel-capybara-dpo-7k-binarized
  Viewer • Updated • 7.56k • 2.84k • 182
query-to-hub-datasets-viewer-project
hub-tldr
Creating a smol model for tl;dr-ing the hub
- davanstrien/Smol-Hub-tldr
  Text Generation • 0.4B • Updated • 1 • 10
- Semantic Hugging Face Hub Search
  Running • 84 • Find datasets and models using semantic search
- davanstrien/hub-tldr-dataset-summaries-llama
  Viewer • Updated • 5k • 20 • 1
- davanstrien/hub-tldr-model-summaries-llama
  Viewer • Updated • 5k • 36 • 1
synthetic-data-generation-demos
A collection of demos for various approaches to synthetic data generation
- Genstruct 7B
  Runtime error • 8
- Instruction Synthesizer
  Runtime error • 86 • Generate instruction-response pairs from text
- Magpie
  Running on Zero • 71 • Generate and rate instruction-response pairs
- Bonito
  Runtime error • 11 • Generate task-specific instructions and responses from text
Synthetic (text) Dataset Generation
Papers about synthetic dataset generation
- Better Synthetic Data by Retrieving and Transforming Existing Datasets
  Paper • 2404.14361 • Published • 2
- Generative AI for Synthetic Data Generation: Methods, Challenges and the Future
  Paper • 2403.04190 • Published • 1
- Best Practices and Lessons Learned on Synthetic Data for Language Models
  Paper • 2404.07503 • Published • 31
- A Multi-Faceted Evaluation Framework for Assessing Synthetic Data Generated by Large Language Models
  Paper • 2404.14445 • Published
Historic language modeling
This collection contains models, datasets and spaces related to historic language models, i.e. language models trained on historic data
- dbmdz/bert-base-finnish-europeana-cased
  Fill-Mask • 0.1B • Updated • 9
- dbmdz/bert-base-historic-english-cased
  Fill-Mask • 0.1B • Updated • 17 • 1
- Livingwithmachines/erwt-year
  Fill-Mask • Updated • 1
- dbmdz/bert-base-historic-dutch-cased
  Fill-Mask • 0.1B • Updated • 3.74k • 2
Image Preference Optimization Datasets
Datasets suitable for Image Preference Optimization based on their column names
Reasoning Required?
- davanstrien/reasoning-required
  Viewer • Updated • 5k • 70 • 19
- davanstrien/ModernBERT-based-Reasoning-Required
  Text Classification • 0.1B • Updated • 3 • 10
- davanstrien/fineweb-with-reasoning-scores-and-topics
  Viewer • Updated • 10k • 30 • 1
- davanstrien/fine-reasoning-questions
  Viewer • Updated • 244 • 38 • 19