Why choose between strong LLM reasoning and efficient models?
Use DeepSeek to generate high-quality training data, then distil that knowledge into ModernBERT (answerdotai/ModernBERT-base) for fast, efficient classification.
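As a rough sketch of the distillation step, here's what fine-tuning ModernBERT on teacher-labelled data might look like. The texts, labels, and hyperparameters below are placeholders; in the real pipeline the labels would come from DeepSeek (e.g. via its API) before this step.

```python
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Placeholder data: in practice these texts would be labelled by the
# teacher LLM (DeepSeek) before this step.
texts = ["The support team resolved my issue quickly.", "The app crashes on startup."]
labels = [1, 0]  # hypothetical labels produced by the teacher

tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "answerdotai/ModernBERT-base", num_labels=2
)

ds = Dataset.from_dict({"text": texts, "label": labels})
ds = ds.map(
    lambda batch: tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="modernbert-distilled", num_train_epochs=3),
    train_dataset=ds,
)
trainer.train()
```

The payoff is that the small student model keeps most of the teacher's labelling quality at a fraction of the inference cost.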
Given an input image, the tool generates several queries along with explanations to justify them. This approach can be used to generate synthetic data for fine-tuning ColPali models.
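For illustration, here is one way such queries could be generated with a hosted vision-language model via huggingface_hub. The model choice, image URL, and prompt are my own assumptions, not necessarily what the tool itself uses.

```python
from huggingface_hub import InferenceClient

# Hypothetical setup: model, image URL, and prompt are illustrative only.
client = InferenceClient("Qwen/Qwen2-VL-7B-Instruct")

messages = [{
    "role": "user",
    "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/document-page.png"}},
        {
            "type": "text",
            "text": "Write three search queries a user might type to find this page, "
                    "each followed by a one-sentence justification.",
        },
    ],
}]

response = client.chat_completion(messages=messages, max_tokens=512)
print(response.choices[0].message.content)
```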
The Hugging Face community has rated educational content in languages spoken by 1.6 billion people! New additions:
- Japanese
- Italian
- Old High German
There's so much you could do with these developments, especially by combining them into agentic applications or fine-tuning them for your use case.
I'm helping out on some community research to learn about the AI community. If you want to join the conversation, head over to the community discussion I started on the most influential model since BERT.
📣 Teachers and Students! Here's a handy quiz app if you're preparing your own study material.
TL;DR: it's a quiz app that builds questions from a dataset and saves your answers.
Here's how it works:
- Make a dataset of multiple-choice questions (one possible format is sketched after this list)
- Duplicate the space and set the dataset repo
- Log in and take the quiz
- Submit the questions to create a new dataset
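For the first step, a minimal sketch of building and pushing such a dataset with the datasets library. The column names and repo id are hypothetical; check the space's README for the schema it actually expects.

```python
from datasets import Dataset

# Hypothetical schema and repo id, shown for illustration only.
questions = [
    {
        "question": "What does an LLM agent use to decide its next action?",
        "choices": ["The model's output", "A random seed", "A cron job", "A stylesheet"],
        "answer": "The model's output",
    },
]

Dataset.from_list(questions).push_to_hub("your-username/agents-quiz-questions")
```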
I made this to get ready for the agents course, but I hope it's useful for your projects too!
You can now use the Synthetic Data Generator with your own domain-specific seed data to generate a dataset for fine-tuning retrieval or reranking models.
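To sketch what the downstream fine-tuning step might look like, here's a minimal sentence-transformers example trained on (query, positive passage) pairs. The pairs shown are invented stand-ins for what the generator would produce from your seed data, and the base model is just one common choice.

```python
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

# Invented (query, positive passage) pairs standing in for the generator's output.
pairs = [
    ("how do I reset my password?", "To reset your password, open Settings > Security and..."),
    ("refund policy for annual plans", "Annual subscriptions can be refunded within 30 days of..."),
]

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
train_examples = [InputExample(texts=[query, passage]) for query, passage in pairs]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# MultipleNegativesRankingLoss treats the other in-batch passages as negatives,
# so plain (query, positive) pairs are enough to get started.
train_loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
```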
You can now use the "Synthetic Data Generator" at a much larger scale with your preferred inference engine: Ollama, vLLM, TGI, and serverless inference! 🔥
We've added a new chapter about the very basics of Argilla to the Hugging Face NLP course. Learn how to set up an Argilla instance, load & annotate datasets, and export them to the Hub.
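As a taste of what the chapter covers, a minimal Argilla 2.x-style sketch of that workflow. The URL, API key, labels, and dataset names are all placeholders; see the course chapter for the full walkthrough.

```python
import argilla as rg

# Connect to a running Argilla instance (URL and key are placeholders).
client = rg.Argilla(api_url="https://my-argilla.hf.space", api_key="my-api-key")

# Define a minimal annotation task: one text field, one label question.
settings = rg.Settings(
    fields=[rg.TextField(name="text")],
    questions=[rg.LabelQuestion(name="sentiment", labels=["positive", "negative"])],
)

dataset = rg.Dataset(name="sentiment-demo", settings=settings, client=client)
dataset.create()

# Add a couple of records to annotate in the UI.
dataset.records.log([
    {"text": "I love this library!"},
    {"text": "The docs left me confused."},
])

# Once annotated, export to the Hub (assuming Argilla 2.x's to_hub helper).
dataset.to_hub(repo_id="your-username/sentiment-demo")
```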