Bram Vanroy PRO

BramVanroy

https://bramvanroy.github.io/

AI & ML interests

Artificial intelligence, natural language processing, computational linguistics

Recent Activity

liked a dataset 10 days ago

GPT-NL/DuidelijkeTaal-v1.0-split

liked a dataset 13 days ago

nvidia/Nemotron-Personas-France

reacted to yuriyvnv's post with 🚀 18 days ago

reacted to yuriyvnv's post with 🚀 18 days ago

Post

411

🎯 WAVe-1B-Multimodal-NL: Word-Level Speech Quality Assessment for Dutch

Following the release of the Portuguese model, we're releasing the Dutch variant of WAVe — a 1B multimodal embedding model that assesses synthetic speech quality at the word level, thereby improving the quality of synthetically augmented datasets for training ASR models.

Trained on CommonVoice 16.1 Dutch with 5 corruption strategies, this model catches mispronunciations, timing errors, and prosody issues in synthetic data that sentence-level embeddings miss entirely.

Resources

- Dutch model: yuriyvnv/WAVe-1B-Multimodal-NL
- Portuguese model: yuriyvnv/WAVe-1B-Multimodal-PT
- Code: https://github.com/yuriyvnv/WAVe

This model builds on CommonVoice Dutch data — thanks to @mozilla and the CommonVoice community for making multilingual speech data accessible.

Would be great to hear from the Dutch NLP community — @BramVanroy @GroNLP — especially if you're working on Dutch ASR or TTS pipelines where quality filtering could help. Also tagging @hf-audio as this sits at the intersection of speech processing and data curation.

reacted to onekq's post with 🔥 5 months ago

Post

2871

The reaction to the QAT post was beyond expectations, so below is my optimizer post as promised. But I found that I had a lot of explaining to do about the optimizer itself, so this post is actually a historical recount. The post on the Muon optimizer (used by Kimi), which is coming very soon, can only continue after this.

https://huggingface.co/blog/onekq/adam-optimizer

If you already know the Adam(W) optimizer, you can just skip this one, and sorry for the wait. Otherwise, it should be a useful read.

posted an update 6 months ago

Post

597

What are currently the best multilingual models with at most 72B parameters? Are Llama 3.3 70B and Qwen 2.5 72B still king?

1 reply

reacted to Reubencf's post with 🔥 7 months ago

Post

2436

Releasing Version 1 of the Konkani LLM

Konkani is a low-resource Indian language spoken by 2.5 million people.

Model: https://huggingface.co/Reubencf/gemma3-konkani

Space: https://huggingface.co/spaces/Reubencf/Gemma3-konkani

2 replies

posted an update 8 months ago

Post

1061

By popular request, I've just added two subsets to the CommonCrawl-Creative Commons Corpus (C5; BramVanroy/CommonCrawl-CreativeCommons) so that you do not have to do the filtering manually:

- C5f (BramVanroy/CommonCrawl-CreativeCommons-fine): only retains high-quality samples that are also present in FineWeb or FineWeb-2;
- C5r (https://huggingface.co/datasets/BramVanroy/CommonCrawl-CreativeCommons-recommended): additional strict filtering that removes samples with license disagreement, non-commercial licenses, and Wikipedia samples. The latter because you should probably get those from a more reliable source that provides better parsed content.

It goes without saying that these filters lead to a massive reduction in quantity. Doc and token counts are given on the dataset pages.
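The three strict C5r filters described above could be sketched roughly as follows. This is only an illustration, not the actual pipeline: the field names `license_disagreement`, `license_abbr` and `url`, as well as the function `keep_recommended`, are hypothetical and may not match the real dataset columns.

```python
def keep_recommended(sample):
    """Return True if a sample passes the three strict filters.

    The keys used here are assumed names for illustration only.
    """
    # 1. Drop samples where multiple licenses were found that disagree
    if sample["license_disagreement"]:
        return False
    # 2. Drop non-commercial licenses, which carry an "nc" component (e.g. "by-nc-sa")
    if "nc" in sample["license_abbr"].lower().split("-"):
        return False
    # 3. Drop Wikipedia samples, which are better obtained from a dedicated source
    if "wikipedia.org" in sample["url"]:
        return False
    return True


samples = [
    {"license_disagreement": False, "license_abbr": "by", "url": "https://example.com/a"},
    {"license_disagreement": False, "license_abbr": "by-nc-sa", "url": "https://example.com/b"},
    {"license_disagreement": True, "license_abbr": "by", "url": "https://example.com/c"},
    {"license_disagreement": False, "license_abbr": "cc0", "url": "https://nl.wikipedia.org/wiki/X"},
]
kept = [s for s in samples if keep_recommended(s)]
```

With the toy samples above, only the first one survives, which mirrors the massive reduction in quantity that such filters cause.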

reacted to pagezyhf's post with 🚀 9 months ago

Post

1576

In our push to make more models available on Azure, we recently added SmolLM v3 to the catalog! 🚀

@juanjucm wrote a really detailed guide on how to deploy on Azure AI 🤗

https://huggingface.co/docs/microsoft-azure/azure-ai/examples/deploy-smollm3

If you want to see other models, please let us know!

1 reply

replied to their post 11 months ago

Special thanks to:

The Common Crawl folks (Greg Lindahl, @pjox and others)
The datatrove and FineWeb teams at Hugging Face (@guipenedo and others)

posted an update 11 months ago

Post

3695

📢💾 Introducing the Common Crawl Creative Commons Corpus (C5)!

C5 is a large-scale effort to heavily filter web-crawled data, as collected by the non-profit Common Crawl, down to only those documents that are Creative Commons-licensed, such as cc-by-4.0 or public domain cc0. At this stage 150 billion tokens have been collected.

---
📄 data: BramVanroy/CommonCrawl-CreativeCommons
🧰 software: https://github.com/BramVanroy/CommonCrawl-CreativeCommons
---

</> To build C5, HTML pages are scrutinized and all links (if any) to CC licenses are collected, both in regular hyperlinks as well as in metadata. Additional data fields are included such as "was the license found in the head?" or "if multiple licenses were found, do they contradict each other?", which makes further filtering a breeze.
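To make the license detection step more concrete, here is a minimal sketch using only the Python standard library. It is not the actual C5 implementation (see the software repository for that): the regex, the class `CCLicenseFinder`, the function `find_licenses`, and the rule that "disagreement" means more than one distinct license abbreviation are all assumptions made for illustration.

```python
import re
from html.parser import HTMLParser

# Matches CC license and public-domain URLs, e.g.
# https://creativecommons.org/licenses/by/4.0/ or .../publicdomain/zero/1.0/
CC_URL = re.compile(
    r"https?://creativecommons\.org/(?:licenses|publicdomain)/([a-z\-]+)/(\d\.\d)",
    re.IGNORECASE,
)


class CCLicenseFinder(HTMLParser):
    """Collects CC license URLs from <a>, <link> and <meta> tags."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.found = []  # tuples of (abbreviation, version, found_in_head)

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        if tag not in ("a", "link", "meta"):
            return
        attrs = dict(attrs)
        # Hyperlinks use "href"; metadata tags typically use "content"
        for value in (attrs.get("href"), attrs.get("content")):
            match = CC_URL.search(value or "")
            if match:
                self.found.append((match.group(1).lower(), match.group(2), self.in_head))

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def find_licenses(html):
    """Return the licenses found plus two helper fields for later filtering."""
    finder = CCLicenseFinder()
    finder.feed(html)
    abbreviations = {abbr for abbr, _, _ in finder.found}
    return {
        "licenses": finder.found,
        "found_in_head": any(in_head for _, _, in_head in finder.found),
        # Simplifying assumption: distinct abbreviations count as a disagreement
        "license_disagreement": len(abbreviations) > 1,
    }
```

Calling `find_licenses(page_html)` on a page returns a small dictionary whose extra fields play the same role as the "found in the head?" and "do they contradict each other?" fields mentioned above.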

🌐 In this first version of C5, 8 languages are included (Afrikaans, German, English, French, Frisian, Italian, Dutch and Spanish). The language set was limited for two reasons: computational and storage limitations, and a collaboration with GPT-NL, which requested CC data for these languages to train a Dutch-focused, copyright-conscious LLM. In total, this V1 release contains almost 150 thousand documents and 150 billion tokens. This data was neither filtered on quality nor deduplicated, so that you can decide for yourself how much data to keep. To give some indication of quality, a dataset field describes whether a document is included in the FineWeb(-2) datasets, which are of high quality.

🔍 More work needs to be done! Only 7 out of 100+ Common Crawl crawls have been processed so far. That's encouraging because it means there is a lot more Creative Commons data to be collected! But to get there, I need help in terms of compute. The current processing was already heavily sponsored by the Flemish Supercomputer Center, but more is needed. If you have the compute available and wish to collaborate in an open and transparent manner, please get in touch!

1 reply

reacted to julien-c's post with ❤️🔥 over 1 year ago

Post

11470

After some heated discussion 🔥, we clarify our intent re. storage limits on the Hub

TL;DR:
- public storage is free, and (unless blatant abuse) unlimited. We do ask that you consider upgrading to PRO and/or Enterprise Hub if possible
- private storage is paid above a significant free tier (1TB if you have a paid account, 100GB otherwise)

docs: https://huggingface.co/docs/hub/storage-limits

We optimize our infrastructure continuously to scale our storage for the coming years of growth in Machine learning, to the benefit of the community 🔥

cc: @reach-vb @pierric @victor and the HF team

29 replies

posted an update over 1 year ago

Post

1261

In the spirit of "Better late than never", I've finally written a brief overview paper for GEITje 7B Ultra. The model was initially released 10 months ago (oops), but it still reaches around 1300 monthly downloads across the HF ecosystem (not including ollama).

GEITje 7B Ultra: A Conversational Model for Dutch (2412.04092)

While the paper discusses the model a little bit, I especially wanted to write about the datasets, which to this day seem an important asset for Dutch LLM training (SFT and preference tuning). We have a long way to go for Dutch, but publishing transparent and reproducible artefacts seems an important step to me, alongside having open discussions about data, bias, architectures.

In that spirit, thanks are in order for the creation of GEITje 7B Ultra and all related datasets:

- Michiel Buisman and UWV for providing the means to create the datasets
- Flemish Supercomputer Center (VSC) for the compute
- The Hugging Face Fellows and rest of the team for their discussions and insights
- The Dutch NLP community, notably @Rijgersberg for building the base GEITje model and the fruitful discussions we've had

More to come, step by step!

BramVanroy/geitje-7b-ultra-65c1ee010ad80fd1f6a8f208

reacted to davidberenstein1957's post with 🚀 over 1 year ago

Post

3034

🚀 We will be generating a preference dataset for DPO/ORPO and cleaning it with AI feedback during our upcoming meetup!

In this session, we'll walk you through the essentials of building a distilabel pipeline by exploring two key use cases: cleaning an existing dataset and generating a preference dataset for DPO/ORPO. You’ll also learn how to make the most of AI feedback, integrating Argilla to gather human feedback and improve the overall data quality.

This session is perfect for you
- if you’re getting started with distilabel or synthetic data
- if you want to learn how to use LLM inference endpoints for **free**
- if you want to discover new functionalities
- if you want to provide us with new feedback

Sign up here: https://lu.ma/dt0c7jru

posted an update almost 2 years ago

Post

2229

The InstructGPT paper mentions that they insert 10% pretraining data during SFT, which they find improves the effect of PPO (IIUC). Has anyone else done later ablations on this? I've only seen the inverse suggested, mixing in SFT data during pretraining.

2 replies

reacted to adamm-hf's post with ❤️ almost 2 years ago

Post

2152

cooking up something....anyone interested in a daily activity tracker for HF?

12 replies

posted an update almost 2 years ago

Post

2321

All my models seem to be plagued by infinite lists. When you ask a question that requires it to write a list, it most often keeps adding bullet points or enumeration. I am wondering whether this is a result of using chatty GPT-4 as DPO preferences. Any thoughts?

1 reply

replied to their post almost 2 years ago

Nice! In my experience preference tuning with the ultra feedback datasets does not really change benchmark scores (and sometimes even makes them worse) but it does seem to improve the real-world user experience when chatting with the model.

I'm also not sure if orpo only on UF is better than sft on UC + DPO on UF, especially if you're also trying to do language adaptation. That, or first continue pretraining the model and then doing orpo.

replied to their post almost 2 years ago

While the "rules" of OpenAI do get frustrating from time to time, I do not blame others who do not follow the same path as I do. If I am asked why my licenses are different from someone else's, I will answer according to what I've written in the post above (the rules suck and are vague, I understand why people do what they do, and I do what I do for other reasons). But I definitely do not want to go around and point fingers pre-emptively in the hope that people will just use my models. Our community for Dutch is already quite small, so I'd rather we lift each other up and build on each other's work through friendly "competition" than compete in bad faith.

So I think that for my future models, I'll just make use of ultrachat+ultrafeedback, which should be cleared for training apache 2.0 models because they were created with Azure. This may negatively impact the model's performance (especially for code because it does not include the Stack Overflow set) but I hope the impact is limited.

replied to their post almost 2 years ago

What do you mean by compliance in this context? I'm not sure how I can market being non-commercial as a good thing 😅

replied to their post almost 2 years ago

Cool! Looking forward to what you'll build with this!

posted an update almost 2 years ago

Post

2324

🥳 New license for datasets: Apache 2.0!

I have been struggling mentally for many months now with the OpenAI terms of use that indicate that their model outputs cannot be used to build "competing models". This leads to many questions:

- what is the definition of competing? Is it the same as "commercial"?
- since this is part of the terms of use between OpenAI and the API user, can a third party still use the generated dataset to build competing models?
- are such restrictions even legal in the first place?

Trying to "follow the rules" as much as possible despite wanting to be as open as possible, I kept releasing my datasets under non-commercial licenses (which are too restrictive anyhow - nothing should prevent you from using the data in non-LM commercial settings), just like models trained on these datasets. This has put me at a competitive disadvantage compared to creators who do not follow the same approach and release their data/models on apache 2.0 despite the OpenAI "restrictions". Moreover, I fear (https://twitter.com/BramVanroy/status/1780220420316164246) that my approach blocks adaptation of my data/models for (commercial) applications/integrations.

Thankfully @Rijgersberg noted that these OpenAI terms of use are NOT explicit in the Azure OpenAI API (https://twitter.com/E_Rijgersberg/status/1780308971762450725). Since my latest datasets were created via Azure, this comes as a relief. As far as I can tell after digging through Azure docs, this allows me to change all recent GPT4-generated datasets to apache 2.0! 🥳

- BramVanroy/ultrachat_200k_dutch
- BramVanroy/orca_dpo_pairs_dutch
- BramVanroy/ultra_feedback_dutch
- BramVanroy/ultra_feedback_dutch_cleaned
- BramVanroy/no_robots_dutch

I will have to mull over what I'll do for the older GPT3.5 datasets. What do you think that I should do?

9 replies
