AI & ML interests

None defined yet.

Recent Activity

ariG23498 authored a paper about 4 hours ago
FineVision: Open Data Is All You Need
sergiopaniego updated a Space 4 months ago
visionLMsftw/comparevlms
merve updated a Space 4 months ago
visionLMsftw/comparevlms

andito posted an update about 3 hours ago
Finally, our new paper is out! "FineVision: Open Data Is All You Need"! 🥳
FineVision: Open Data Is All You Need (2510.17269)

If you've ever trained a VLM, you know this problem: nobody shares their data mixtures. It's a black box that makes replicating SOTA work impossible.
We wanted to change that.

FineVision unifies 200 sources into 24 million samples. With 17.3 million images and 9.5 billion answer tokens, it's the largest open resource of its kind.

In the paper, we share how we built it:
🔍 finding and cleaning data at scale
🧹 removing excessive duplicates across sources
🤗 decontaminating against 66 public benchmarks

My favorite part is Figure 6 (in the video!). It's our visual diversity analysis. It shows that FineVision isn't just bigger; it's more balanced and conceptually richer than other open datasets.
NVIDIA's Eagle 2 paper highlighted just how critical this visual diversity is, and our results confirm it: models trained on FineVision consistently outperform those trained on any other open dataset on 11 benchmarks!

🎉 To celebrate the paper, I'm also releasing a concatenated and shuffled version of the full dataset! 👉 HuggingFaceM4/FineVision_full_shuffled

It's ready to stream, so you can start training your own models right away:

from datasets import load_dataset

# Stream the dataset instead of downloading all ~17M images up front
d = load_dataset("HuggingFaceM4/FineVision_full_shuffled", split="train", streaming=True)
print(next(iter(d)))  # peek at the first sample

A big shoutout to the first authors: Luis Wiedmann and Orr Zohar. They are rockstars!
merve posted an update 1 day ago
deepseek-ai/DeepSeek-OCR is out! 🔥 my take ⤵️
> pretty insane it can parse and re-render charts in HTML
> it uses CLIP and SAM features concatenated, so better grounding
> very efficient performance per vision token
> covers 100 languages
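
If you want to poke at it yourself, here's a minimal loading sketch with transformers. The model ships custom remote code, so the inference entry point and prompt format below are assumptions from memory of the model card; verify there before relying on them.

# Minimal sketch: load DeepSeek-OCR via transformers remote code.
# The `model.infer(...)` call and prompt format are assumptions --
# the model card documents the canonical usage.
from transformers import AutoModel, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-OCR"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True).eval().cuda()

result = model.infer(tokenizer, prompt="<image>\nConvert this page to markdown.",
                     image_file="page.png")
print(result)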
sergiopaniego posted an update 4 days ago
New drop! 💥 The VLM Object Understanding Comparison Space now runs with Qwen3-VL-4B and moondream3.

You can compare how models reason about images 🧠

Bonus: thanks to @ariG23498, you now get auto-suggested prompts to explore faster.

Let's gooo

sergiopaniego/vlm_object_understanding
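
If you'd rather hit the Space programmatically than through the UI, gradio_client can introspect it. A minimal sketch; view_api() lists whatever endpoints the Space actually exposes, so you can call client.predict(...) against the right one:

# Minimal sketch: discover the Space's API with gradio_client.
from gradio_client import Client

client = Client("sergiopaniego/vlm_object_understanding")
client.view_api()  # prints the named endpoints and their parameters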
sergiopaniego posted an update 6 days ago
@Qwen released their new small and dense VLMs (Qwen3-VL).

They're incredibly capable and one of my all-time favourite VLMs.

🤗 We've prepared some resources to help you get started.

> Fine-tune Qwen3-VL-4B with SFT or GRPO (free Colab notebooks):
> SFT: https://colab.research.google.com/github/huggingface/trl/blob/main/examples/notebooks/sft_qwen_vl.ipynb
> GRPO: https://colab.research.google.com/github/huggingface/trl/blob/main/examples/notebooks/grpo_qwen3_vl.ipynb

> Compare object detection vs. Moondream3:
sergiopaniego/vlm_object_understanding

> Fine-tune from the CLI using TRL:
https://github.com/kashif/Qwen3-VL/blob/trl-sft/qwen-vl-finetune/README.md#trl-based-training-single-gpu
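
If you'd rather script it than run the notebooks, here's a rough sketch of the TRL SFT setup. The dataset name is a placeholder, and real VLM fine-tuning also needs image preprocessing (which the notebooks wire up), so treat this as a starting point rather than the notebooks' exact recipe:

# Rough sketch: SFT on Qwen3-VL-4B with TRL. The dataset is a
# placeholder; vision preprocessing is omitted for brevity.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("your/vision-sft-dataset", split="train")  # placeholder

trainer = SFTTrainer(
    model="Qwen/Qwen3-VL-4B-Instruct",
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="qwen3-vl-4b-sft",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        bf16=True,
    ),
)
trainer.train()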
sergiopaniego posted an update 11 days ago
Super nice intro to fine-tuning with TRL, just dropped by @google (runs free on Colab)!

They use SFT + QLoRA to fine-tune the tiny Gemma 3 270M model for emoji generation.

Here's what the fine-tuned model generates for the prompt: "I'm learning to tweet" → 🐦🗣💻

Colab: https://colab.research.google.com/github/google-gemini/gemma-cookbook/blob/main/Demos/Emoji-Gemma-on-Web/resources/Fine_tune_Gemma_3_270M_for_emoji_generation.ipynb
Try it out: google/emoji-gemma
Learn more: https://developers.googleblog.com/en/own-your-ai-fine-tune-gemma-3-270m-for-on-device/
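
For a feel of what SFT + QLoRA means in practice, here's a minimal configuration sketch with peft and bitsandbytes. The hyperparameters are illustrative, not the notebook's exact values:

# Minimal QLoRA sketch: 4-bit quantized base model + LoRA adapters.
# Values are illustrative; see the Colab for the actual recipe.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained("google/gemma-3-270m-it",
                                             quantization_config=bnb)

lora = LoraConfig(r=16, lora_alpha=32, target_modules="all-linear",
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # only the adapters train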
sergiopaniego posted an update 14 days ago
Online training methods (e.g., GRPO) require real-time generation, a compute- and memory-heavy bottleneck.

TRL has built-in vLLM support, and in this new recipe we show how to leverage it for efficient online training. Run it on Colab ⚡ and scale to multi-GPU/multi-node!

🧑‍🍳 recipe: https://huggingface.co/learn/cookbook/grpo_vllm_online_training
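
The key switch is GRPOConfig's use_vllm flag, which hands generation off to vLLM. A minimal sketch with a toy length-based reward and a placeholder dataset, mirroring the shape of TRL's quickstart:

# Minimal sketch: GRPO with generation delegated to vLLM via use_vllm=True.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

def reward_len(completions, **kwargs):
    # Toy reward: prefer completions close to 50 characters.
    return [-abs(50 - len(c)) for c in completions]

dataset = load_dataset("trl-lib/tldr", split="train")

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=reward_len,
    args=GRPOConfig(output_dir="grpo-vllm", use_vllm=True),
    train_dataset=dataset,
)
trainer.train()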
sergiopaniego posted an update 15 days ago
A few days ago, Thinking Machines Lab released "LoRA Without Regret", showing that LoRA can match full fine-tuning performance when configured right.

Naturally, we decided to reproduce the results with TRL and release a guide!

https://huggingface.co/docs/trl/main/en/lora_without_regret
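
The core of the recipe is putting adapters on every linear layer (MLPs included, not just attention) and training at a learning rate well above the full fine-tuning one. A minimal peft config sketch; the values here are illustrative, the guide has the reproduced settings:

# Sketch of a "LoRA without regret" style config: adapters on all
# linear layers, rank sized to the dataset. Values are illustrative.
from peft import LoraConfig

peft_config = LoraConfig(
    r=256,                        # generous rank for larger datasets
    lora_alpha=16,
    target_modules="all-linear",  # attention and MLP projections
    task_type="CAUSAL_LM",
)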
sergiopaniego posted an update 25 days ago
You need to try this tool! 🫡

My colleague @Molbap built an interactive HF Space to explore the modular support of open models in transformers over time

👀 You'll spot things like 🦙 llama serving as the base for many models, or which ones could be modular next

Try it: Molbap/transformers-modular-refactor
sergiopaniego posted an update 26 days ago
How fast can you spin up a Hugging Face Inference Endpoint with vLLM to deploy a state-of-the-art OCR model?

Let's break it down step by step.

1️⃣ Create your endpoint
Go to Hugging Face Endpoints → + NEW
Select Deploy from Hub → rednote-hilab/dots.ocr → Configure 🛠️

2️⃣ Configure hardware & container
Pick hardware: AWS/GPU/L4 ⚡
Set container: vLLM
Click Create ✅

3️⃣ Update endpoint settings
Container → Container URI: vllm/vllm-openai:nightly → Update
Advanced → add flag --trust-remote-code → Update ⚠️

4️⃣ Run inference
Download the script: ariG23498/useful-scripts
Set your HF_TOKEN and update base_url in the script.
Run it. ✅

Your OCR model is now live via HF Inference Endpoints!
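
Since the container is vLLM's OpenAI-compatible server, the script essentially points an OpenAI client at your endpoint URL. A minimal sketch; the base_url is a placeholder for your own endpoint, and the message format assumes the server's multimodal chat API:

# Minimal sketch: query the vLLM-backed endpoint via its OpenAI-compatible
# API. base_url is a placeholder -- use your endpoint's URL.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://<your-endpoint>.endpoints.huggingface.cloud/v1",
    api_key=os.environ["HF_TOKEN"],
)
response = client.chat.completions.create(
    model="rednote-hilab/dots.ocr",
    messages=[{"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/doc.png"}},
        {"type": "text", "text": "Extract the text from this document."},
    ]}],
)
print(response.choices[0].message.content)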
sergiopaniego posted an update 27 days ago
💥 Tons of new material just landed in the smol-course! 🧑‍💻

> evaluation
> alignment
> VLMs
> quizzes
> assignments!
> certificates! 👩‍🎓

go learn! 👉 https://huggingface.co/learn/smol-course/unit0/1
merve posted an update 29 days ago
large AI labs open-sourced a ton of models last week 🔥
here's a few picks, find even more here merve/sep-16-releases-68d13ea4c547f02f95842f05 🤝
> IBM released a new Docling model with 258M params based on Granite (A2.0) ibm-granite/granite-docling-258M
> Xiaomi released a 7B audio LM with base and instruct variants (MIT) XiaomiMiMo/mimo-audio-68cc7202692c27dae881cce0
> DecartAI released Lucy Edit, an open Nano Banana 🍌 (NC) decart-ai/Lucy-Edit-Dev
> OpenGVLab released a family of agentic computer use models (3B/7B/32B) with the dataset 💻 OpenGVLab/scalecua-68c912cf56f7ff4c8e034003
> Meituan LongCat released a thinking version of LongCat-Flash 💭 meituan-longcat/LongCat-Flash-Thinking
sergiopaniego posted an update 29 days ago
This summer TRL leveled up for multimodal alignment 🌞

✅ New VLM alignment methods (MPO, GRPO, GSPO)
✅ Extended RLOO & Online DPO for VLMs
✅ Native SFT support
✅ Ready-to-use training scripts

🔗 https://huggingface.co/blog/trl-vlm-alignment
merve posted an update about 1 month ago
IBM just released a small Swiss Army knife for document models: granite-docling-258M on Hugging Face 🔥

> not only a document converter: it can also do document question answering and understands multiple languages 🤯
> best part: released with an Apache 2.0 license 👍 use it in your commercial projects!
> it supports transformers, vLLM and MLX from the get-go! 🤗
> built on SigLIP2 & granite-165M

model: ibm-granite/granite-docling-258M
demo: ibm-granite/granite-docling-258m-demo 💗
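
Since it supports transformers out of the box, you can try it with the image-text-to-text pipeline. A minimal sketch; the prompt wording is an assumption, and the model card documents the exact conversion prompts it expects:

# Minimal sketch: granite-docling via transformers' image-text-to-text
# pipeline. The prompt is an assumption -- check the model card.
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="ibm-granite/granite-docling-258M")
messages = [{"role": "user", "content": [
    {"type": "image", "url": "https://example.com/page.png"},
    {"type": "text", "text": "Convert this page to docling."},
]}]
print(pipe(text=messages, max_new_tokens=512))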
sergiopaniego posted an update about 1 month ago
Training long-context LLMs is getting easier!

TRL now supports Context Parallelism (CP), letting you scale sequences across multiple GPUs, even multi-node setups, seamlessly 💆
Combine TRL and accelerate, and you can run it effortlessly!

With 8 GPUs, CP enables 300k+ token sequences while keeping throughput reasonable.
Works for both full fine-tuning and LoRA, unlocking contexts that used to hit OOM 📈

Check out the full guide here 👉 https://huggingface.co/docs/trl/main/en/distributing_training#context-parallelism

If you want to learn more about Context Parallelism, check out the Ultrascale Playbook 👉 nanotron/ultrascale-playbook
sergiopaniego posted an update about 1 month ago
Thinking about learning the keys to post-training LLMs? 🧐

We just updated and released the smol course: the fastest track to mastering the fine-tuning of large language models. Free, hands-on, up-to-date, and it comes with a certificate! 🫰

What you'll get:
📖 Instruction tuning & preference alignment
🧑‍💻 Hands-on projects with TRL & Transformers
🏆 Challenges & community projects
🎓 Certificate of completion

go: hf.co/learn/smol-course
merve posted an update about 1 month ago
a ton of image/video generation models and LLMs from big labs 🔥

> Meta released facebook/mobilellm-r1-68c4597b104fac45f28f448e, smol LLMs for on-device use 💬
> Tencent released tencent/SRPO, a high-res image generation model, and tencent/POINTS-Reader, cutting-edge OCR
> ByteDance released bytedance-research/HuMo, video generation from any input ⏯️

find more models, datasets, demos here merve/sep-11-releases-68c7dbfa26bea8cd921fa0ac