Recent Activity

merve posted an update 3 days ago

Large AI labs dropped so many open models last week 🔥 don't miss out on them

→ Apple released on-device vision LMs apple/fastvlm-68ac97b9cd5cacefdd04872e & apple/mobileclip2-68ac947dcb035c54bcd20c47
→ OpenGVLab released InternVL3.5, 32 new vision LMs with one based on gpt-oss! (OS) OpenGVLab/internvl35-68ac87bd52ebe953485927fb
→ MSFT released a killer small TTS model (OS) microsoft/VibeVoice-1.5B

find more here: https://huggingface.co/collections/merve/august-29-releases-68b5a3754cfb8abf59e2b486

merve posted an update 9 days ago

first vision language model built off openai/gpt-oss-20b just dropped! 🔥

InternVL3.5 comes with 32 models 🤯 pre-trained, fine-tuned, and aligned, in various sizes OpenGVLab/internvl35-68ac87bd52ebe953485927fb
comes with either gpt-oss or Qwen3 as the LLM part ⬇️

merve posted an update 28 days ago

GPT-4.1-mini-level model right on your iPhone 🤯

openbmb/MiniCPM-V-4 is only 4B while surpassing GPT-4.1-mini in vision benchmarks 🔥

allows commercial use as well!

merve posted an update 30 days ago

we're all sleeping on this OCR model: rednote-hilab/dots.ocr 🔥

dots.ocr is a new 3B model with SOTA performance, support for 100 languages & commercial use allowed! 🤯

a single e2e model to extract text from images and convert tables, formulas, and more into markdown 📝
try it: MohamedRashad/Dots-OCR

merve posted an update about 1 month ago

massive releases and tons of FLUX.1 Krea LoRAs past week!
here are some of the picks, find more models in the collection 🫡 merve/releases-august-2-6890c14248203522b7d0267f

LLMs 💬
> Tencent dropped tencent/Hunyuan-7B-Instruct
> Qwen released Qwen/Qwen3-Coder-30B-A3B-Instruct, a 30B MoE with 3B active params for coding (OS)

vision/multimodal
> RedNote released rednote-hilab/dots.ocr - 3B OCR model (OS)
> Cohere released CohereLabs/command-a-vision-07-2025 - 112B (dense!) VLM for 6 languages
> StepFun-AI shipped stepfun-ai/step3 - 321B MoE VLM (OS)
> Skywork shipped Skywork/Skywork-UniPic-1.5B - new any-to-any model (image+text β†’ image+text) (OS)

merve posted an update about 1 month ago

past week in open AI was insane 🔥 here are some of our picks, find more here merve/releases-july-25-688768ca47fe3693407e02d1

💬 LLMs & VLMs
> Qwen/Qwen3-235B-A22B-Thinking-2507 had a new update (OS)
> Qwen/Qwen3-Coder-480B-A35B-Instruct is out with 480B total / 35B active params 🤯 (OS)
> AllenAI dropped an update to allenai/olmOCR-7B-0725 📝
> InternLM released internlm/Intern-S1 - 235B Qwen3 MoE + 6B InternViT encoder (OS)
> OmniSVG/OmniSVG is a new SVG generation VLM (OS)

πŸ–ΌοΈ image/video/3D generation
> WanAI released Wan2.2 series - both T2V and I2V 14B models for high-quality video generation (OS) multimodalart/wan-22-688767e313337b434ed55112
> Tencent dropped tencent/HunyuanWorld-1 - image-to-3D scene generation model

merve posted an update about 1 month ago

🤯 241B VLM with an Apache-2.0 license: internlm/Intern-S1

InternLM released Intern-S1: a multimodal reasoning model based on a 235B Qwen3 MoE and a 6B InternViT encoder 😍

benchmarks look great (👑 best model, ✅ best open model)

merve posted an update about 1 month ago

so many open LLMs and image LoRAs dropped past week, here are some picks for you 🫡 merve/releases-july-18-687e3fbd2ab9b39c51f9238b

LLMs
> ByteDance released a bunch of translation models called Seed-X-RM (7B) ByteDance-Seed/Seed-X-RM-7B
> NVIDIA released reasoning models, of which the 32B surpasses the giant Qwen3-235B, with a cc-by-4.0 license 👏 nvidia/openreasoning-nemotron-687730dae0170059860f1f01
> LG released a new EXAONE model (32B) LGAI-EXAONE/EXAONE-4.0-32B

VLMs/any-to-any
> vidore/colqwen-omni-v0.1 is a new any-to-any retriever (MIT)
> HiDream-ai/HiDream-E1-1 is an image+text in, image+text out model (MIT)

LoRAs
> There's a bunch of LoRAs based on Flux Kontext, gotta check out the collection 🤠

merve posted an update about 2 months ago

Fine-tune Gemma3n on videos with audio on a Colab A100 🔥
Just dropped the notebook where you can learn how to fine-tune Gemma3n on images+audio+text at the same time!

keep in mind, it's made for educational purposes 🫡 we do LoRA, audio resampling & video downsampling to be able to train under 40 GB VRAM

stretch modalities and unfreeze layers as you wish! 🙏🏻 merve/smol-vision
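The preprocessing half of that recipe can be sketched in a few lines. This is a minimal illustration with numpy only; the function names and the concrete rates (48 kHz → 16 kHz audio, 8 kept frames) are my assumptions for the example, not the notebook's exact code:

```python
import numpy as np

def resample_audio(wave: np.ndarray, src_rate: int, dst_rate: int) -> np.ndarray:
    """Naive linear-interpolation resampling, e.g. 48 kHz -> 16 kHz.
    (A real pipeline would use torchaudio/librosa; this just shows the idea.)"""
    n_out = int(round(len(wave) * dst_rate / src_rate))
    src_t = np.arange(len(wave)) / src_rate   # timestamps of the input samples
    dst_t = np.arange(n_out) / dst_rate       # timestamps we want to sample at
    return np.interp(dst_t, src_t, wave)

def downsample_frames(frames: list, max_frames: int) -> list:
    """Keep at most max_frames frames, evenly spaced across the whole video."""
    if len(frames) <= max_frames:
        return frames
    idx = np.linspace(0, len(frames) - 1, max_frames).round().astype(int)
    return [frames[i] for i in idx]

# one second of 48 kHz audio -> 16 kHz, and a 100-frame clip -> 8 frames
audio_16k = resample_audio(np.zeros(48_000), 48_000, 16_000)
kept = downsample_frames(list(range(100)), 8)
```

Shrinking both modalities up front like this, combined with LoRA, is what keeps the training run inside the ~40 GB VRAM budget the post mentions.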

merve posted an update about 2 months ago

past week had huuuge releases 💗
here are our picks 🔥 find more models, datasets, demos here merve/releases-july-11-68750452c358c98b0fa663f7

> moonshotai/Kimi-K2-Instruct is the new SOTA LLM with 1T total / 32B active parameters 🤯

> HuggingFaceTB/SmolLM3-3B is the new best LM for its size, offers a thinking mode 💭 as well as the dataset HuggingFaceTB/smoltalk2

> Alibaba-NLP/WebSailor-3B is the new agentic LLM for complex browsing

> Google DeepMind released medical vision LMs with an agentic doctor-patient app google/medgemma-release-680aade845f90bec6a3f60c4

> fal released a LoRA to improve details on face images fal/Realism-Detailer-Kontext-Dev-LoRA

merve posted an update about 2 months ago

GitHub has been refusing to render notebooks for a long time now 💔

so smol-vision now lives in a Hugging Face model repository 🤗 merve/smol-vision

merve posted an update about 2 months ago

ByteDance released Tar 1.5B and 7B: image-text in, image-text out models, fully open-source 👏 ByteDance-Seed/tar-6864cf0d9fe59a3b91cc4260

They have an image tokenizer unified with text, and they de-tokenize using either of two models (an LLM or a diffusion model)
The model is actually a full LLM (Qwen2); the tokenizer converts images into tokens 🤯

merve posted an update about 2 months ago

Huge drops in open AI past week!
Find more models, datasets, demos here merve/releases-july-4-686bcc54ed7c45c341fbf654
Some of our picks 🫡
⏯️ BAAI/MTVCraft is a new Veo3-like text-to-video model, demo is here BAAI/MTVCraft
🧑🏻‍💻 apple/diffucoder-6868139f56672ae046fe04e8 is a new family of diffusion LLMs (7B base and instruct) for coding
🗣️ kyutai/tts-1.6b-en_fr is a new small TTS model for English and French
👀 aharley/alltracker is a new pixel tracking model from Stanford, demo is here aharley/alltracker
📖 racineai/OGC_MEGA_MultiDomain_DocRetrieval is a new large visual document retrieval dataset

merve posted an update 2 months ago

SOOOO MANY MODEL RELEASES 😍
Here are some picks from the past week 🤗

> ByteDance/XVerse is a new identity-preserving image generation model 🖼️
> google/gemma-3n-E4B-it, an any-to-text model supported in transformers 🤗
> nvidia/llama-nemoretriever-colembed-3b-v1, one of two new state-of-the-art visual document retrievers 📑
> A new version of the Dia TTS model is up: nari-labs/Dia-1.6B-0626
> Black Forest Labs released the Kontext benchmark: black-forest-labs/kontext-bench

Find more here merve/releases-june-27-6864e8eb17f7e3a8b444083c

merve posted an update 2 months ago

visual reasoning is now in transformers 🔥
https://huggingface.co/THUDM/GLM-4.1V-9B-Thinking was just released and merged into transformers, we gave it a vibe test run 🤠

it's very good, comes with a 64k context length and an MIT license 😍
it supports 4k image tokens and any aspect ratio as well!
Notebook: https://colab.research.google.com/drive/1atODIiV57hOZLv16Bjzwd6fwx0yoTorj?usp=sharing
Demo: https://huggingface.co/spaces/THUDM/GLM-4.1V-9B-Thinking-Demo
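Since the model is merged into transformers, a vibe test presumably goes through the library's standard chat-message format for image-text-to-text models. Below is a sketch of just that message structure; the image URL and the question are placeholders of mine, and the load/generate calls are left as comments because they would download the 9B checkpoint:

```python
# Chat-style input for an image-text-to-text model in transformers.
# The image URL and question below are hypothetical placeholders.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/chart.png"},
            {"type": "text", "text": "What does this chart show? Think step by step."},
        ],
    }
]

# With the checkpoint downloaded, generation would look roughly like:
# from transformers import AutoProcessor, AutoModelForImageTextToText
# processor = AutoProcessor.from_pretrained("THUDM/GLM-4.1V-9B-Thinking")
# model = AutoModelForImageTextToText.from_pretrained("THUDM/GLM-4.1V-9B-Thinking")
# inputs = processor.apply_chat_template(messages, add_generation_prompt=True,
#                                        tokenize=True, return_dict=True, return_tensors="pt")
# print(processor.decode(model.generate(**inputs, max_new_tokens=512)[0]))
```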