AI & ML interests

computer-vision, image-processing, machine-learning, deep-learning

Recent Activity

merve 
posted an update 4 days ago
large AI labs have dropped so many open models last week 🔥 don't miss out on them

→ Apple released on-device vision LMs apple/fastvlm-68ac97b9cd5cacefdd04872e & apple/mobileclip2-68ac947dcb035c54bcd20c47
→ OpenGVLab released InternVL3.5, 32 new vision LMs with one based on gpt-oss! (OS) OpenGVLab/internvl35-68ac87bd52ebe953485927fb
→ MSFT released a killer small TTS model (OS) microsoft/VibeVoice-1.5B

find more here: https://huggingface.co/collections/merve/august-29-releases-68b5a3754cfb8abf59e2b486
merve 
posted an update 10 days ago
kadirnar 
posted an update 11 days ago
What can you do with the VyvoTTS library?

- Using the pretrained (PT) model, you can train for a language the model has never been trained on. There's no need for large datasets.
- With the PT model, you can easily replicate the voice of any character you want. Just 1k samples are enough.
- You can add emotion support with a small dataset.

Github: https://github.com/Vyvo-Labs/VyvoTTS
HuggingFace: Vyvo
Nymbo 
posted an update 11 days ago
I built a general-use MCP space ~ Fetch webpages, DuckDuckGo search, Python code execution, Kokoro TTS, Image Gen, Video Gen.

# Tools

1. Fetch webpage
2. Web search via DuckDuckGo (very concise, low excess context)
3. Python code executor
4. Kokoro-82M speech generation
5. Image Generation (use any model from HF Inference Providers)
6. Video Generation (use any model from HF Inference Providers)

The first four tools can be used without any API keys whatsoever. DDG search is free, and code execution and speech gen run on CPU. Setting HF_READ_TOKEN in the env variables shows all tools. If no token is present, the Image/Video Gen tools are hidden.
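A rough sketch of how that token-based gating could look, in case it helps. This is not the Space's actual code: `KEYLESS_TOOLS`, `TOKEN_TOOLS`, and `visible_tools` are hypothetical names, and the only detail taken from the post is that the Image/Video Gen tools appear when `HF_READ_TOKEN` is set.

```python
import os

# Tool names come from the list above; the constants and function are hypothetical.
KEYLESS_TOOLS = [
    "Fetch webpage",
    "Web search via DuckDuckGo",
    "Python code executor",
    "Kokoro-82M speech generation",
]
TOKEN_TOOLS = ["Image Generation", "Video Generation"]


def visible_tools(env=None):
    """Return the tools to expose, hiding the token-gated ones if no token is set."""
    env = os.environ if env is None else env
    tools = list(KEYLESS_TOOLS)
    # An empty or missing HF_READ_TOKEN is treated as "no token present".
    if env.get("HF_READ_TOKEN"):
        tools += TOKEN_TOOLS
    return tools
```

Passing a plain dict instead of `os.environ` makes it easy to check both cases without touching the real environment.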

Nymbo/Tools
Nymbo 
posted an update 19 days ago
Anyone using Jan-v1-4B for local MCP-based web search, I highly recommend you try out Intelligent-Internet/II-Search-4B

Very impressed with this lil guy and it deserves more downloads. It's based on the original version of Qwen3-4B, but I find that it questions reality way less often. Jan-v1 seems to think that everything it sees is synthetic data and constantly gaslights me.
ZennyKenny 
posted an update 23 days ago
merve 
posted an update 29 days ago
GPT-4.1-mini level model right in your iPhone 🤯

openbmb/MiniCPM-V-4 is only 4B while surpassing GPT-4.1-mini in vision benchmarks 🔥

allows commercial use as well!
merve 
posted an update about 1 month ago
we're all sleeping on this OCR model rednote-hilab/dots.ocr 🔥

dots.ocr is a new 3B model with sota performance, support for 100 languages & allowing commercial use! 🤯

single e2e model to extract images and convert tables, formulas, and more into markdown 📝
try it MohamedRashad/Dots-OCR
merve 
posted an update about 1 month ago
massive releases and tons of FLUX.1 Krea LoRAs this past week!
here's some of the picks, find more models in collection 🫡 merve/releases-august-2-6890c14248203522b7d0267f

LLMs 💬
> Tencent dropped tencent/Hunyuan-7B-Instruct
> Qwen released Qwen/Qwen3-Coder-30B-A3B-Instruct, 30B MoE with 3B active params for coding (OS)

vision/multimodal
> RedNote released rednote-hilab/dots.ocr - 3B OCR model (OS)
> Cohere released CohereLabs/command-a-vision-07-2025 - 112B (dense!) VLM for 6 languages
> StepFun-AI shipped stepfun-ai/step3 - 321B MoE VLM (OS)
> Skywork shipped Skywork/Skywork-UniPic-1.5B - new any-to-any model (image+text → image+text) (OS)
merve 
posted an update about 1 month ago
merve 
posted an update about 1 month ago
past week in open AI was insane 🔥 here's some of the picks, find more here merve/releases-july-25-688768ca47fe3693407e02d1

💬 LLMs & VLMs
> Qwen/Qwen3-235B-A22B-Thinking-2507 had a new update (OS)
> Qwen/Qwen3-Coder-480B-A35B-Instruct is out with 480B total 35B active params 🤯 (OS)
> AllenAI dropped an update to allenai/olmOCR-7B-0725 📝
> InternLM released internlm/Intern-S1 - 235B Qwen3 MoE + 6B InternViT encoder (OS)
> OmniSVG/OmniSVG is a new SVG generation VLM (OS)

🖼️ image/video/3D generation
> WanAI released Wan2.2 series - both T2V and I2V 14B models for high-quality video generation (OS) multimodalart/wan-22-688767e313337b434ed55112
> Tencent dropped tencent/HunyuanWorld-1 - image-to-3D scene generation model
merve 
posted an update about 1 month ago
🤯 241B VLM with apache-2.0 license internlm/Intern-S1

internlm released Intern-S1: multimodal reasoning model based on 235B MoE Qwen3 and 6B InternViT 😍

benchmarks look great (👑 best model ✅ best open model)
merve 
posted an update about 1 month ago
so many open LLMs and image LoRAs dropped past week, here's some picks for you 🫡 merve/releases-july-18-687e3fbd2ab9b39c51f9238b

LLMs
> ByteDance released a bunch of translation models called Seed-X-RM (7B) ByteDance-Seed/Seed-X-RM-7B
> NVIDIA released reasoning models, of which the 32B surpasses the giant Qwen3-235B, with cc-by-4.0 license 👏 nvidia/openreasoning-nemotron-687730dae0170059860f1f01
> LG released a new EXAONE model (32B) LGAI-EXAONE/EXAONE-4.0-32B

VLMs/any-to-any
> vidore/colqwen-omni-v0.1 is a new any-to-any retriever (MIT)
> HiDream-ai/HiDream-E1-1 is an image+text in, image+text out model (MIT)

LoRAs
> There's a bunch of LoRAs based on Flux Kontext, gotta check out the collection 🤠
AtAndDev 
posted an update about 1 month ago
Qwen 3 Coder is a personal attack on K2, and I love it.
It achieves near SOTA on LCB while not having reasoning.
Finally people are understanding that reasoning isn't necessary for high benches...

Qwen ftw!

DECENTRALIZE DECENTRALIZE DECENTRALIZE
merve 
posted an update about 2 months ago
merve 
posted an update about 2 months ago
merve 
posted an update about 2 months ago
Fine-tune Gemma3n on videos with audio on a Colab A100 🔥
Just dropped the notebook where you can learn how to fine-tune Gemma3n on images+audio+text at the same time!

keep in mind, it's made for educational purposes 🫡 we do LoRA, audio resampling & video downsampling to be able to train in under 40GB of VRAM

stretch modalities and unfreeze layers as you wish! 🙏🏻 merve/smol-vision
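For a feel of what video downsampling can mean here, below is a minimal sketch that keeps a fixed number of evenly spaced frames per clip. This is an assumption for illustration, not the notebook's actual method: `sample_frame_indices` and its parameters are hypothetical names, and the notebook may downsample differently (for example by also reducing resolution).

```python
def sample_frame_indices(num_frames: int, target: int) -> list[int]:
    """Pick `target` evenly spaced frame indices from a clip of `num_frames` frames."""
    if num_frames <= 0 or target <= 0:
        return []
    # Short clips are kept whole rather than padded.
    if target >= num_frames:
        return list(range(num_frames))
    step = num_frames / target
    # Take the first frame of each of `target` equal-length segments.
    return [int(i * step) for i in range(target)]
```

Fewer frames per clip means fewer image tokens per training example, which is one of the knobs that keeps memory use down.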
merve 
posted an update about 2 months ago
past week had huuuge releases 💗
here's our picks 🔥 find more models, datasets, demos here merve/releases-july-11-68750452c358c98b0fa663f7

> moonshotai/Kimi-K2-Instruct is the new sota LLM with 1T total 32B active parameters 🤯

> HuggingFaceTB/SmolLM3-3B is the new best LM for its size, offers thinking mode 💭 as well as the dataset HuggingFaceTB/smoltalk2

> Alibaba-NLP/WebSailor-3B is the new agentic LLM for complex browsing

> Google DeepMind released medical vision LMs with an agentic doctor-patient app google/medgemma-release-680aade845f90bec6a3f60c4

> fal released a LoRA to improve details on face images fal/Realism-Detailer-Kontext-Dev-LoRA