Umitcan Sahin PRO

ucsahin

AI & ML interests

Visual Language Models, Large Language Models, Vision Transformers

Organizations

None yet

ucsahin's activity

Reacted to merve's post with 🔥👀👍 about 24 hours ago

Post

1085

The authors of ColPali trained a retrieval model based on SmolVLM 🤠 vidore/colsmolvlm-alpha
TL;DR:

- ColSmolVLM performs better than ColPali and DSE-Qwen2 on all English tasks

- ColSmolVLM is more memory efficient than ColQwen2 💗

liked a model 1 day ago

AIDC-AI/Marco-o1

Text Generation • Updated 5 days ago • 5.66k • 513

Reacted to ezgikorkmaz's post with 🚀 7 days ago

Post

2073

I recently wrote a survey on deep reinforcement learning. The paper is a compact guide to understanding some of the key concepts in reinforcement learning. Find the paper below:

Paper: https://arxiv.org/pdf/2401.02349v2
Twitter: https://x.com/EzgiKorkmazAI/status/1851934161138798615

liked 2 datasets 8 days ago

microsoft/orca-agentinstruct-1M-v1

Viewer • Updated 28 days ago • 1.05M • 3.39k • 369

mlabonne/orca-agentinstruct-1M-v1-cleaned

Viewer • Updated 10 days ago • 1.05M • 1.1k • 46

liked a model 10 days ago

google/siglip-so400m-patch16-256-i18n

Zero-Shot Image Classification • Updated 10 days ago • 686 • 25

upvoted a collection 11 days ago

SigLIP

Collection

Contrastive (sigmoid) image-text models from https://arxiv.org/abs/2303.15343 • 10 items • Updated 11 days ago • 37

Reacted to merve's post with 🤗👀🔥 11 days ago

Post

4795

OmniVision-968M: a new local VLM for edge devices, fast & small but performant
💨 a new vision language model with 9x fewer image tokens, super efficient
📖 aligned with DPO for reducing hallucinations
⚡️ Apache 2.0 license 🔥

Demo hf.co/spaces/NexaAIDev/omnivlm-dpo-demo
Model NexaAIDev/omnivision-968M

4 replies

Reacted to merve's post with 🔥 13 days ago

Post

1952

Amazing past few days in open ML: it's raining coding models, so let's have a recap 🌧️ Find all models and datasets here: merve/nov-15-releases-67372d0ebdc354756a52ecd0

Models
💻 Coding: The Qwen team released two Qwen2.5-Coder checkpoints, 32B and 7B. Infly released OpenCoder: 1.5B and 8B coding models, along with instruction-SFT'd versions and their datasets! 💗

🖼️ Image/Video Gen: Alibaba vision lab released In-context LoRA: 10 LoRA models on different themes, based on Flux. Also, Mochi, the SOTA video generation model with an Apache 2.0 license, is now natively supported in diffusers 👏

🖼️ VLMs/Multimodal: NexaAIDev released Omnivision 968M, a new vision language model aligned with DPO to reduce hallucinations, which also comes with GGUF checkpoints 👏 Microsoft released LLM2CLIP, a new CLIP-like model with a longer context window, allowing complex text inputs and better search

🎮 AGI?: Etched released Oasis 500M, a diffusion-based open-world model that takes keyboard input and outputs gameplay 🤯

Datasets
Common Corpus: a permissively licensed text dataset of 2T tokens in EN/FR, drawn from various sources: code, science, finance, culture 📖

upvoted a collection 13 days ago

Nov 15 Releases 🍂

Collection

15 items • Updated 13 days ago • 6

liked a model 13 days ago

NexaAIDev/omnivision-968M

Updated 41 minutes ago • 9.7k • 431

Reacted to merve's post with 🔥 27 days ago

Post

5382

Another great week in open ML!
Here's a small recap 🫰🏻

Model releases
⏯️ Video Language Models
AI at Meta released Vision-CAIR/LongVU_Qwen2_7B, a new state-of-the-art long-video LM based on DINOv2, SigLIP, Qwen2 and Llama 3.2

💬 Small language models
Hugging Face released HuggingFaceTB/SmolLM2-1.7B, a family of new smol language models with Apache 2.0 license that come in sizes 135M, 360M and 1.7B, along with datasets.
Meta released facebook/MobileLLM-1B, a new family of on-device LLMs of sizes 125M, 350M and 600M

🖼️ Image Generation
Stability AI released stabilityai/stable-diffusion-3.5-medium, a 2B model with commercially permissive license

🖼️💬Any-to-Any
gpt-omni/mini-omni2, the closest reproduction of GPT-4o, has been released: a new LLM that takes image, text, and audio input and outputs speech!

Dataset releases
🖼️ Spawning/PD12M, a new captioning dataset of 12.4 million examples generated using Florence-2

liked 2 models 30 days ago

gokaygokay/Flux-Seamless-Texture-LoRA

Text-to-Image • Updated 30 days ago • 1.14k • 17

gokaygokay/Flux-Double-Exposure-LoRA

Text-to-Image • Updated about 1 month ago • 1.04k • 12

Reacted to merve's post with 🚀🔥 about 1 month ago

Post

3456

Microsoft released a groundbreaking model that can be used for web automation, with MIT license 🔥 microsoft/OmniParser

An interesting highlight for me was the model's capabilities on Mind2Web (a benchmark for web navigation), which unlock agentic behavior for RPA agents.

No need for hefty web automation pipelines that break when the website/app design changes! Amazing work.

Lastly, the authors also fine-tune this model on open-set detection of interactable regions to see whether it can serve as a plug-in for VLMs, and it actually outperforms off-the-shelf open-set detectors like GroundingDINO. 👏

OmniParser is a state-of-the-art UI parsing/understanding model that outperforms GPT4V in parsing.