prithivMLmods 
posted an update 7 days ago
You can now try the latest state-of-the-art multimodal vision-language models from the Qwen3-VL series in demos on Hugging Face Spaces, including the 4B, 8B, and 30B variants (Instruct, plus 4B-Thinking). I’ve also uploaded the weights for the abliterated variants of these models, up to 30B parameters. Check out the Spaces and model links below! 🤗🔥

✨ Qwen3-VL[4B,8B]: prithivMLmods/Qwen3-VL-Outpost
✨ Qwen3-VL-30B-A3B-Demo: prithivMLmods/Qwen3-VL-HF-Demo
✨ Collection: prithivMLmods/multimodal-implementations-67c9982ea04b39f0608badb0

Qwen3-VL Abliterated Model Collection [ Version 1.0 ]

✨ Qwen3-VL-8B-Instruct-abliterated: prithivMLmods/Qwen3-VL-8B-Instruct-abliterated
✨ Qwen3-VL-4B-Instruct-abliterated: prithivMLmods/Qwen3-VL-4B-Instruct-abliterated
✨ Qwen3-VL-8B-Thinking-abliterated: prithivMLmods/Qwen3-VL-8B-Thinking-abliterated
✨ Qwen3-VL-4B-Thinking-abliterated: prithivMLmods/Qwen3-VL-4B-Thinking-abliterated
✨ Qwen3-VL-30B-A3B-Instruct-abliterated: prithivMLmods/Qwen3-VL-30B-A3B-Instruct-abliterated
✨ Qwen3-VL-30B-A3B-Thinking-abliterated: prithivMLmods/Qwen3-VL-30B-A3B-Thinking-abliterated

⚡Collection: prithivMLmods/qwen3-vl-abliteration-oct-1625-68f0e3e567ef076594605fac

Note: This is version 1.0 of the abliterated Qwen3-VL series. It may perform sub-optimally in some cases; if you encounter any issues, please open a discussion.
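
For quick local testing, here is a minimal single-image inference sketch; it assumes the abliterated checkpoints keep the standard Qwen3-VL chat template and load through transformers' generic image-text classes (verify against the model cards):

```python
# Minimal single-image inference with an abliterated Qwen3-VL checkpoint.
# Assumes a transformers release with Qwen3-VL support; the generic Auto
# classes resolve the concrete model class automatically.
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "prithivMLmods/Qwen3-VL-4B-Instruct-abliterated"
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "https://example.com/sample.jpg"},  # placeholder URL
        {"type": "text", "text": "Describe this image in detail."},
    ],
}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

with torch.inference_mode():
    output_ids = model.generate(**inputs, max_new_tokens=256)

# Decode only the newly generated tokens, not the prompt.
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```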
prithivMLmods 
posted an update 9 days ago
Introducing Image-Guard-2.0, an experimental, lightweight vision-language encoder model with under 100M parameters, built on SigLIP2 (siglip2-base-patch16-224). Designed for multi-label image classification, it functions as an image safety system, serving as an image guard or moderator across a wide range of categories, from anime to realistic imagery.

⚡blog-article: https://huggingface.co/blog/prithivMLmods/image-guard-models

It also performs strict moderation and filtering of artificially synthesized content, demonstrating strong detection and handling of explicit images. Image-Guard-2.0 delivers robust performance in streamlined scenarios, ensuring reliable and effective classification across diverse visual inputs.
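
For reference, a minimal scoring sketch, assuming the checkpoint exposes a standard classification head loadable via the Auto classes and that multi-label scores are a sigmoid over the logits (the repo id below is assumed; check the model card):

```python
# Multi-label safety scoring sketch for Image-Guard-2.0. Assumes the checkpoint
# ships a SigLIP2-based classification head loadable via the Auto classes;
# the repo id and any score thresholds should be taken from the model card.
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

model_id = "prithivMLmods/Image-Guard-2.0"  # assumed repo id
processor = AutoImageProcessor.from_pretrained(model_id)
model = AutoModelForImageClassification.from_pretrained(model_id)

image = Image.open("input.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

with torch.inference_mode():
    logits = model(**inputs).logits

# Multi-label setup: each category gets an independent sigmoid probability.
probs = torch.sigmoid(logits)[0]
for idx, p in enumerate(probs.tolist()):
    print(f"{model.config.id2label[idx]}: {p:.3f}")
```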
prithivMLmods 
posted an update 12 days ago
Qwen3-VL-30B-A3B-Instruct, the next-generation vision-language model in the Qwen series, delivers comprehensive upgrades across the board, including superior text understanding and generation, deeper visual perception and reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities. The demo is now live. 🤗🔥

⚡ Space / App: prithivMLmods/Qwen3-VL-HF-Demo

The demo supports a wide range of tasks, including Image Inference, Video Inference, PDF Inference, Image Captioning (VLA), and GIF Inference.
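
If you want to drive the Space programmatically instead of through the UI, here is a small gradio_client sketch; the endpoint names are defined by the Space itself, so inspect them with view_api() rather than assuming them:

```python
# Drive the Space programmatically with gradio_client. Endpoint names and
# argument order are defined by the Space, so list them first with view_api()
# instead of hard-coding an api_name.
from gradio_client import Client, handle_file

client = Client("prithivMLmods/Qwen3-VL-HF-Demo")
client.view_api()  # prints available endpoints and their input signatures

# Illustrative call shape only; replace api_name and arguments with what
# view_api() reports. Local files are wrapped with handle_file().
# result = client.predict(
#     handle_file("page.png"), "Describe this image.", api_name="/predict"
# )
# print(result)
```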

⚡ Collection: prithivMLmods/multimodal-implementations-67c9982ea04b39f0608badb0

Thanks for granting the blazing-fast Zero GPU access, @merve 🙏

⚡ Other Pages

> Github: https://github.com/prithivsakthiur/qwen3-vl-hf-demo
> Multimodal VLMs July'25 : prithivMLmods/multimodal-vlms-until-july25-688312e6b840e1e156f13027
> VL caption — < Sep 15 ’25 : prithivMLmods/vl-caption-sep-15-25-68c7f6d737985c63c13e2391
> Multimodal VLMs - Aug'25 : prithivMLmods/multimodal-vlms-aug25-68a56aac39fe8084f3c168bd

To know more about it, visit the app page or the respective model page!!
prithivMLmods 
posted an update 16 days ago
Introducing the next-gen version of DeepCaption-VLA (v2.0) — an advanced, multimodal model based on Qwen2.5-VL, specialized for Image Captioning and Vision Language Attribution (VLA). This enhanced release focuses on generating precise, attribute-rich captions that capture visual properties, object attributes, and scene details across diverse image types and aspect ratios. Version 2.0 introduces significant improvements in multilingual inference, delivering higher captioning quality and attribution accuracy in languages including Chinese (Zh), Thai (Th), and more.

🤗 DeepCaption-VLA (v2.0) : prithivMLmods/DeepCaption-VLA-V2.0-7B
🫱 Collection : prithivMLmods/vlm-20-oct-0825-68e606aa6e3993be8a3b1d51
⭐ GitHub (notebook) : https://github.com/PRITHIVSAKTHIUR/Multimodal-Outpost-Notebooks/blob/main/DeepCaption_VLA_V2_0_7B/DeepCaption_VLA_V2_0_7Bipynb.ipynb
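
For local use, a minimal captioning sketch, assuming the model keeps the Qwen2.5-VL processor and chat template of its base (see the notebook above for the full 4-bit T4 setup):

```python
# Captioning sketch for DeepCaption-VLA-V2.0-7B, assuming it keeps the
# Qwen2.5-VL processor and chat template of its base model.
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "prithivMLmods/DeepCaption-VLA-V2.0-7B"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("photo.jpg").convert("RGB")
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Write a precise, attribute-rich caption for this image."},
    ],
}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=256)

caption = processor.batch_decode(
    out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(caption)
```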

Other Pages⚡

➥ Multimodal VLMs July'25 : prithivMLmods/multimodal-vlms-until-july25-688312e6b840e1e156f13027
➥ VL caption — < Sep 15 ’25 : prithivMLmods/vl-caption-sep-15-25-68c7f6d737985c63c13e2391
➥ Multimodal VLMs - Aug'25 : prithivMLmods/multimodal-vlms-aug25-68a56aac39fe8084f3c168bd

To know more about it, visit the app page or the respective model page!!
prithivMLmods 
posted an update 17 days ago
I’ve built the new Image Studio with the Gemini image generation models for multiple tasks: the imagen-4.0-fast-generate-001 model for Image Generation (Text-to-Image) and Multi-Image Editing (Image-to-Image), and Draw-to-Image powered by gemini-2.5-flash-image (aka Nano Banana).

⭐ Gemini-Image-Studio: prithivMLmods/Gemini-Image-Studio (Latest)
🤞 Old-App: prithivMLmods/Nano-Banana-AIO
🥊 GitHub: https://github.com/prithivsakthiur/gemini-image-studio-hf

To proceed, you need to add your Gemini API key. Your API key is stored only for the duration of your session and will be lost when you reload or exit the page. It will not be shared or exposed anywhere.
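
Outside the Space, the same models can be called directly with the google-genai SDK; a rough text-to-image sketch is below (double-check the response field names against the SDK docs for your installed version):

```python
# Rough sketch of calling the same models directly with the google-genai SDK
# (pip install google-genai). The key is read from the environment, never
# hard-coded; response field names may differ across SDK versions.
import os
from google import genai

client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

# Text-to-image with the Imagen model used by the Space.
result = client.models.generate_images(
    model="imagen-4.0-fast-generate-001",
    prompt="a cozy reading nook with warm morning light",
)
with open("imagen_out.png", "wb") as f:
    f.write(result.generated_images[0].image.image_bytes)
```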
prithivMLmods 
posted an update 21 days ago
Try the Hugging Face Space demo for Logics-MLLM/Logics-Parsing, the latest multimodal VLM from the Logics Team at Alibaba Group. It enables end-to-end document parsing with precise content extraction in markdown format, and it also generates a clean HTML representation of the document while preserving its logical structure. 🤗🔥

Additionally, I’ve integrated one of my recent works — prithivMLmods/Gliese-OCR-7B-Post1.0 — which also excels at document comprehension.

⭐ Space / App : prithivMLmods/VLM-Parsing
📄 Technical Report by the Logics Team, Alibaba Group : Logics-Parsing Technical Report (2509.19760)
🖖 MM: VLM-Parsing: prithivMLmods/mm-vlm-parsing-68e33e52bfb9ae60b50602dc
⚡ Collections : prithivMLmods/multimodal-implementations-67c9982ea04b39f0608badb0

Other Pages:

➔ Multimodal VLMs - July'25 : prithivMLmods/multimodal-vlms-until-july25-688312e6b840e1e156f13027
➔ Multimodal VLMs - Aug'25 : prithivMLmods/multimodal-vlms-aug25-68a56aac39fe8084f3c168bd
➔ VL caption — < Sep 15 ’25 : prithivMLmods/vl-caption-sep-15-25-68c7f6d737985c63c13e2391

To know more about it, visit the app page or the respective model page!!
prithivMLmods 
posted an update 25 days ago
Try Banana Zoom, an advanced image enhancement web app that lets users select regions of an image for AI-powered upscaling and detail refinement. Using Google’s nano-banana model, it analyzes selections, generates context-aware enhancements, and produces high-resolution outputs. Simply drag and drop or upload images, make precise or fixed-size selections, and watch improvements appear in real time with smooth zoom and pixel-dissolve effects.

Space / App: prithivMLmods/Banana-Zoom
Collection: https://huggingface.co/collections/prithivMLmods/image-gen-apps-diffusion-lastupdated-09-23-68a2f4c5ef3e5e394eacc20a
GitHub: https://github.com/prithivsakthiur/banana-zoom

Your API key is automatically discarded once you refresh or exit the app, so each user's key is cycled in this way.
prithivMLmods 
posted an update about 1 month ago
Photo-Mate-i2i: a Space for experimenting with image manipulation using Kontext adapters, including Photo-Restore-i2i, PhotoCleanser-i2i, Polaroid-Warm-i2i, Yarn-Photo-i2i, Monochrome-Pencil, and more. Try out the demo, and to learn more, visit the app page or the respective model pages!

⚡Demo: prithivMLmods/Photo-Mate-i2i
⚙️How to Use: prithivMLmods/Photo-Mate-i2i#2
👨‍🔧i2i-Kontext(Experimental LoRAs): prithivMLmods/i2i-kontext-exp-68ce573b5c0623476b636ec7

prithivMLmods 
posted an update about 1 month ago
Dropping some experimental adapters for FLUX.1-Kontext-dev, including Photo-Restore-i2i, PhotoCleanser-i2i, Polaroid-Warm-i2i, Yarn-Photo-i2i, and Monochrome-Pencil. These were trained under various settings with minimal image pairs to achieve optimal results. The end-image pairs in the dataset were synthesized using Gemini-2.5-Flash-Image-Preview and other models. 🤗✨

prithivMLmods/PhotoCleanser-i2i: Remove objects while preserving the rest of the image.
prithivMLmods/Photo-Restore-i2i: Restore old photos into moderately colorized, detailed images.
prithivMLmods/Polaroid-Warm-i2i: Seamless vintage Polaroid-style images with warm, faded tones.
prithivMLmods/Yarn-Photo-i2i: Convert images into yarn-stitched artwork while retaining key details.
prithivMLmods/Monochrome-Pencil: Turn images into monochrome pencil sketches while keeping original features.

✨Note: All the above models share the same auto-labeling multimodal VLM captioning model, prithivMLmods/DeepCaption-VLA-7B, which is used for refining edit instructions and accurately understanding attributions for the generations.

✨Collection: prithivMLmods/i2i-kontext-exp-68ce573b5c0623476b636ec7
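
For local experimentation, a minimal image-to-image sketch, assuming a recent diffusers release with FluxKontextPipeline and that the adapters load via the standard load_lora_weights path:

```python
# Image-to-image editing sketch: FLUX.1-Kontext-dev plus one of the adapters.
# Assumes a recent diffusers release with FluxKontextPipeline and standard
# PEFT-format LoRA weights in the adapter repos.
import torch
from diffusers import FluxKontextPipeline
from diffusers.utils import load_image

pipe = FluxKontextPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-Kontext-dev", torch_dtype=torch.bfloat16
).to("cuda")
pipe.load_lora_weights("prithivMLmods/PhotoCleanser-i2i")

source = load_image("old_photo.png")
edited = pipe(
    image=source,
    prompt="remove the lamppost while preserving the rest of the scene",
    guidance_scale=2.5,
    num_inference_steps=28,
).images[0]
edited.save("cleaned.png")
```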

To know more about it, visit the app page or the respective model page!!
prithivMLmods 
posted an update about 1 month ago
Many of you pinged me asking to make nano-banana-aio available on hf.co/spaces, so I’ve ported the app’s tech stack to make it deployable on Spaces. (It can be accessed with your own Gemini API key.) 🤗⭐️

✦ Yes, it is now available on Spaces: prithivMLmods/Nano-Banana-AIO

The Nano Banana AIO (All-in-One) app offers seamless image manipulation features, including single/multiple image adaptation, a canvas for free-style drawing-to-image generation, and standard text-to-image generation.

All in One Banana for you! 😉
prithivMLmods 
posted an update about 1 month ago
I'm a Hugging Face Fellow now, guys!🤗❤️

With the same passion, trust, and momentum to contribute to the community, I’m excited to do some amazing things to wrap up Q3 and Q4 of 2025. And importantly, I’ve been lucky enough to receive some knowledge and guidance from @merve to build open-source demos and stuff. Thank you for the belief.

Thank you — much love.
Long live open source!

— Prithiv
prithivMLmods 
posted an update about 1 month ago
Introducing Gliese-OCR-7B-Post1.0, a document content-structure retrieval VLM designed for content extraction (OCR) and summarization. This is the third model in the Camel Doc OCR VLM series, following Camel-Doc-OCR-062825. The new version fixes formal table reconstruction issues in both English and Chinese, achieving optimal performance for long-context inference. It also shows significant improvements in LaTeX and Markdown rendering for OCR tasks.

🤗 Gliese-OCR-7B-Post1.0 : prithivMLmods/Gliese-OCR-7B-Post1.0
📌 Gliese-Post1.0 Collection : prithivMLmods/gliese-post10-68c52c4a6ca4935f5259a6d7
⬅️ Previous Versions : prithivMLmods/Camel-Doc-OCR-062825
🧨 Gliese-OCR-7B-Post1.0 (4-bit) Notebook Demo on T4 : prithivMLmods/Gliese-OCR-7B-Post1.0
📖 GitHub [Gliese-OCR-7B-Post1.0(4-bit)-reportlab] : https://tinyurl.com/ys7zuerc
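
The 4-bit T4 notebook essentially boils down to loading the model with a bitsandbytes config; here is a minimal sketch of that load, assuming the checkpoint keeps its Qwen2.5-VL base architecture:

```python
# Minimal 4-bit load for Gliese-OCR-7B-Post1.0 (fits on a T4-class GPU).
# Assumes the checkpoint keeps its Qwen2.5-VL base architecture and that
# bitsandbytes is installed.
import torch
from transformers import (AutoProcessor, BitsAndBytesConfig,
                          Qwen2_5_VLForConditionalGeneration)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,  # T4 lacks native bfloat16
)
model_id = "prithivMLmods/Gliese-OCR-7B-Post1.0"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)
# Inference then follows the usual Qwen2.5-VL chat-template flow: build a
# message with a document image plus a prompt such as
# "Extract the document content as Markdown." and call model.generate().
```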

Other Collections:

➔ Multimodal Implementations : prithivMLmods/multimodal-implementations-67c9982ea04b39f0608badb0
➔ Multimodal VLMs - Aug'25 : prithivMLmods/multimodal-vlms-aug25-68a56aac39fe8084f3c168bd
➔ Multimodal VLMs - July'25 : prithivMLmods/multimodal-vlms-until-july25-688312e6b840e1e156f13027

To know more about it, visit the app page or the respective model page!!
prithivMLmods 
posted an update about 1 month ago
POINTS-Reader is a powerful, distillation-free vision-language model for end-to-end document conversion that sets new SoTA benchmarks. The demo is now available on HF (Extraction, Preview, Documentation). The input consists of a fixed prompt and a document image, while the output contains only a string: the text extracted from the document image. 🔥🤗

✦ Space/App: prithivMLmods/POINTS-Reader-OCR
✦ Model: tencent/POINTS-Reader
✦ Paper: https://arxiv.org/pdf/2509.01215

🤗 The app is done and ready to go brrrr with Zero GPU. Thank you @merve

To know more about it, visit the app page or the respective model page!!
prithivMLmods 
posted an update about 1 month ago
Build something cool with Nano Banana aka Gemini 2.5 Flash Image AIO [All-in-One]. Draw and transform on canvas, edit images, and generate images—all in one place!🍌

✦︎ Constructed with the Gemini API (GCP). Try it here: prithivMLmods/Nano-Banana-AIO (Added the Space recently! - Sep 18 '25)
prithivMLmods 
posted an update about 2 months ago
Dropped HeadshotX: a super-realistic headshot adapter for Qwen/Qwen-Image, an image generation model by Qwen. It is an advanced LoRA adaptation of the Qwen-Image model and an upgraded version of prithivMLmods/Qwen-Image-Studio-Realism, offering more precise portrait rendering with a strong focus on realism. The model was trained on diverse face types from across the world (11 types × 5 face variations: Asian, Hispanic, Caucasian, Latina, Middle Eastern, etc.), labeled with florence2-en and caption-optimized using prithivMLmods/DeepCaption-VLA-7B.

⮞ Model🤗: prithivMLmods/Qwen-Image-HeadshotX

⮞ The Previous Adapter (LoRA): prithivMLmods/Qwen-Image-Studio-Realism

⮞ Collection: prithivMLmods/qwen-image-exp-lora-68a978fe11400bc3165b0c4d
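
For local generation, a minimal sketch, assuming Qwen-Image resolves through diffusers' DiffusionPipeline and the adapter loads with load_lora_weights:

```python
# Text-to-image sketch: Qwen-Image with the HeadshotX LoRA applied.
# DiffusionPipeline resolves the concrete pipeline class; assumes a diffusers
# release with Qwen-Image support and standard LoRA weights in the adapter repo.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "Qwen/Qwen-Image", torch_dtype=torch.bfloat16
).to("cuda")
pipe.load_lora_weights("prithivMLmods/Qwen-Image-HeadshotX")

image = pipe(
    prompt="studio headshot of a middle-aged man, soft key light, neutral background",
    num_inference_steps=28,
).images[0]
image.save("headshot.png")
```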

To know more about it, visit the app page or the respective model page!!
prithivMLmods 
posted an update about 2 months ago
Comparing: DeepCaption-VLA-7B, built on Qwen2.5-VL-7B-Instruct, is tailored for image captioning and vision-language attribution, focusing on precise, descriptive captions of visual properties, object attributes, and scene details. In contrast, Qwen2.5-VL-7B-Abliterated-Caption-it is fine-tuned for abliterated captioning, generating highly detailed descriptions across diverse visual categories.

Models🤗
✦ DeepCaption-VLA-7B : prithivMLmods/DeepCaption-VLA-7B
✦ Qwen2.5-VL-7B-Abliterated-Caption-it : prithivMLmods/Qwen2.5-VL-7B-Abliterated-Caption-it

Spaces⛵
➜ VisionScope-R2 : prithivMLmods/VisionScope-R2
➜ Qwen2.5-VL-Outpost : https://huggingface.co/spaces/prithivMLmods/Qwen2.5-VL-Outpost

Collection🗞️
DeepCaption attr. : prithivMLmods/deepcaption-attr-68b041172ebcb867e45c556a
VL Abliterated-Caption : prithivMLmods/vl-abliterated-caption-68a0443b63182e97a15c47a3
Multimodal VLMs - Until July'25 : prithivMLmods/multimodal-vlms-until-july25-688312e6b840e1e156f13027
Multimodal VLMs - Aug'25 : prithivMLmods/multimodal-vlms-aug25-68a56aac39fe8084f3c168bd

GitHub↗️
> DeepCaption-VLA-7B [4bit-notebook demo] : https://github.com/PRITHIVSAKTHIUR/Multimodal-Outpost-Notebooks/blob/main/DeepCaption-VLA-7B%5B4bit%20-%20notebook%20demo%5D/DeepCaption-VLA-7B.ipynb
> Qwen2.5-VL-3B-Abliterated-Caption-it(caption) : https://github.com/PRITHIVSAKTHIUR/Multimodal-Outpost-Notebooks/blob/main/Qwen2.5-VL-3B-Abliterated-Caption-it(caption)/Qwen2_5_VL_3B_Abliterated_Caption_it.ipynb

The community GPU grant was given by Hugging Face — special thanks to them. 🤗🚀

To know more about it, visit the app page or the respective model page!!
prithivMLmods 
posted an update about 2 months ago
FastVLMs by Apple are the talk of the week for edge device VLMs and also for consumer-grade VLMs on the Hub. They have some impressive demos available on the Hub for live captioning and inference tasks. Meanwhile, I’m still exploring one of the coolest edge-device multimodal releases—Liquid AI’s LFM2-VL (450M and 1.6B). I’ve also made a live camera video inference demo, which is capable of running on Colab’s free-tier T4 GPU.

🤗Live Captioning Notebooks:
➠ LiquidAI LFM2 VL 1.6B Live Cam: https://github.com/PRITHIVSAKTHIUR/Multimodal-Outpost-Notebooks/blob/main/LiquidAI-LFM2-VL-Live-Cam/LiquidAI_LFM2_VL_1_6B_Live_Cam.ipynb

➠ LiquidAI LFM2 VL 450M Live Cam: https://github.com/PRITHIVSAKTHIUR/Multimodal-Outpost-Notebooks/blob/main/LiquidAI-LFM2-VL-Live-Cam/LiquidAI_LFM2_VL_450M_Live_Cam.ipynb

✨I also made a demo for the FastVLM Live Captioning Notebook.
➠ FastVLM 0.5B Live Cam: https://github.com/PRITHIVSAKTHIUR/Multimodal-Outpost-Notebooks/blob/main/Apple-FastVLM-0.5B-Live-Cam/apple_FastVLM_0_5B_live_cam.ipynb

↗️For more notebooks, kindly visit the following repositories.
➠ Multimodal Outpost Notebooks: https://github.com/PRITHIVSAKTHIUR/Multimodal-Outpost-Notebooks

Feel free to fork, modify, and explore!
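
All the live-cam notebooks reduce to the same loop: grab a frame, hand it to the VLM, show the caption. Below is a schematic version of that loop; caption_frame() is a hypothetical stand-in for whichever model you load (LFM2-VL, FastVLM, ...):

```python
# Schematic version of the live-captioning loop used in the notebooks: grab a
# webcam frame, caption it with a VLM, print the result. caption_frame() is a
# hypothetical placeholder for whichever model you load (LFM2-VL, FastVLM, ...).
import time
import cv2
from PIL import Image

def caption_frame(image: Image.Image) -> str:
    """Placeholder: run your loaded VLM on a single PIL frame."""
    raise NotImplementedError

cap = cv2.VideoCapture(0)  # local webcam; in Colab the frames come from a JS bridge
try:
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # OpenCV returns BGR arrays; VLM processors generally expect RGB PIL images.
        pil_frame = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        print(caption_frame(pil_frame))
        time.sleep(1.0)  # throttle to roughly one caption per second
finally:
    cap.release()
```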
prithivMLmods 
posted an update about 2 months ago
Introducing prithivMLmods/DeepCaption-VLA-7B, a multimodal VLM designed for reasoning with long-shot captions (Captioning and Vision-Language Attribution). It focuses on defining visual properties, object attributes, and scene details across a wide spectrum of images and aspect ratios, generating attribute-rich image captions. The model supports creative, artistic, and technical applications that require detailed descriptions. 🤗🔥

✦︎ Models: prithivMLmods/DeepCaption-VLA-7B, also includes prithivMLmods/DeepAttriCap-VLA-3B, an experimental model for vision-language attribution.

✦︎ Try the demo here: prithivMLmods/VisionScope-R2

✦︎ Try it now on Google Colab, with support for T4 GPUs in 4-bit quant_type: https://github.com/PRITHIVSAKTHIUR/Multimodal-Outpost-Notebooks/blob/main/DeepCaption-VLA-7B%5B4bit%20-%20notebook%20demo%5D/DeepCaption-VLA-7B.ipynb

✦︎ Collection: prithivMLmods/deepcaption-attr-68b041172ebcb867e45c556a


To know more about it, visit the model card of the respective model!
prithivMLmods 
posted an update about 2 months ago
OpenGVLab's InternVL3.5 is a new family of open-source multimodal models with advanced versatility, reasoning, and efficiency. I have created demo notebooks for models ranging from 1B to 4B parameters, available in multiple versions (MPO, Instruct, Pre-trained) and in both "thinking" and "non-thinking" settings, with experimental compatibility for Tesla T4 GPUs.

➠InternVL3_5_2B_MPO_Thinking: https://github.com/PRITHIVSAKTHIUR/Multimodal-Outpost-Notebooks/blob/main/InternVL-3.5-Notebook/InternVL3.5-Thinking/1_InternVL3_5_2B_MPO_Thinking/1_InternVL3_5_2B_MPO_Thinking.ipynb
➠InternVL3_5_1B_Instruct_Thinking: https://github.com/PRITHIVSAKTHIUR/Multimodal-Outpost-Notebooks/blob/main/InternVL-3.5-Notebook/InternVL3.5-Thinking/2_InternVL3_5_1B_Instruct_Thinking/2_InternVL3_5_1B_Instruct_Thinking.ipynb

➠InternVL3_5-1B-MPO: https://github.com/PRITHIVSAKTHIUR/Multimodal-Outpost-Notebooks/blob/main/InternVL-3.5-Notebook/InternVL3_5-MPO/InternVL3_5-1B-MPO/InternVL3_5_1B_MPO.ipynb
➠InternVL3_5-2B-MPO: https://github.com/PRITHIVSAKTHIUR/Multimodal-Outpost-Notebooks/tree/main/InternVL-3.5-Notebook/InternVL3_5-MPO/InternVL3_5-2B-MPO

➠InternVL3_5-1B-Instruct: https://github.com/PRITHIVSAKTHIUR/Multimodal-Outpost-Notebooks/blob/main/InternVL-3.5-Notebook/InternVL3_5-Instruct/InternVL3_5-1B-Instruct/InternVL3_5_1B_Instruct.ipynb
➠InternVL3_5-2B-Instruct: https://github.com/PRITHIVSAKTHIUR/Multimodal-Outpost-Notebooks/blob/main/InternVL-3.5-Notebook/InternVL3_5-Instruct/InternVL3_5-2B-Instruct/InternVL3_5_2B_Instruct.ipynb

➠InternVL3_5-1B-Pretrained: https://github.com/PRITHIVSAKTHIUR/Multimodal-Outpost-Notebooks/blob/main/InternVL-3.5-Notebook/InternVL3_5-Pretrained/InternVL3_5-1B-Pretrained/InternVL3_5_1B_Pretrained.ipynb
➠InternVL3_5-2B-Pretrained: https://github.com/PRITHIVSAKTHIUR/Multimodal-Outpost-Notebooks/blob/main/InternVL-3.5-Notebook/InternVL3_5-Pretrained/InternVL3_5-2B-Pretrained/InternVL3_5_2B_Pretrained.ipynb

Note: these notebooks run without flash_attention.
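
For orientation, a minimal load sketch for one of the small checkpoints on a T4, based on the InternVL family's usual trust_remote_code interface; the repo id and the use_flash_attn argument are assumptions to verify against the model card:

```python
# Load sketch for a small InternVL3.5 checkpoint on a T4 without flash
# attention. The model and its chat API live in remote code, so the repo id
# and the use_flash_attn argument are assumptions to verify on the model card.
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "OpenGVLab/InternVL3_5-1B-Instruct"  # assumed repo id pattern
model = AutoModel.from_pretrained(
    model_id,
    torch_dtype=torch.float16,   # T4 has no native bfloat16
    use_flash_attn=False,        # flash attention disabled, per the note above
    trust_remote_code=True,
    device_map="auto",
).eval()
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Generation then goes through the chat() helper shipped in the remote code,
# e.g. model.chat(tokenizer, pixel_values, question, generation_config); see
# the notebooks above for the preprocessing that produces pixel_values.
```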