metadata

title: Conversational Image Recognition Chatbot
emoji: 🌖
colorFrom: indigo
colorTo: green
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: true
license: mit
short_description: Conversational Image Recognition chatbot

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

Conversational Image Recognition Chatbot

Introduction

This application combines natural language processing with image recognition, extracting visual information from an image so that it can discuss that image with the user. Users upload an image and then ask questions about its content in a conversation. The tool is designed to make understanding and analyzing visual content interactive through natural language dialogue.

Solution we offer

● A chatbot that combines a Vision Language Model (VLM) with a zero-shot object detection model.
● The user can upload an image, detect the objects in it, and start a chat session.
● Enhanced spatial understanding of objects in the image, which comes from communication between the two models.
● An image question answering chatbot with built-in object detection.
● Chat history and detection output are both available for interaction within the system.
● Considered 9 representative VLMs across 10 benchmarks on the OpenCompass multimodal leaderboard.
● Relies on some of the best-performing and most widely used object detection models, such as OWLv2 and Grounding DINO.
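
The communication between the two models can be sketched as follows. This is a minimal illustration under assumptions, not the application's actual code: the detection dictionaries follow the layout returned by the Transformers zero-shot object detection pipeline (`label`, `score`, and a `box` with `xmin`/`ymin`/`xmax`/`ymax`), while the helper names `horizontal_position` and `detections_to_context` and the 0.3 score threshold are hypothetical choices made for this example.

```python
def horizontal_position(box, image_width):
    """Classify a box as left, center, or right by its horizontal midpoint."""
    mid = (box["xmin"] + box["xmax"]) / 2
    if mid < image_width / 3:
        return "left"
    if mid > 2 * image_width / 3:
        return "right"
    return "center"


def detections_to_context(detections, image_width, min_score=0.3):
    """Turn detector output into a short text block for the VLM prompt.

    min_score is an assumed threshold; low-confidence boxes are dropped.
    """
    kept = [d for d in detections if d["score"] >= min_score]
    if not kept:
        return "No objects were detected."
    # List the most confident detections first.
    kept.sort(key=lambda d: d["score"], reverse=True)
    lines = [
        f"- {d['label']} ({horizontal_position(d['box'], image_width)} of the image, "
        f"confidence {d['score']:.2f})"
        for d in kept
    ]
    return "Detected objects:\n" + "\n".join(lines)


if __name__ == "__main__":
    # Hand-written sample detections, not real model output.
    sample = [
        {"label": "sofa", "score": 0.82,
         "box": {"xmin": 10, "ymin": 200, "xmax": 300, "ymax": 420}},
        {"label": "lamp", "score": 0.55,
         "box": {"xmin": 520, "ymin": 40, "xmax": 600, "ymax": 300}},
        {"label": "cat", "score": 0.12,
         "box": {"xmin": 300, "ymin": 300, "xmax": 340, "ymax": 340}},
    ]
    print(detections_to_context(sample, image_width=640))
```

The resulting text could be prepended to the user's question before it is passed to the VLM, so that the chat model sees both the image and a coarse description of where each detected object sits.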

How it addresses the problem

● Both models help address the problem because they were pretrained on household datasets.
● Provides strong results and fast inference when a GPU is available.
● The approach is simple and can be implemented with minimal effort.

Unique value propositions

● Lexically and grammatically correct responses.
● Uses state-of-the-art and novel research in the field of image understanding. Everything we have used is open source.
● Flexibility to use different models, with reference documentation.
● A working, hosted application.

Technologies used

Programming languages : Python
Libraries : Transformers, PyTorch, image libraries like PIL
Hardware : Nvidia T4 medium (8 vCPU, 30 GB RAM)
Platforms : Hugging Face, arXiv, other research articles
Deployment tools : Gradio, Streamlit, Hugging Face Spaces hardware

Application demo

● Used google/owlv2-base-patch16-ensemble as the zero-shot object detection model and Qwen/Qwen2-VL-2B-Instruct as the VLM.
● Used the Gradio framework and the Transformers library for development.

Demo links: Hugging Face Space, Google Colab demo

The free-tier Hugging Face Space has 2 vCPU and 16 GB RAM with no GPU, so the chatbot's inference is very slow there. To see the demo, we strongly recommend the Google Colab demo we have provided, which gets you started with minimal effort. Inference is drastically faster on Google Colab with a T4 GPU runtime.
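
Because inference speed depends on whether a GPU is present, a startup check along these lines can pick the device before the models are loaded. This is a sketch under assumptions rather than the contents of app.py: the model identifiers are the ones named above, `pipeline("zero-shot-object-detection", ...)` is the standard Transformers entry point for this task, and the helper names `pick_device` and `load_detector` are hypothetical.

```python
DETECTOR_ID = "google/owlv2-base-patch16-ensemble"
VLM_ID = "Qwen/Qwen2-VL-2B-Instruct"  # named for reference; not loaded in this sketch


def pick_device():
    """Return "cuda" when PyTorch sees a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch missing: fall back so the check itself never fails
    return "cuda" if torch.cuda.is_available() else "cpu"


def load_detector(device):
    """Load the zero-shot detector (downloads the weights on first call)."""
    from transformers import pipeline  # imported lazily: heavy dependency

    return pipeline("zero-shot-object-detection", model=DETECTOR_ID, device=device)


if __name__ == "__main__":
    device = pick_device()
    print(f"Running on {device}")
    if device == "cpu":
        print("No GPU found: expect slow inference, as on the free Spaces tier.")
```

With the detector loaded, a call such as `load_detector(pick_device())(image, candidate_labels=["sofa", "lamp"])` is expected to return a list of dictionaries with `score`, `label`, and `box` keys; the candidate labels here are only examples.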

Image Demos

Live Links

Hugging Face Spaces: Conversational Image Recognition Chatbot
Note: Due to GPU constraints on Hugging Face Spaces, performance may be slower.
Google Colab Notebook: Google Colab Demo
Recommended: Use the Google Colab notebook for the demo, since Hugging Face Spaces has GPU constraints and the application may run slowly there.
