DOSOD
A Light-Weight Framework for Open-Set Object Detection with Decoupled Feature Alignment in Joint Space
Yonghao He1,*,🌟 ,
Hu Su2,*,📧,
Haiyong Yu1,*,
Cong Yang3,
Wei Sui1,
Cong Wang1,
Song Liu4,📧
\* Equal contribution, 🌟 Project lead, 📧 Corresponding author
1 D-Robotics,
2 State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation of Chinese Academy of Sciences,
3 BeeLab, School of Future Science and Engineering, Soochow University,
4 School of Information Science and Technology, ShanghaiTech University
[![arxiv paper](https://img.shields.io/badge/arXiv-Paper-red)](https://arxiv.org/abs/2412.14680)
[![license](https://img.shields.io/badge/License-GPLv3.0-blue)](LICENSE)
## 1. Introduction
### 1.1 Brief Introduction of DOSOD
YOLO-World established a new state of the art (SOTA) in open-vocabulary object detection,
and open-vocabulary detection has since been applied in a wide range of scenarios.
Real-time open-vocabulary detection in particular has attracted significant attention.
In our paper, we propose Decoupled Open-Set Object Detection (**DOSOD**), a
practical and highly efficient solution for real-time open-set object detection (OSOD) in robotic systems.
Specifically, DOSOD builds on the YOLO-World pipeline by integrating a vision-language model (VLM) with a detector.
A Multilayer Perceptron (MLP) adaptor converts the text embeddings extracted by the VLM into a joint space,
within which the detector learns the region representations of class-agnostic proposals.
Cross-modality features are aligned directly in the joint space,
which avoids complex feature interactions and thereby improves computational efficiency.
DOSOD functions like a traditional closed-set detector during the testing phase,
effectively bridging the gap between closed-set and open-set detection.
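
The decoupled alignment described above can be illustrated with a minimal NumPy sketch. It is not the released DOSOD implementation: the class `MLPAdaptor`, the two-layer ReLU design, the cosine-similarity scoring, all dimensions, and the random stand-in inputs are assumptions made purely for illustration. The sketch only shows the idea that text embeddings are projected into the joint space once, after which region features are scored against fixed class vectors, much like the classifier of a closed-set detector.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so that dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


class MLPAdaptor:
    """Hypothetical two-layer MLP that maps text embeddings into the joint space."""

    def __init__(self, text_dim, hidden_dim, joint_dim, rng):
        # Randomly initialised weights; a trained model would load learned values.
        self.w1 = rng.standard_normal((text_dim, hidden_dim)) * 0.02
        self.b1 = np.zeros(hidden_dim)
        self.w2 = rng.standard_normal((hidden_dim, joint_dim)) * 0.02
        self.b2 = np.zeros(joint_dim)

    def __call__(self, text_emb):
        hidden = np.maximum(text_emb @ self.w1 + self.b1, 0.0)  # ReLU (assumed)
        return hidden @ self.w2 + self.b2


def align(region_feats, class_vectors):
    """Score every region against every class by cosine similarity (assumed metric)."""
    return l2_normalize(region_feats) @ l2_normalize(class_vectors).T


rng = np.random.default_rng(0)
num_classes, text_dim, hidden_dim, joint_dim, num_regions = 3, 512, 384, 256, 5

# Random stand-ins for the VLM text embeddings and the detector's region features.
text_emb = rng.standard_normal((num_classes, text_dim))
region_feats = rng.standard_normal((num_regions, joint_dim))

adaptor = MLPAdaptor(text_dim, hidden_dim, joint_dim, rng)

# Computed once per vocabulary; at test time these act as fixed class vectors.
class_vectors = adaptor(text_emb)

scores = align(region_feats, class_vectors)  # shape: (num_regions, num_classes)
pred = scores.argmax(axis=1)                 # best-matching class index per region
```

Because the text branch never interacts with image features in this sketch, `class_vectors` can be cached for a given vocabulary and reused across images, which leaves only a matrix product at inference time.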
## 2. Model Overview
Following YOLO-World, we pre-trained DOSOD-S/M/L from scratch on public datasets and conducted zero-shot evaluation on `LVIS minival` and `COCO val2017`.
All pre-trained models are released.
### 2.1 Zero-shot Evaluation on LVIS minival