DOSOD
A Light-Weight Framework for Open-Set Object Detection with Decoupled Feature Alignment in Joint Space

Yonghao He¹,*,🌟, Hu Su²,*,📧, Haiyong Yu¹,*, Cong Yang³, Wei Sui¹, Cong Wang¹, Song Liu⁴,📧

* Equal contribution, 🌟 Project lead, 📧 Corresponding author

¹ D-Robotics,
² State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation of Chinese Academy of Sciences,
³ BeeLab, School of Future Science and Engineering, Soochow University,
⁴ School of Information Science and Technology, ShanghaiTech University

1. Introduction

1.1 Brief Introduction of DOSOD

YOLO-World established a new state of the art in open-vocabulary object detection, and since then open-vocabulary detection has been applied widely, with real-time variants attracting particular attention. In our paper, we propose Decoupled Open-Set Object Detection (DOSOD), a practical and highly efficient solution for real-time open-set object detection (OSOD) in robotic systems. DOSOD builds on the YOLO-World pipeline and integrates a vision-language model (VLM) with a detector. A multilayer perceptron (MLP) adaptor maps the text embeddings extracted by the VLM into a joint space, in which the detector learns region representations of class-agnostic proposals. Cross-modality features are aligned directly in this joint space, which avoids complex feature interactions and improves computational efficiency. At test time DOSOD operates like a traditional closed-set detector, effectively bridging the gap between closed-set and open-set detection.
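The decoupled alignment described above can be illustrated with a minimal NumPy sketch. It is not the released implementation: the two-layer ReLU adaptor, the L2 normalization, the cosine-similarity scoring, all dimensions, and the random stand-in embeddings are assumptions made only for illustration. The point it shows is that the text branch can be computed once offline, after which classification reduces to a single matrix product, as in a closed-set detector.

```python
import numpy as np

rng = np.random.default_rng(0)


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def mlp_adaptor(text_emb, w1, b1, w2, b2):
    """Hypothetical two-layer MLP mapping text embeddings into the joint space."""
    hidden = np.maximum(text_emb @ w1 + b1, 0.0)  # ReLU activation
    return hidden @ w2 + b2


# Hypothetical sizes; the real model dimensions may differ.
num_classes, text_dim, hidden_dim, joint_dim, num_regions = 80, 512, 256, 256, 100

# Stand-ins for VLM text embeddings and (normally learned) adaptor weights.
text_emb = rng.standard_normal((num_classes, text_dim))
w1 = rng.standard_normal((text_dim, hidden_dim)) * 0.02
b1 = np.zeros(hidden_dim)
w2 = rng.standard_normal((hidden_dim, joint_dim)) * 0.02
b2 = np.zeros(joint_dim)

# Offline step: project the vocabulary once and keep it as fixed classifier weights.
class_weights = l2_normalize(mlp_adaptor(text_emb, w1, b1, w2, b2))

# Online step: stand-ins for the detector's region embeddings in the joint space.
region_emb = l2_normalize(rng.standard_normal((num_regions, joint_dim)))

# Direct alignment: one matrix product, no cross-modal interaction at test time.
logits = region_emb @ class_weights.T  # shape (num_regions, num_classes)
pred = logits.argmax(axis=1)           # best-matching class index per region
```

Because `class_weights` depends only on the vocabulary, changing the category list in this sketch only requires recomputing that one matrix; the detector branch is untouched.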

2. Model Overview

Following YOLO-World, we pre-trained DOSOD-S/M/L from scratch on public datasets and ran zero-shot evaluation on LVIS minival and COCO val2017. All pre-trained models are released.

2.1 Zero-shot Evaluation on LVIS minival

model	Pre-train Data	Size	AP^mini	AP_r	AP_c	AP_f	weights
YOLO-Worldv1-S (repo)	O365+GoldG	640	24.3	16.6	22.1	27.7	HF Checkpoints 🤗
YOLO-Worldv1-M (repo)	O365+GoldG	640	28.6	19.7	26.6	31.9	HF Checkpoints 🤗
YOLO-Worldv1-L (repo)	O365+GoldG	640	32.5	22.3	30.6	36.1	HF Checkpoints 🤗
YOLO-Worldv1-S (paper)	O365+GoldG	640	26.2	19.1	23.6	29.8	HF Checkpoints 🤗
YOLO-Worldv1-M (paper)	O365+GoldG	640	31.0	23.8	29.2	33.9	HF Checkpoints 🤗
YOLO-Worldv1-L (paper)	O365+GoldG	640	35.0	27.1	32.8	38.3	HF Checkpoints 🤗
YOLO-Worldv2-S	O365+GoldG	640	22.7	16.3	20.8	25.5	HF Checkpoints 🤗
YOLO-Worldv2-M	O365+GoldG	640	30.0	25.0	27.2	33.4	HF Checkpoints 🤗
YOLO-Worldv2-L	O365+GoldG	640	33.0	22.6	32.0	35.8	HF Checkpoints 🤗
DOSOD-S	O365+GoldG	640	26.7	19.9	25.1	29.3	HF Checkpoints 🤗
DOSOD-M	O365+GoldG	640	31.3	25.7	29.6	33.7	HF Checkpoints 🤗
DOSOD-L	O365+GoldG	640	34.4	29.1	32.6	36.6	HF Checkpoints 🤗

NOTE: For YOLO-Worldv1, the results obtained from the repo differ from those reported in the paper, so both are listed.

2.2 Zero-shot Inference on COCO dataset

model	Pre-train Data	Size	AP	AP₅₀	AP₇₅
YOLO-Worldv1-S (paper)	O365+GoldG	640	37.6	52.3	40.7
YOLO-Worldv1-M (paper)	O365+GoldG	640	42.8	58.3	46.4
YOLO-Worldv1-L (paper)	O365+GoldG	640	44.4	59.8	48.3
YOLO-Worldv2-S	O365+GoldG	640	37.5	52.0	40.7
YOLO-Worldv2-M	O365+GoldG	640	42.8	58.2	46.7
YOLO-Worldv2-L	O365+GoldG	640	45.4	61.0	49.4
DOSOD-S	O365+GoldG	640	36.1	51.0	39.1
DOSOD-M	O365+GoldG	640	41.7	57.1	45.2
DOSOD-L	O365+GoldG	640	44.6	60.5	48.4

2.3 Latency On RTX 4090

We use the trtexec tool from TensorRT 8.6.1.6 to measure latency in FP16 mode. All models are re-parameterized with the 80 COCO categories. Click an FPS value to view the corresponding log.

model	Params	FPS
YOLO-Worldv1-S	13.32M	1007
YOLO-Worldv1-M	28.93M	702
YOLO-Worldv1-L	47.38M	494
YOLO-Worldv2-S	12.66M	1221
YOLO-Worldv2-M	28.20M	771
YOLO-Worldv2-L	46.62M	553
DOSOD-S	11.48M	1582
DOSOD-M	26.31M	922
DOSOD-L	44.19M	632

NOTE: FPS = 1000 / GPU Compute Time[mean], where the mean GPU compute time is in milliseconds.
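The conversion in the note can be sketched in a few lines of Python. The helper names are hypothetical; the FPS values are copied from the DOSOD rows of the table above, and the latencies are simply what those values imply under the stated formula, not separately measured numbers.

```python
def fps_from_mean_ms(mean_ms: float) -> float:
    """Convert a mean GPU compute time in milliseconds to frames per second."""
    if mean_ms <= 0:
        raise ValueError("mean latency must be positive")
    return 1000.0 / mean_ms


def mean_ms_from_fps(fps: float) -> float:
    """Inverse conversion: the mean latency in milliseconds implied by an FPS value."""
    if fps <= 0:
        raise ValueError("FPS must be positive")
    return 1000.0 / fps


# FPS values taken from the RTX 4090 table above.
table_fps = {"DOSOD-S": 1582, "DOSOD-M": 922, "DOSOD-L": 632}

# Mean per-frame latency implied by each FPS entry.
implied_ms = {name: mean_ms_from_fps(fps) for name, fps in table_fps.items()}

for name, ms in implied_ms.items():
    print(f"{name}: {ms:.3f} ms per frame")
```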

2.4 Latency On RDK X5

We evaluate the real-time performance of YOLO-Worldv2 and DOSOD on the D-Robotics RDK X5 development kit. The models are re-parameterized with the 1203 categories defined in LVIS. Each model is run on the RDK X5 with either 1 thread or 8 threads, in INT8 or INT16 quantization mode.

model	FPS (1 thread)	FPS (8 threads)
YOLO-Worldv2-S (INT16/INT8)	5.962/11.044	6.386/12.590
YOLO-Worldv2-M (INT16/INT8)	4.136/7.290	4.340/7.930
YOLO-Worldv2-L (INT16/INT8)	2.958/5.377	3.060/5.720
DOSOD-S (INT16/INT8)	12.527/31.020	14.657/47.328
DOSOD-M (INT16/INT8)	8.531/20.238	9.471/26.36
DOSOD-L (INT16/INT8)	5.663/12.799	6.069/14.939
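To make the comparison in the table easier to read, the following sketch computes the INT8 speedup of each DOSOD model over the YOLO-Worldv2 model of the same size. The FPS values are copied from the table; the dictionary layout and the `speedup` helper are hypothetical and exist only for this illustration.

```python
# INT8 FPS values copied from the RDK X5 table: (1 thread, 8 threads).
fps_int8 = {
    "YOLO-Worldv2-S": (11.044, 12.590),
    "YOLO-Worldv2-M": (7.290, 7.930),
    "YOLO-Worldv2-L": (5.377, 5.720),
    "DOSOD-S": (31.020, 47.328),
    "DOSOD-M": (20.238, 26.36),
    "DOSOD-L": (12.799, 14.939),
}


def speedup(size: str, column: int) -> float:
    """Ratio of DOSOD FPS to YOLO-Worldv2 FPS for one model size and thread setting."""
    return fps_int8[f"DOSOD-{size}"][column] / fps_int8[f"YOLO-Worldv2-{size}"][column]


# Column 0 is the 1-thread setting, column 1 is the 8-thread setting.
speedups = {size: (speedup(size, 0), speedup(size, 1)) for size in "SML"}

for size, (one_thread, eight_threads) in speedups.items():
    print(f"{size}: {one_thread:.2f}x (1 thread), {eight_threads:.2f}x (8 threads)")
```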
