# DOSOD: A Light-Weight Framework for Open-Set Object Detection with Decoupled Feature Alignment in Joint Space


Yonghao He<sup>1,*,🌟</sup>, Hu Su<sup>2,*,📧</sup>, Haiyong Yu<sup>1,*</sup>, Cong Yang<sup>3</sup>, Wei Sui<sup>1</sup>, Cong Wang<sup>1</sup>, Song Liu<sup>4,📧</sup>

\* Equal contribution, 🌟 Project lead, 📧 Corresponding author

<sup>1</sup> D-Robotics, <sup>2</sup> State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation, Chinese Academy of Sciences, <sup>3</sup> BeeLab, School of Future Science and Engineering, Soochow University, <sup>4</sup> School of Information Science and Technology, ShanghaiTech University

[![arxiv paper](https://img.shields.io/badge/arXiv-Paper-red)](https://arxiv.org/abs/2412.14680) [![license](https://img.shields.io/badge/License-GPLv3.0-blue)](LICENSE)
## 1. Introduction

### 1.1 Brief Introduction of DOSOD

Since YOLO-World established a new SOTA in open-vocabulary object detection, real-time open-vocabulary detection has attracted significant attention and has been applied in a wide range of scenarios. In our paper, Decoupled Open-Set Object Detection (**DOSOD**) is proposed as a practical and highly efficient solution for real-time OSOD tasks in robotic systems. Specifically, DOSOD builds on the YOLO-World pipeline by integrating a vision-language model (VLM) with a detector. A Multilayer Perceptron (MLP) adaptor transforms the text embeddings extracted by the VLM into a joint space, in which the detector learns region representations of class-agnostic proposals. Cross-modality features are aligned directly in the joint space, avoiding complex feature interactions and thereby improving computational efficiency. At test time, DOSOD behaves like a traditional closed-set detector, effectively bridging the gap between closed-set and open-set detection. A minimal sketch of this decoupled alignment is given below.
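To make the decoupled design concrete, the following is a minimal PyTorch sketch of the idea: a small MLP adaptor maps frozen VLM text embeddings into the joint space, where classification reduces to a cosine similarity against class-agnostic region features. All names and dimensions here are illustrative, not the repository's actual API; the `mlp3x` in the released checkpoint names suggests a 3-layer adaptor, which is what the sketch assumes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLPAdaptor(nn.Module):
    """Illustrative MLP adaptor: maps frozen VLM text embeddings
    into the joint space shared with region features."""
    def __init__(self, text_dim: int = 512, joint_dim: int = 512, num_layers: int = 3):
        super().__init__()
        layers, dim = [], text_dim
        for _ in range(num_layers - 1):
            layers += [nn.Linear(dim, joint_dim), nn.ReLU(inplace=True)]
            dim = joint_dim
        layers.append(nn.Linear(dim, joint_dim))
        self.mlp = nn.Sequential(*layers)

    def forward(self, text_emb: torch.Tensor) -> torch.Tensor:
        return self.mlp(text_emb)

# Decoupled alignment: text and region features never interact inside the
# detector backbone; they only meet via similarity scores in the joint space.
adaptor = MLPAdaptor()
text_emb = torch.randn(80, 512)      # embeddings of 80 class prompts from a frozen VLM
region_feat = torch.randn(100, 512)  # class-agnostic proposal features from the detector

class_w = F.normalize(adaptor(text_emb), dim=-1)  # (num_classes, joint_dim)
regions = F.normalize(region_feat, dim=-1)        # (num_proposals, joint_dim)
logits = regions @ class_w.t()                    # cosine-similarity classification scores
```

Because the two branches only meet at this final similarity step, the text side can be computed once per vocabulary and cached, which is what makes the design cheap at inference.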
## 2. Model Overview

Following YOLO-World, we pre-trained DOSOD-S/M/L from scratch on public datasets and conducted zero-shot evaluations on `LVIS minival` and `COCO val2017`. All pre-trained models are released.

### 2.1 Zero-shot Evaluation on LVIS minival

| model | Pre-train Data | Size | AP<sup>mini</sup> | AP<sup>r</sup> | AP<sup>c</sup> | AP<sup>f</sup> | weights |
|:-----:|:---------------|:-----|:-----------------:|:--------------:|:--------------:|:--------------:|:-------:|
| YOLO-Worldv1-S (repo) | O365+GoldG | 640 | 24.3 | 16.6 | 22.1 | 27.7 | [HF Checkpoints 🤗](https://huggingface.co/wondervictor/YOLO-World/blob/main/yolo_world_v2_s_obj365v1_goldg_pretrain-55b943ea.pth) |
| YOLO-Worldv1-M (repo) | O365+GoldG | 640 | 28.6 | 19.7 | 26.6 | 31.9 | [HF Checkpoints 🤗](https://huggingface.co/wondervictor/YOLO-World/blob/main/yolo_world_v2_m_obj365v1_goldg_pretrain-c6237d5b.pth) |
| YOLO-Worldv1-L (repo) | O365+GoldG | 640 | 32.5 | 22.3 | 30.6 | 36.1 | [HF Checkpoints 🤗](https://huggingface.co/wondervictor/YOLO-World/blob/main/yolo_world_v2_l_obj365v1_goldg_pretrain-a82b1fe3.pth) |
| YOLO-Worldv1-S (paper) | O365+GoldG | 640 | 26.2 | 19.1 | 23.6 | 29.8 | [HF Checkpoints 🤗](https://huggingface.co/wondervictor/YOLO-World/blob/main/yolo_world_v2_s_obj365v1_goldg_pretrain-55b943ea.pth) |
| YOLO-Worldv1-M (paper) | O365+GoldG | 640 | 31.0 | 23.8 | 29.2 | 33.9 | [HF Checkpoints 🤗](https://huggingface.co/wondervictor/YOLO-World/blob/main/yolo_world_v2_m_obj365v1_goldg_pretrain-c6237d5b.pth) |
| YOLO-Worldv1-L (paper) | O365+GoldG | 640 | 35.0 | 27.1 | 32.8 | 38.3 | [HF Checkpoints 🤗](https://huggingface.co/wondervictor/YOLO-World/blob/main/yolo_world_v2_l_obj365v1_goldg_pretrain-a82b1fe3.pth) |
| YOLO-Worldv2-S | O365+GoldG | 640 | 22.7 | 16.3 | 20.8 | 25.5 | [HF Checkpoints 🤗](https://huggingface.co/wondervictor/YOLO-World/blob/main/yolo_world_v2_s_obj365v1_goldg_pretrain-55b943ea.pth) |
| YOLO-Worldv2-M | O365+GoldG | 640 | 30.0 | 25.0 | 27.2 | 33.4 | [HF Checkpoints 🤗](https://huggingface.co/wondervictor/YOLO-World/blob/main/yolo_world_v2_m_obj365v1_goldg_pretrain-c6237d5b.pth) |
| YOLO-Worldv2-L | O365+GoldG | 640 | 33.0 | 22.6 | 32.0 | 35.8 | [HF Checkpoints 🤗](https://huggingface.co/wondervictor/YOLO-World/blob/main/yolo_world_v2_l_obj365v1_goldg_pretrain-a82b1fe3.pth) |
| DOSOD-S | O365+GoldG | 640 | 26.7 | 19.9 | 25.1 | 29.3 | [HF Checkpoints 🤗](https://huggingface.co/D-Robotics/DOSOD/blob/main/dosod_mlp3x_s.pth) |
| DOSOD-M | O365+GoldG | 640 | 31.3 | 25.7 | 29.6 | 33.7 | [HF Checkpoints 🤗](https://huggingface.co/D-Robotics/DOSOD/blob/main/dosod_mlp3x_m.pth) |
| DOSOD-L | O365+GoldG | 640 | 34.4 | 29.1 | 32.6 | 36.6 | [HF Checkpoints 🤗](https://huggingface.co/D-Robotics/DOSOD/blob/main/dosod_mlp3x_l.pth) |

> NOTE: The results of YOLO-Worldv1 reported in the repo and in the [paper](https://arxiv.org/abs/2401.17270) differ, so both are listed.
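The released DOSOD weights can be pulled straight from the Hugging Face Hub. Below is a minimal sketch using `huggingface_hub`; the repo id and filename are taken from the weight links in the table above, while the exact checkpoint layout (e.g. a nested `state_dict` key) depends on the training framework and is not specified here.

```python
import torch
from huggingface_hub import hf_hub_download

# Repo id and filename as they appear in the weight links above.
ckpt_path = hf_hub_download(repo_id="D-Robotics/DOSOD", filename="dosod_mlp3x_l.pth")

ckpt = torch.load(ckpt_path, map_location="cpu")
# Peek at the top-level keys; the layout depends on the training framework.
print(list(ckpt.keys())[:5])
```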
### 2.2 Zero-shot Inference on COCO val2017
| model | Pre-train Data | Size | AP | AP<sup>50</sup> | AP<sup>75</sup> |
|:-----:|:---------------|:-----|:----:|:---------------:|:---------------:|
| YOLO-Worldv1-S (paper) | O365+GoldG | 640 | 37.6 | 52.3 | 40.7 |
| YOLO-Worldv1-M (paper) | O365+GoldG | 640 | 42.8 | 58.3 | 46.4 |
| YOLO-Worldv1-L (paper) | O365+GoldG | 640 | 44.4 | 59.8 | 48.3 |
| YOLO-Worldv2-S | O365+GoldG | 640 | 37.5 | 52.0 | 40.7 |
| YOLO-Worldv2-M | O365+GoldG | 640 | 42.8 | 58.2 | 46.7 |
| YOLO-Worldv2-L | O365+GoldG | 640 | 45.4 | 61.0 | 49.4 |
| DOSOD-S | O365+GoldG | 640 | 36.1 | 51.0 | 39.1 |
| DOSOD-M | O365+GoldG | 640 | 41.7 | 57.1 | 45.2 |
| DOSOD-L | O365+GoldG | 640 | 44.6 | 60.5 | 48.4 |
### 2.3 Latency on RTX 4090

We use the `trtexec` tool from [TensorRT 8.6.1.6](https://developer.nvidia.com/tensorrt) to measure latency in FP16 mode. All models are re-parameterized with the 80 categories from COCO. Log info can be found by clicking the FPS values.

| model | Params | FPS |
|:--------------:|:------:|:----:|
| YOLO-Worldv1-S | 13.32M | 1007 |
| YOLO-Worldv1-M | 28.93M | 702 |
| YOLO-Worldv1-L | 47.38M | 494 |
| YOLO-Worldv2-S | 12.66M | 1221 |
| YOLO-Worldv2-M | 28.20M | 771 |
| YOLO-Worldv2-L | 46.62M | 553 |
| DOSOD-S | 11.48M | 1582 |
| DOSOD-M | 26.31M | 922 |
| DOSOD-L | 44.19M | 632 |

> NOTE: FPS = 1000 / GPU Compute Time [mean, in ms]; e.g., DOSOD-S with a mean GPU compute time of about 0.632 ms gives 1000 / 0.632 ≈ 1582 FPS.
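Re-parameterization here means baking the adapted text embeddings of a fixed vocabulary (the 80 COCO classes above, or the 1203 LVIS classes in Section 2.4) into a constant linear classification head, so the exported model carries no text branch at inference time. The sketch below illustrates the idea under the same illustrative names as in Section 1.1; it is not the repository's actual export code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

@torch.no_grad()
def reparameterize(adaptor: nn.Module, text_emb: torch.Tensor) -> nn.Linear:
    """Bake a fixed vocabulary into a linear classification head.

    text_emb: (num_classes, text_dim) embeddings from the frozen VLM,
    e.g. 80 COCO classes or the 1203 LVIS classes.
    """
    class_w = F.normalize(adaptor(text_emb), dim=-1)   # (num_classes, joint_dim)
    head = nn.Linear(class_w.shape[1], class_w.shape[0], bias=False)
    head.weight.copy_(class_w)  # logits = normalized region features @ class_w.T
    return head

# After this step the text encoder and MLP adaptor are no longer needed;
# the detector can be exported (e.g. to ONNX/TensorRT) as a closed-set model.
```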
### 2.4 Latency on RDK X5

We evaluate the real-time performance of the YOLO-Worldv2 models and our DOSOD models on the [D-Robotics RDK X5](https://d-robotics.cc/rdkx5) development kit. The models are re-parameterized with the 1203 categories defined in LVIS and run with either 1 thread or 8 threads under INT16 or INT8 quantization.

| model | FPS (1 thread, INT16/INT8) | FPS (8 threads, INT16/INT8) |
|:--------------:|:--------------:|:---------------:|
| YOLO-Worldv2-S | 5.962/11.044 | 6.386/12.590 |
| YOLO-Worldv2-M | 4.136/7.290 | 4.340/7.930 |
| YOLO-Worldv2-L | 2.958/5.377 | 3.060/5.720 |
| DOSOD-S | 12.527/31.020 | 14.657/47.328 |
| DOSOD-M | 8.531/20.238 | 9.471/26.36 |
| DOSOD-L | 5.663/12.799 | 6.069/14.939 |