# DOSOD: A Light-Weight Framework for Open-Set Object Detection with Decoupled Feature Alignment in Joint Space


Yonghao He<sup>1,*,🌟</sup>, Hu Su<sup>2,*,📧</sup>, Haiyong Yu<sup>1,*</sup>, Cong Yang<sup>3</sup>, Wei Sui<sup>1</sup>, Cong Wang<sup>1</sup>, Song Liu<sup>4,📧</sup>

\* Equal contribution, 🌟 Project lead, 📧 Corresponding author

<sup>1</sup> D-Robotics, <sup>2</sup> State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation, Chinese Academy of Sciences, <sup>3</sup> BeeLab, School of Future Science and Engineering, Soochow University, <sup>4</sup> School of Information Science and Technology, ShanghaiTech University

[![arxiv paper](https://img.shields.io/badge/arXiv-Paper-red)](https://arxiv.org/abs/2412.14680) [![license](https://img.shields.io/badge/License-GPLv3.0-blue)](LICENSE)
## 1. Introduction

### 1.1 Brief Introduction of DOSOD

Since YOLO-World established a new SOTA in open-vocabulary object detection, real-time open-vocabulary detection has attracted significant attention and has been applied in a wide range of scenarios. In our paper, Decoupled Open-Set Object Detection (**DOSOD**) is proposed as a practical and highly efficient solution for real-time OSOD tasks in robotic systems. Specifically, DOSOD builds on the YOLO-World pipeline by integrating a vision-language model (VLM) with a detector. A Multilayer Perceptron (MLP) adaptor transforms the text embeddings extracted by the VLM into a joint space, in which the detector learns region representations of class-agnostic proposals. Cross-modality features are aligned directly in the joint space, avoiding complex feature interactions and thereby improving computational efficiency. At test time, DOSOD behaves like a traditional closed-set detector, effectively bridging the gap between closed-set and open-set detection. A minimal sketch of this decoupled alignment is given below.
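To make the decoupled design concrete, the following is a minimal PyTorch sketch of the idea: a small MLP adaptor maps frozen VLM text embeddings into the joint space, where classification reduces to a cosine similarity against class-agnostic region features. All names and dimensions here are illustrative, not the repository's actual API; the `mlp3x` in the released checkpoint names suggests a 3-layer adaptor, which is what the sketch assumes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLPAdaptor(nn.Module):
    """Illustrative MLP adaptor: maps frozen VLM text embeddings
    into the joint space shared with region features."""
    def __init__(self, text_dim: int = 512, joint_dim: int = 512, num_layers: int = 3):
        super().__init__()
        layers, dim = [], text_dim
        for _ in range(num_layers - 1):
            layers += [nn.Linear(dim, joint_dim), nn.ReLU(inplace=True)]
            dim = joint_dim
        layers.append(nn.Linear(dim, joint_dim))
        self.mlp = nn.Sequential(*layers)

    def forward(self, text_emb: torch.Tensor) -> torch.Tensor:
        return self.mlp(text_emb)

# Decoupled alignment: text and region features never interact inside the
# detector backbone; they only meet via similarity scores in the joint space.
adaptor = MLPAdaptor()
text_emb = torch.randn(80, 512)      # embeddings of 80 class prompts from a frozen VLM
region_feat = torch.randn(100, 512)  # class-agnostic proposal features from the detector

class_w = F.normalize(adaptor(text_emb), dim=-1)  # (num_classes, joint_dim)
regions = F.normalize(region_feat, dim=-1)        # (num_proposals, joint_dim)
logits = regions @ class_w.t()                    # cosine-similarity classification scores
```

Because the two branches only meet at this final similarity step, the text side can be computed once per vocabulary and cached, which is what makes the design cheap at inference.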
## 2. Model Overview

Following YOLO-World, we pre-trained DOSOD-S/M/L from scratch on public datasets and conducted zero-shot evaluations on `LVIS minival` and `COCO val2017`. All pre-trained models are released.

### 2.1 Zero-shot Evaluation on LVIS minival

| model | Pre-train Data | Size | AP<sup>mini</sup> | AP<sup>r</sup> | AP<sup>c</sup> | AP<sup>f</sup> | weights |
|:-----:|:---------------|:-----|:-----------------:|:--------------:|:--------------:|:--------------:|:-------:|
| YOLO-Worldv1-S (repo) | O365+GoldG | 640 | 24.3 | 16.6 | 22.1 | 27.7 | [HF Checkpoints 🤗](https://huggingface.co/wondervictor/YOLO-World/blob/main/yolo_world_v2_s_obj365v1_goldg_pretrain-55b943ea.pth) |
| YOLO-Worldv1-M (repo) | O365+GoldG | 640 | 28.6 | 19.7 | 26.6 | 31.9 | [HF Checkpoints 🤗](https://huggingface.co/wondervictor/YOLO-World/blob/main/yolo_world_v2_m_obj365v1_goldg_pretrain-c6237d5b.pth) |
| YOLO-Worldv1-L (repo) | O365+GoldG | 640 | 32.5 | 22.3 | 30.6 | 36.1 | [HF Checkpoints 🤗](https://huggingface.co/wondervictor/YOLO-World/blob/main/yolo_world_v2_l_obj365v1_goldg_pretrain-a82b1fe3.pth) |
| YOLO-Worldv1-S (paper) | O365+GoldG | 640 | 26.2 | 19.1 | 23.6 | 29.8 | [HF Checkpoints 🤗](https://huggingface.co/wondervictor/YOLO-World/blob/main/yolo_world_v2_s_obj365v1_goldg_pretrain-55b943ea.pth) |
| YOLO-Worldv1-M (paper) | O365+GoldG | 640 | 31.0 | 23.8 | 29.2 | 33.9 | [HF Checkpoints 🤗](https://huggingface.co/wondervictor/YOLO-World/blob/main/yolo_world_v2_m_obj365v1_goldg_pretrain-c6237d5b.pth) |
| YOLO-Worldv1-L (paper) | O365+GoldG | 640 | 35.0 | 27.1 | 32.8 | 38.3 | [HF Checkpoints 🤗](https://huggingface.co/wondervictor/YOLO-World/blob/main/yolo_world_v2_l_obj365v1_goldg_pretrain-a82b1fe3.pth) |
| YOLO-Worldv2-S | O365+GoldG | 640 | 22.7 | 16.3 | 20.8 | 25.5 | [HF Checkpoints 🤗](https://huggingface.co/wondervictor/YOLO-World/blob/main/yolo_world_v2_s_obj365v1_goldg_pretrain-55b943ea.pth) |
| YOLO-Worldv2-M | O365+GoldG | 640 | 30.0 | 25.0 | 27.2 | 33.4 | [HF Checkpoints 🤗](https://huggingface.co/wondervictor/YOLO-World/blob/main/yolo_world_v2_m_obj365v1_goldg_pretrain-c6237d5b.pth) |
| YOLO-Worldv2-L | O365+GoldG | 640 | 33.0 | 22.6 | 32.0 | 35.8 | [HF Checkpoints 🤗](https://huggingface.co/wondervictor/YOLO-World/blob/main/yolo_world_v2_l_obj365v1_goldg_pretrain-a82b1fe3.pth) |
| DOSOD-S | O365+GoldG | 640 | 26.7 | 19.9 | 25.1 | 29.3 | [HF Checkpoints 🤗](https://huggingface.co/D-Robotics/DOSOD/blob/main/dosod_mlp3x_s.pth) |
| DOSOD-M | O365+GoldG | 640 | 31.3 | 25.7 | 29.6 | 33.7 | [HF Checkpoints 🤗](https://huggingface.co/D-Robotics/DOSOD/blob/main/dosod_mlp3x_m.pth) |
| DOSOD-L | O365+GoldG | 640 | 34.4 | 29.1 | 32.6 | 36.6 | [HF Checkpoints 🤗](https://huggingface.co/D-Robotics/DOSOD/blob/main/dosod_mlp3x_l.pth) |

> NOTE: The results of YOLO-Worldv1 reported in the repo and in the [paper](https://arxiv.org/abs/2401.17270) differ, so both are listed.
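The released DOSOD weights can be pulled straight from the Hugging Face Hub. Below is a minimal sketch using `huggingface_hub`; the repo id and filename are taken from the weight links in the table above, while the exact checkpoint layout (e.g. a nested `state_dict` key) depends on the training framework and is not specified here.

```python
import torch
from huggingface_hub import hf_hub_download

# Repo id and filename as they appear in the weight links above.
ckpt_path = hf_hub_download(repo_id="D-Robotics/DOSOD", filename="dosod_mlp3x_l.pth")

ckpt = torch.load(ckpt_path, map_location="cpu")
# Peek at the top-level keys; the layout depends on the training framework.
print(list(ckpt.keys())[:5])
```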
### 2.2 Zero-shot Inference on COCO val2017
| model | Pre-train Data | Size | AP | AP<sup>50</sup> | AP<sup>75</sup> |
|:-----:|:---------------|:-----|:----:|:---------------:|:---------------:|
| YOLO-Worldv1-S (paper) | O365+GoldG | 640 | 37.6 | 52.3 | 40.7 |
| YOLO-Worldv1-M (paper) | O365+GoldG | 640 | 42.8 | 58.3 | 46.4 |
| YOLO-Worldv1-L (paper) | O365+GoldG | 640 | 44.4 | 59.8 | 48.3 |
| YOLO-Worldv2-S | O365+GoldG | 640 | 37.5 | 52.0 | 40.7 |
| YOLO-Worldv2-M | O365+GoldG | 640 | 42.8 | 58.2 | 46.7 |
| YOLO-Worldv2-L | O365+GoldG | 640 | 45.4 | 61.0 | 49.4 |
| DOSOD-S | O365+GoldG | 640 | 36.1 | 51.0 | 39.1 |
| DOSOD-M | O365+GoldG | 640 | 41.7 | 57.1 | 45.2 |
| DOSOD-L | O365+GoldG | 640 | 44.6 | 60.5 | 48.4 |
### 2.3 Latency on RTX 4090

We use the `trtexec` tool from [TensorRT 8.6.1.6](https://developer.nvidia.com/tensorrt) to measure latency in FP16 mode. All models are re-parameterized with the 80 categories from COCO. Log info can be found by clicking the FPS values.

| model | Params | FPS |
|:--------------:|:------:|:----:|
| YOLO-Worldv1-S | 13.32M | 1007 |
| YOLO-Worldv1-M | 28.93M | 702 |
| YOLO-Worldv1-L | 47.38M | 494 |
| YOLO-Worldv2-S | 12.66M | 1221 |
| YOLO-Worldv2-M | 28.20M | 771 |
| YOLO-Worldv2-L | 46.62M | 553 |
| DOSOD-S | 11.48M | 1582 |
| DOSOD-M | 26.31M | 922 |
| DOSOD-L | 44.19M | 632 |

> NOTE: FPS = 1000 / GPU Compute Time [mean, in ms]; e.g., DOSOD-S with a mean GPU compute time of about 0.632 ms gives 1000 / 0.632 ≈ 1582 FPS.
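Re-parameterization here means baking the adapted text embeddings of a fixed vocabulary (the 80 COCO classes above, or the 1203 LVIS classes in Section 2.4) into a constant linear classification head, so the exported model carries no text branch at inference time. The sketch below illustrates the idea under the same illustrative names as in Section 1.1; it is not the repository's actual export code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

@torch.no_grad()
def reparameterize(adaptor: nn.Module, text_emb: torch.Tensor) -> nn.Linear:
    """Bake a fixed vocabulary into a linear classification head.

    text_emb: (num_classes, text_dim) embeddings from the frozen VLM,
    e.g. 80 COCO classes or the 1203 LVIS classes.
    """
    class_w = F.normalize(adaptor(text_emb), dim=-1)   # (num_classes, joint_dim)
    head = nn.Linear(class_w.shape[1], class_w.shape[0], bias=False)
    head.weight.copy_(class_w)  # logits = normalized region features @ class_w.T
    return head

# After this step the text encoder and MLP adaptor are no longer needed;
# the detector can be exported (e.g. to ONNX/TensorRT) as a closed-set model.
```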
### 2.4 Latency on RDK X5

We evaluate the real-time performance of the YOLO-Worldv2 models and our DOSOD models on the [D-Robotics RDK X5](https://d-robotics.cc/rdkx5) development kit. The models are re-parameterized with the 1203 categories defined in LVIS and run with either 1 thread or 8 threads under INT16 or INT8 quantization.

| model | FPS (1 thread, INT16/INT8) | FPS (8 threads, INT16/INT8) |
|:--------------:|:--------------:|:---------------:|
| YOLO-Worldv2-S | 5.962/11.044 | 6.386/12.590 |
| YOLO-Worldv2-M | 4.136/7.290 | 4.340/7.930 |
| YOLO-Worldv2-L | 2.958/5.377 | 3.060/5.720 |
| DOSOD-S | 12.527/31.020 | 14.657/47.328 |
| DOSOD-M | 8.531/20.238 | 9.471/26.36 |
| DOSOD-L | 5.663/12.799 | 6.069/14.939 |