	<div align="center">
	<br>
	<h1>DOSOD<br>
	A Light-Weight Framework for Open-Set Object Detection with Decoupled Feature Alignment in Joint Space
	</h1>
	<br>
	<a href="https://github.com/YonghaoHe">Yonghao He</a><sup><span>1,*,🌟 </span></sup>,
	<a href="https://people.ucas.edu.cn/~suhu">Hu Su</a><sup><span>2,*,📧</span></sup>,
	<a href="https://github.com/HarveyYesan">Haiyong Yu</a><sup><span>1,*</span></sup>,
	<a href="https://cong-yang.github.io/">Cong Yang</a><sup><span>3</span></sup>,
	<a href="">Wei Sui</a><sup><span>1</span></sup>,
	<a href="">Cong Wang</a><sup><span>1</span></sup>,
	<a href="www.amnrlab.org">Song Liu</a><sup><span>4,📧</span></sup>
	<br>

	\* Equal contribution, 🌟 Project lead, 📧 Corresponding author

	<sup>1</sup> D-Robotics, <br>
	<sup>2</sup> State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation, Chinese Academy of Sciences,<br>
	<sup>3</sup> BeeLab, School of Future Science and Engineering, Soochow University, <br>
	<sup>4</sup> School of Information Science and Technology, ShanghaiTech University

	[![arxiv paper](https://img.shields.io/badge/arXiv-Paper-red)](https://arxiv.org/abs/2412.14680)
	[![license](https://img.shields.io/badge/License-GPLv3.0-blue)](LICENSE)
	</div>

	## 1. Introduction

	### 1.1 Brief Introduction of DOSOD

	YOLO-World set a new state of the art in open-vocabulary object detection, and open-vocabulary detectors have since been applied in a wide range of scenarios, with real-time operation attracting particular attention.
	In our paper, we propose Decoupled Open-Set Object Detection (DOSOD), a practical and highly efficient solution for real-time open-set object detection (OSOD) in robotic systems.
	DOSOD builds on the YOLO-World pipeline by integrating a vision-language model (VLM) with a detector.
	A Multilayer Perceptron (MLP) adaptor projects the text embeddings extracted by the VLM into a joint space,
	in which the detector learns the region representations of class-agnostic proposals.
	Cross-modality features are aligned directly in the joint space,
	which avoids complex feature interactions and thereby improves computational efficiency.
	At test time, DOSOD operates like a traditional closed-set detector,
	effectively bridging the gap between closed-set and open-set detection.
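
The decoupled alignment described above can be illustrated with a small NumPy sketch. Everything in it is an assumption made for illustration: the embedding widths, the two-layer adaptor, the random stand-ins for the VLM and detector outputs, and the names `MLPAdaptor` and `l2_normalize` are hypothetical and are not taken from the DOSOD code base. The point it tries to show is that the text branch can be projected once, offline, after which classification reduces to a single matrix product, as in a closed-set head.

```python
import numpy as np

rng = np.random.default_rng(0)

# All sizes below are illustrative assumptions, not values from the paper.
TEXT_DIM = 512    # assumed width of the VLM text embedding
JOINT_DIM = 256   # assumed width of the joint space
NUM_CLASSES = 3   # e.g. three text prompts in the vocabulary
NUM_REGIONS = 5   # class-agnostic proposals produced by the detector


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


class MLPAdaptor:
    """Hypothetical two-layer MLP mapping text embeddings into the joint space."""

    def __init__(self, in_dim, hidden_dim, out_dim):
        self.w1 = rng.standard_normal((in_dim, hidden_dim)) * 0.02
        self.b1 = np.zeros(hidden_dim)
        self.w2 = rng.standard_normal((hidden_dim, out_dim)) * 0.02
        self.b2 = np.zeros(out_dim)

    def __call__(self, text_emb):
        hidden = np.maximum(text_emb @ self.w1 + self.b1, 0.0)  # ReLU
        return hidden @ self.w2 + self.b2


# Random stand-ins for the VLM text encoder output and the region features.
text_embeddings = rng.standard_normal((NUM_CLASSES, TEXT_DIM))
region_features = rng.standard_normal((NUM_REGIONS, JOINT_DIM))

adaptor = MLPAdaptor(TEXT_DIM, 512, JOINT_DIM)

# Offline step: project the vocabulary once and freeze it as classifier weights.
class_weights = l2_normalize(adaptor(text_embeddings))    # (NUM_CLASSES, JOINT_DIM)

# Online step: one matrix product, with no cross-modal interaction at test time.
logits = l2_normalize(region_features) @ class_weights.T  # (NUM_REGIONS, NUM_CLASSES)
predicted_class = logits.argmax(axis=1)
```

Because `class_weights` depends only on the vocabulary, it can be computed ahead of deployment and stored with the detector, so neither the text encoder nor the adaptor has to run on the target device.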

	## 2. Model Overview

	Following YOLO-World, we pre-trained DOSOD-S/M/L from scratch on public datasets and ran zero-shot evaluation on `LVIS minival` and `COCO val2017`.
	All pre-trained models are released.

	### 2.1 Zero-shot Evaluation on LVIS minival

	<div><font size=2>

	| Model | Pre-train Data | Size | AP<sup>mini</sup> | AP<sub>r</sub> | AP<sub>c</sub> | AP<sub>f</sub> | Weights |
	|:---:|:---|:---|:---:|:---:|:---:|:---:|:---:|
	| <div style="text-align: center;">[YOLO-Worldv1-S]()<br>(repo)</div> | O365+GoldG | 640 | 24.3 | 16.6 | 22.1 | 27.7 | [HF Checkpoints 🤗](https://huggingface.co/wondervictor/YOLO-World/blob/main/yolo_world_v2_s_obj365v1_goldg_pretrain-55b943ea.pth) |
	| <div style="text-align: center;">[YOLO-Worldv1-M]()<br>(repo)</div> | O365+GoldG | 640 | 28.6 | 19.7 | 26.6 | 31.9 | [HF Checkpoints 🤗](https://huggingface.co/wondervictor/YOLO-World/blob/main/yolo_world_v2_m_obj365v1_goldg_pretrain-c6237d5b.pth) |
	| <div style="text-align: center;">[YOLO-Worldv1-L]()<br>(repo)</div> | O365+GoldG | 640 | 32.5 | 22.3 | 30.6 | 36.1 | [HF Checkpoints 🤗](https://huggingface.co/wondervictor/YOLO-World/blob/main/yolo_world_v2_l_obj365v1_goldg_pretrain-a82b1fe3.pth) |
	| <div style="text-align: center;">[YOLO-Worldv1-S]()<br>(paper)</div> | O365+GoldG | 640 | 26.2 | 19.1 | 23.6 | 29.8 | [HF Checkpoints 🤗](https://huggingface.co/wondervictor/YOLO-World/blob/main/yolo_world_v2_s_obj365v1_goldg_pretrain-55b943ea.pth) |
	| <div style="text-align: center;">[YOLO-Worldv1-M]()<br>(paper)</div> | O365+GoldG | 640 | 31.0 | 23.8 | 29.2 | 33.9 | [HF Checkpoints 🤗](https://huggingface.co/wondervictor/YOLO-World/blob/main/yolo_world_v2_m_obj365v1_goldg_pretrain-c6237d5b.pth) |
	| <div style="text-align: center;">[YOLO-Worldv1-L]()<br>(paper)</div> | O365+GoldG | 640 | 35.0 | 27.1 | 32.8 | 38.3 | [HF Checkpoints 🤗](https://huggingface.co/wondervictor/YOLO-World/blob/main/yolo_world_v2_l_obj365v1_goldg_pretrain-a82b1fe3.pth) |
	| [YOLO-Worldv2-S]() | O365+GoldG | 640 | 22.7 | 16.3 | 20.8 | 25.5 | [HF Checkpoints 🤗](https://huggingface.co/wondervictor/YOLO-World/blob/main/yolo_world_v2_s_obj365v1_goldg_pretrain-55b943ea.pth) |
	| [YOLO-Worldv2-M]() | O365+GoldG | 640 | 30.0 | 25.0 | 27.2 | 33.4 | [HF Checkpoints 🤗](https://huggingface.co/wondervictor/YOLO-World/blob/main/yolo_world_v2_m_obj365v1_goldg_pretrain-c6237d5b.pth) |
	| [YOLO-Worldv2-L]() | O365+GoldG | 640 | 33.0 | 22.6 | 32.0 | 35.8 | [HF Checkpoints 🤗](https://huggingface.co/wondervictor/YOLO-World/blob/main/yolo_world_v2_l_obj365v1_goldg_pretrain-a82b1fe3.pth) |
	| [DOSOD-S]() | O365+GoldG | 640 | 26.7 | 19.9 | 25.1 | 29.3 | [HF Checkpoints 🤗](https://huggingface.co/D-Robotics/DOSOD/blob/main/dosod_mlp3x_s.pth) |
	| [DOSOD-M]() | O365+GoldG | 640 | 31.3 | 25.7 | 29.6 | 33.7 | [HF Checkpoints 🤗](https://huggingface.co/D-Robotics/DOSOD/blob/main/dosod_mlp3x_m.pth) |
	| [DOSOD-L]() | O365+GoldG | 640 | 34.4 | 29.1 | 32.6 | 36.6 | [HF Checkpoints 🤗](https://huggingface.co/D-Robotics/DOSOD/blob/main/dosod_mlp3x_l.pth) |

	> NOTE: The YOLO-Worldv1 results obtained from the repo differ from those reported in the [paper](https://arxiv.org/abs/2401.17270), so both are listed.

	</font>
	</div>
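
The checkpoint links in the table point at Hub web pages (`/blob/`), which serve an HTML view rather than the file itself. A minimal sketch for turning such a link into a direct download link, assuming the Hub's usual `/resolve/` URL layout; the helper names `blob_to_resolve` and `download_checkpoint` are made up for this example and are not part of any library.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.request import urlretrieve


def blob_to_resolve(url: str) -> str:
    """Turn a Hub web-page link (/blob/) into a direct file link (/resolve/)."""
    parts = urlsplit(url)
    segments = parts.path.split("/")
    # Expected path layout: /<org>/<repo>/blob/<revision>/<file path>
    if len(segments) < 6 or segments[3] != "blob":
        raise ValueError(f"not a Hub blob URL: {url}")
    segments[3] = "resolve"
    return urlunsplit(parts._replace(path="/".join(segments)))


def download_checkpoint(blob_url: str, destination: str) -> str:
    """Fetch the checkpoint to `destination`; requires network access."""
    urlretrieve(blob_to_resolve(blob_url), destination)
    return destination


# Link copied from the DOSOD-S row of the table above.
DOSOD_S = "https://huggingface.co/D-Robotics/DOSOD/blob/main/dosod_mlp3x_s.pth"
print(blob_to_resolve(DOSOD_S))
```

Calling `download_checkpoint(DOSOD_S, "dosod_mlp3x_s.pth")` would then save the weights locally. Only the URL rewrite runs at import time, so the sketch can be tried without network access.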

	### 2.2 Zero-shot Inference on COCO dataset

	<div><font size=2>

	| Model | Pre-train Data | Size | AP | AP<sub>50</sub> | AP<sub>75</sub> |
	|:---:|:---|:---|:---:|:---:|:---:|
	| <div style="text-align: center;">[YOLO-Worldv1-S]()<br>(paper)</div> | O365+GoldG | 640 | 37.6 | 52.3 | 40.7 |
	| <div style="text-align: center;">[YOLO-Worldv1-M]()<br>(paper)</div> | O365+GoldG | 640 | 42.8 | 58.3 | 46.4 |
	| <div style="text-align: center;">[YOLO-Worldv1-L]()<br>(paper)</div> | O365+GoldG | 640 | 44.4 | 59.8 | 48.3 |
	| [YOLO-Worldv2-S]() | O365+GoldG | 640 | 37.5 | 52.0 | 40.7 |
	| [YOLO-Worldv2-M]() | O365+GoldG | 640 | 42.8 | 58.2 | 46.7 |
	| [YOLO-Worldv2-L]() | O365+GoldG | 640 | 45.4 | 61.0 | 49.4 |
	| [DOSOD-S]() | O365+GoldG | 640 | 36.1 | 51.0 | 39.1 |
	| [DOSOD-M]() | O365+GoldG | 640 | 41.7 | 57.1 | 45.2 |
	| [DOSOD-L]() | O365+GoldG | 640 | 44.6 | 60.5 | 48.4 |

	</font>
	</div>

	### 2.3 Latency on RTX 4090

	We use the `trtexec` tool from [TensorRT 8.6.1.6](https://developer.nvidia.com/tensorrt) to measure latency in FP16 mode.
	All models are re-parameterized with 80 categories from COCO.
	Log info can be found by clicking the FPS.

	| Model | Params | FPS |
	|:---:|:---:|:---:|
	| YOLO-Worldv1-S | 13.32M | 1007 |
	| YOLO-Worldv1-M | 28.93M | 702 |
	| YOLO-Worldv1-L | 47.38M | 494 |
	| YOLO-Worldv2-S | 12.66M | 1221 |
	| YOLO-Worldv2-M | 28.20M | 771 |
	| YOLO-Worldv2-L | 46.62M | 553 |
	| DOSOD-S | 11.48M | 1582 |
	| DOSOD-M | 26.31M | 922 |
	| DOSOD-L | 44.19M | 632 |

	> NOTE: FPS = 1000 / GPU Compute Time[mean], where the mean GPU compute time is the value in milliseconds reported by `trtexec`.
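
As a worked example of the conversion in the note, the following sketch maps between mean latency in milliseconds and FPS, and recovers the per-image latency implied by the table. The function names are hypothetical; the FPS values are copied from the table above.

```python
def fps_from_latency_ms(mean_latency_ms: float) -> float:
    """FPS = 1000 / mean GPU compute time in milliseconds."""
    if mean_latency_ms <= 0:
        raise ValueError("latency must be positive")
    return 1000.0 / mean_latency_ms


def latency_ms_from_fps(fps: float) -> float:
    """Inverse of the formula: mean latency in milliseconds for a given FPS."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


# FPS values copied from the RTX 4090 table (FP16, 80 COCO categories).
RTX4090_FPS = {
    "YOLO-Worldv2-S": 1221,
    "YOLO-Worldv2-M": 771,
    "YOLO-Worldv2-L": 553,
    "DOSOD-S": 1582,
    "DOSOD-M": 922,
    "DOSOD-L": 632,
}

for name, fps in RTX4090_FPS.items():
    print(f"{name}: {latency_ms_from_fps(fps):.3f} ms per image")

# Relative throughput of DOSOD-S over YOLO-Worldv2-S on this GPU.
speedup_s = RTX4090_FPS["DOSOD-S"] / RTX4090_FPS["YOLO-Worldv2-S"]
print(f"DOSOD-S vs YOLO-Worldv2-S: {speedup_s:.2f}x")
```

For instance, the 1582 FPS reported for DOSOD-S corresponds to a mean GPU compute time of roughly 0.63 ms per image.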

	### 2.4 Latency on RDK X5

	We evaluate the real-time performance of the YOLO-Worldv2 models and our DOSOD models on the [D-Robotics RDK X5](https://d-robotics.cc/rdkx5) development kit.
	The models are re-parameterized with the 1203 categories defined in LVIS. Each model is run on the RDK X5 with either 1 thread or 8 threads, in INT16 and INT8 quantization modes.

	| Model | FPS (1 thread) | FPS (8 threads) |
	|:---:|:---:|:---:|
	| YOLO-Worldv2-S<br/>(INT16/INT8) | 5.962/11.044 | 6.386/12.590 |
	| YOLO-Worldv2-M<br/>(INT16/INT8) | 4.136/7.290 | 4.340/7.930 |
	| YOLO-Worldv2-L<br/>(INT16/INT8) | 2.958/5.377 | 3.060/5.720 |
	| DOSOD-S<br/>(INT16/INT8) | 12.527/31.020 | 14.657/47.328 |
	| DOSOD-M<br/>(INT16/INT8) | 8.531/20.238 | 9.471/26.36 |
	| DOSOD-L<br/>(INT16/INT8) | 5.663/12.799 | 6.069/14.939 |
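
Each cell in the table packs two numbers as `INT16/INT8`. The sketch below shows one way to unpack the cells and compare DOSOD with YOLO-Worldv2 at matching model size, thread count, and quantization mode. The cell strings are copied from the table; the helper names and dictionary layout are assumptions made for this example.

```python
# (1-thread cell, 8-thread cell), each formatted as "INT16/INT8" FPS.
RDK_X5_FPS = {
    "YOLO-Worldv2-S": ("5.962/11.044", "6.386/12.590"),
    "YOLO-Worldv2-M": ("4.136/7.290", "4.340/7.930"),
    "YOLO-Worldv2-L": ("2.958/5.377", "3.060/5.720"),
    "DOSOD-S": ("12.527/31.020", "14.657/47.328"),
    "DOSOD-M": ("8.531/20.238", "9.471/26.36"),
    "DOSOD-L": ("5.663/12.799", "6.069/14.939"),
}
THREAD_COLUMNS = {1: 0, 8: 1}  # thread count -> column index in the tuples above


def parse_cell(cell: str) -> dict:
    """Split an 'INT16/INT8' cell into a mode -> FPS mapping."""
    int16_fps, int8_fps = (float(value) for value in cell.split("/"))
    return {"INT16": int16_fps, "INT8": int8_fps}


def fps(model: str, threads: int, mode: str) -> float:
    """Look up the FPS of `model` for a thread count (1 or 8) and a mode."""
    return parse_cell(RDK_X5_FPS[model][THREAD_COLUMNS[threads]])[mode]


def dosod_speedup(size: str, threads: int, mode: str) -> float:
    """FPS ratio of DOSOD over YOLO-Worldv2 for the same size and settings."""
    return fps(f"DOSOD-{size}", threads, mode) / fps(f"YOLO-Worldv2-{size}", threads, mode)


for size in ("S", "M", "L"):
    for mode in ("INT16", "INT8"):
        print(f"{size} {mode} 8 threads: {dosod_speedup(size, 8, mode):.2f}x")
```

Reading the ratios this way makes it easy to see that, in the table, the gap between the two model families is widest for the small models in INT8 mode with 8 threads.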