license: apache-2.0

DETR-Resnet50 (semantic segmentation) Core ML Models

See the Files tab for converted models.

DEtection TRansformer (DETR) model trained end-to-end on COCO 2017 object detection (118k annotated images). It was introduced in the paper "End-to-End Object Detection with Transformers" by Carion et al. and first released in this repository.

Disclaimer: The team releasing DETR did not write a model card for this model, so this model card was written by the Hugging Face team.

Model description

The DETR model is an encoder-decoder transformer with a convolutional backbone. Two heads are added on top of the decoder outputs to perform object detection: a linear layer for the class labels and an MLP (multi-layer perceptron) for the bounding boxes. The model uses so-called object queries to detect objects in an image. Each object query looks for a particular object in the image. For COCO, the number of object queries is set to 100.
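The two prediction heads described above can be illustrated with a minimal pure-Python sketch. It is not the DETR implementation: the hidden width, the toy class count, the two-layer MLP depth, the random weights, and the sigmoid that squashes box outputs into [0, 1] are all assumptions chosen for illustration. Only the count of 100 queries, the linear class head, and the MLP box head come from the text.

```python
import math
import random

NUM_QUERIES = 100  # from the text: 100 object queries for COCO
HIDDEN_DIM = 8     # assumption: tiny width, for illustration only
NUM_CLASSES = 5    # assumption: toy label count; one extra slot means "no object"


def init_linear(rng, n_in, n_out):
    """Create random weights and zero biases for a linear layer."""
    weight = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weight, [0.0] * n_out


def linear(x, weight, bias):
    """Apply y = W x + b to a single vector."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def sigmoid(x):
    return [1.0 / (1.0 + math.exp(-v)) for v in x]


rng = random.Random(0)
class_head = init_linear(rng, HIDDEN_DIM, NUM_CLASSES + 1)  # linear layer
box_mlp = [init_linear(rng, HIDDEN_DIM, HIDDEN_DIM),        # MLP, layer 1
           init_linear(rng, HIDDEN_DIM, 4)]                 # MLP, layer 2


def predict(decoder_outputs):
    """Map each decoder output vector to (class logits, box)."""
    results = []
    for h in decoder_outputs:
        logits = linear(h, *class_head)
        hidden = relu(linear(h, *box_mlp[0]))
        box = sigmoid(linear(hidden, *box_mlp[1]))  # assumption: normalized box
        results.append((logits, box))
    return results


# Stand-in for the transformer decoder: one random vector per object query.
decoder_outputs = [[rng.gauss(0, 1) for _ in range(HIDDEN_DIM)]
                   for _ in range(NUM_QUERIES)]
predictions = predict(decoder_outputs)
print(len(predictions), len(predictions[0][0]), len(predictions[0][1]))
```

Each query yields exactly one class prediction and one box, so the model always emits a fixed-size set of 100 candidates per image.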

The model is trained using a "bipartite matching loss": the predicted classes and bounding boxes of each of the N = 100 object queries are compared with the ground truth annotations, padded up to the same length N (so if an image contains only 4 objects, the remaining 96 annotations have "no object" as class and "no bounding box" as bounding box). The Hungarian matching algorithm is used to create an optimal one-to-one mapping between the N queries and the N annotations. Next, standard cross-entropy (for the classes) and a linear combination of the L1 and generalized IoU losses (for the bounding boxes) are used to optimize the parameters of the model.
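The matching step can be sketched on a toy example. To stay dependency-free, the sketch below swaps the Hungarian algorithm for a brute-force search over all permutations, which finds the same optimum but is only feasible for very small N. The cost weights (5.0 and 2.0), the `NO_OBJECT` marker, the corner-format boxes, and the zero cost for padded slots are assumptions made for illustration, not values taken from this card.

```python
import itertools

NO_OBJECT = -1  # hypothetical marker for padded "no object" annotations


def l1(a, b):
    """L1 distance between two boxes."""
    return sum(abs(x - y) for x, y in zip(a, b))


def giou(a, b):
    """Generalized IoU for boxes given as (x0, y0, x1, y1) with positive area."""
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = area_a + area_b - inter
    # Smallest axis-aligned box enclosing both inputs.
    hull = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    return inter / union - (hull - union) / hull


def pair_cost(pred, target, w_l1=5.0, w_giou=2.0):
    """Cost of assigning one query to one annotation (weights are assumptions)."""
    probs, box = pred
    label, target_box = target
    if label == NO_OBJECT:
        return 0.0  # padded slots contribute a constant in this sketch
    return -probs[label] + w_l1 * l1(box, target_box) + w_giou * (1.0 - giou(box, target_box))


def match(preds, targets):
    """Brute-force optimal one-to-one assignment (stand-in for Hungarian)."""
    n = len(preds)
    padded = targets + [(NO_OBJECT, None)] * (n - len(targets))
    best = min(
        itertools.permutations(range(n)),
        key=lambda perm: sum(pair_cost(preds[q], padded[t]) for q, t in enumerate(perm)),
    )
    return list(enumerate(best))


# Three queries; class probabilities are (class 0, class 1, "no object").
preds = [
    ([0.10, 0.80, 0.10], (0.1, 0.1, 0.4, 0.4)),
    ([0.70, 0.20, 0.10], (0.5, 0.5, 0.9, 0.9)),
    ([0.05, 0.05, 0.90], (0.0, 0.0, 0.1, 0.1)),
]
# Two real annotations; the third slot is padded with "no object".
targets = [(0, (0.5, 0.5, 0.9, 0.9)), (1, (0.1, 0.1, 0.4, 0.4))]

matches = match(preds, targets)
print(matches)  # [(0, 1), (1, 0), (2, 2)]
```

Query 0 is paired with annotation 1, query 1 with annotation 0, and query 2 with the padded slot. For N = 100 a polynomial-time solver would be needed in place of the permutation search.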

Evaluation - Variants

Variant	Parameters	Size (MB)	Weight Precision	Activation Precision	IoU	Pixel Accuracy
facebook/detr-resnet-50-panoptic (PyTorch)	43M	172	Float32	Float32	0.393	0.746
DETRResnet50SemanticSegmentationF32	43M	171	Float32	Float32	0.393	0.746
DETRResnet50SemanticSegmentationF16	43M	86	Float16	Float16	0.395	0.746
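The sizes in the table are roughly what the parameter count and weight precision imply. The following back-of-the-envelope sketch assumes 1 MB = 10^6 bytes, treats the rounded 43M figure as exact, and ignores any file-format overhead, so it is an estimate rather than a statement of how the sizes were measured.

```python
BYTES_PER_WEIGHT = {"Float32": 4, "Float16": 2}


def estimated_size_mb(num_parameters, precision):
    """Estimate model size from parameter count and weight precision."""
    return num_parameters * BYTES_PER_WEIGHT[precision] / 1_000_000


print(estimated_size_mb(43_000_000, "Float32"))  # 172.0
print(estimated_size_mb(43_000_000, "Float16"))  # 86.0
```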

IoU and pixel accuracy were measured on 512 images from the COCO dataset. The ground truth labels were extracted from the panoptic segmentation annotations and transformed to semantic segmentation masks. Input images were resized so that the smaller edge equals 448, then center-cropped.
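The two metrics can be sketched for a single pair of semantic masks as follows. The card does not say how IoU is aggregated across classes or images, so averaging per-class IoU over the classes present in either mask is an assumption here. The function names and the toy masks are hypothetical.

```python
def _pixel_pairs(pred, truth):
    """Flatten two equally sized 2D masks into (predicted, true) label pairs."""
    return [(p, t) for pred_row, truth_row in zip(pred, truth)
            for p, t in zip(pred_row, truth_row)]


def pixel_accuracy(pred, truth):
    """Fraction of pixels whose predicted label equals the true label."""
    pairs = _pixel_pairs(pred, truth)
    return sum(p == t for p, t in pairs) / len(pairs)


def mean_iou(pred, truth):
    """Per-class intersection over union, averaged over the classes present."""
    pairs = _pixel_pairs(pred, truth)
    classes = sorted({label for pair in pairs for label in pair})
    ious = []
    for c in classes:
        intersection = sum(p == c and t == c for p, t in pairs)
        union = sum(p == c or t == c for p, t in pairs)
        ious.append(intersection / union)
    return sum(ious) / len(ious)


# Toy 2x3 masks with two classes; one pixel is mislabeled.
pred = [[0, 0, 1],
        [0, 1, 1]]
truth = [[0, 0, 1],
         [0, 0, 1]]

print(pixel_accuracy(pred, truth))  # 5 of 6 pixels agree
print(mean_iou(pred, truth))        # mean of 3/4 (class 0) and 2/3 (class 1)
```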

Inference time

The following results refer to DETRResnet50SemanticSegmentationF16. The compute units for MacBook Pro (M1 Max) were manually selected to "CPU and Neural Engine".

Device	OS	Inference time (ms)	Dominant compute unit
iPhone 15 Pro Max	17.5	40	Neural Engine
MacBook Pro (M1 Max)	14.5	43	Neural Engine
iPhone 12 Pro Max	18.0	52	Neural Engine
MacBook Pro (M3 Max)	15.0	29	Neural Engine
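The card does not state how these timings were obtained. As one possible way to restrict a model to "CPU and Neural Engine" from Python and time it on a Mac, the sketch below uses `coremltools` to load the package and a simple warm-up plus median loop. The helper names, the warm-up and run counts, and the `run_once` callable are hypothetical. The model's input names are not given in this card, so the actual prediction call is left to the caller.

```python
import statistics
import time


def load_model(package_path):
    """Load a Core ML package restricted to CPU + Neural Engine (macOS only)."""
    import coremltools as ct  # imported lazily: only needed when loading a model
    return ct.models.MLModel(package_path, compute_units=ct.ComputeUnit.CPU_AND_NE)


def median_latency_ms(run_once, warmup=3, runs=20):
    """Call run_once repeatedly and return the median wall-clock time in ms."""
    for _ in range(warmup):
        run_once()  # warm-up calls are not timed
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        run_once()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


# Usage with a dummy workload standing in for a model prediction call.
print(median_latency_ms(lambda: sum(range(10_000))))
```

With a real model, `run_once` would wrap a single `predict` call on a preprocessed input.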

Download

Install huggingface-hub:

pip install huggingface-hub

To download one of the .mlpackage folders to the models directory, pass its folder name (as listed in the Files tab) to --include:

huggingface-cli download \
  --local-dir models --local-dir-use-symlinks False \
  coreml-projects/detr-resnet50-semantic-segmentation \
  --include "detr-resnet50-semantic-400-float16.mlpackage/*"

To download everything, omit the --include argument.
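The same download can be scripted from Python with `snapshot_download` from the `huggingface_hub` package. This is a sketch: the helper names `include_pattern` and `download_package` are hypothetical, and the repository ID and package name are simply copied from the command above.

```python
def include_pattern(package_name):
    """Build the glob that selects every file inside one .mlpackage folder."""
    return f"{package_name}/*"


def download_package(package_name,
                     repo_id="coreml-projects/detr-resnet50-semantic-segmentation",
                     local_dir="models"):
    """Download a single .mlpackage folder from the Hub into local_dir."""
    from huggingface_hub import snapshot_download  # requires huggingface-hub
    return snapshot_download(
        repo_id=repo_id,
        allow_patterns=[include_pattern(package_name)],
        local_dir=local_dir,
    )


# Pattern for the package named in the command above (no network access needed).
print(include_pattern("detr-resnet50-semantic-400-float16.mlpackage"))
```

Calling `download_package("detr-resnet50-semantic-400-float16.mlpackage")` performs the actual download. Leaving out `allow_patterns` in `snapshot_download` fetches the whole repository.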