license: apache-2.0

DETR-Resnet50 (semantic segmentation) Core ML Models

See the Files tab for converted models.

This is a DEtection TRansformer (DETR) model trained end-to-end on COCO 2017 object detection (118k annotated images). It was introduced in the paper End-to-End Object Detection with Transformers by Carion et al. and first released in this repository.

Disclaimer: The team releasing DETR did not write a model card for this model so this model card has been written by the Hugging Face team.

Model description

The DETR model is an encoder-decoder transformer with a convolutional backbone. Two heads are added on top of the decoder outputs in order to perform object detection: a linear layer for the class labels and an MLP (multi-layer perceptron) for the bounding boxes. The model uses so-called object queries to detect objects in an image. Each object query looks for a particular object in the image. For COCO, the number of object queries is set to 100.

The model is trained using a "bipartite matching loss": the predicted classes and bounding boxes of the N = 100 object queries are compared to the ground truth annotations, which are padded up to the same length N (so if an image contains only 4 objects, the remaining 96 annotations have "no object" as the class and "no bounding box" as the bounding box). The Hungarian matching algorithm is used to create an optimal one-to-one mapping between the N queries and the N annotations. Next, standard cross-entropy (for the classes) and a linear combination of the L1 and generalized IoU losses (for the bounding boxes) are used to optimize the parameters of the model.
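
The padding and one-to-one matching described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the DETR training code: it uses N = 3 queries instead of 100, finds the optimal assignment by brute force over permutations rather than with the Hungarian algorithm (both give the same minimum-cost assignment, but brute force is only feasible for tiny N), and the cost function (negative class probability plus L1 distance minus generalized IoU, all with unit weights, and zero cost for padded "no object" slots) is an assumed simplification. All function names are hypothetical.

```python
from itertools import permutations

NO_OBJECT = None  # padding entry for "no object" / "no bounding box"


def l1_distance(box_a, box_b):
    """Sum of absolute coordinate differences of two (x0, y0, x1, y1) boxes."""
    return sum(abs(a - b) for a, b in zip(box_a, box_b))


def generalized_iou(box_a, box_b):
    """Generalized IoU of two boxes; assumes both have positive area."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    # Area of the smallest box enclosing both inputs.
    hull = (max(ax1, bx1) - min(ax0, bx0)) * (max(ay1, by1) - min(ay0, by0))
    return inter / union - (hull - union) / hull


def pair_cost(pred, target):
    """Cost of assigning one prediction to one (possibly padded) target."""
    if target is NO_OBJECT:
        return 0.0  # assumption: padded slots do not influence the matching
    class_probs, box = pred
    label, gt_box = target
    return -class_probs[label] + l1_distance(box, gt_box) - generalized_iou(box, gt_box)


def match(preds, targets):
    """Return, for each query, the index of its matched target or None."""
    n = len(preds)
    padded = list(targets) + [NO_OBJECT] * (n - len(targets))
    # Brute force stand-in for the Hungarian algorithm: try every assignment.
    best = min(
        permutations(range(n)),
        key=lambda perm: sum(pair_cost(preds[q], padded[perm[q]]) for q in range(n)),
    )
    return [t if t < len(targets) else None for t in best]


# Three queries, two annotated objects, so one query is matched to "no object".
preds = [
    ({"cat": 0.10, "dog": 0.80}, (0.5, 0.5, 0.9, 0.9)),
    ({"cat": 0.20, "dog": 0.10}, (0.0, 0.0, 0.1, 0.1)),
    ({"cat": 0.90, "dog": 0.05}, (0.1, 0.1, 0.4, 0.4)),
]
targets = [("cat", (0.1, 0.1, 0.4, 0.4)), ("dog", (0.5, 0.5, 0.9, 0.9))]
assignment = match(preds, targets)
print(assignment)  # [1, None, 0]
```

Here query 0 is matched to the dog annotation, query 2 to the cat annotation, and query 1 to the padded "no object" slot. After matching, the training losses would be computed only between matched pairs.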


Evaluation - Variants

| Variant | Parameters | Size (MB) | Weight Precision | Act. Precision | IoU | Pixel accuracy |
| --- | --- | --- | --- | --- | --- | --- |
| facebook/detr-resnet-50-panoptic (PyTorch) | 43M | 172 | Float32 | Float32 | 0.393 | 0.746 |
| DETRResnet50SemanticSegmentationF32 | 43M | 171 | Float32 | Float32 | 0.393 | 0.746 |
| DETRResnet50SemanticSegmentationF16 | 43M | 86 | Float16 | Float16 | 0.395 | 0.746 |
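
The sizes in the table follow roughly from the parameter count and the weight precision. A quick back-of-the-envelope check, assuming 1 MB = 10^6 bytes, treating the rounded "43M" as exactly 43,000,000 parameters, and ignoring any non-weight data in the files (the helper name is hypothetical):

```python
def weight_size_mb(num_params, bytes_per_weight):
    """Approximate storage for the weights alone, in MB (10**6 bytes)."""
    return num_params * bytes_per_weight / 1_000_000


params = 43_000_000  # rounded parameter count from the table
float32_mb = weight_size_mb(params, 4)  # Float32 uses 4 bytes per weight
float16_mb = weight_size_mb(params, 2)  # Float16 uses 2 bytes per weight
print(float32_mb, float16_mb)  # 172.0 86.0
```

Small deviations from the reported sizes (for example 171 MB for the Float32 Core ML model) are expected because the parameter count is rounded.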

IoU and Pixel accuracy measured on 512 images from the COCO dataset. The ground truth labels were extracted from the panoptic segmentation annotations, transformed to semantic segmentation masks. Input images were resized so that the smaller edge equals 448, then center-cropped.
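
For reference, the two metrics can be sketched as follows for a single pair of semantic segmentation masks. This is a minimal illustration, not the evaluation script used for the table: the card does not say how IoU is averaged, so the sketch assumes a mean over the classes that appear in either mask, and the function names are hypothetical.

```python
def _pixel_pairs(pred, truth):
    """Flatten two equally sized 2D masks into (predicted, true) label pairs."""
    return [(p, t) for pred_row, true_row in zip(pred, truth)
            for p, t in zip(pred_row, true_row)]


def pixel_accuracy(pred, truth):
    """Fraction of pixels whose predicted class equals the ground truth."""
    pairs = _pixel_pairs(pred, truth)
    return sum(p == t for p, t in pairs) / len(pairs)


def mean_iou(pred, truth):
    """Per-class intersection over union, averaged over classes present."""
    pairs = _pixel_pairs(pred, truth)
    classes = {label for pair in pairs for label in pair}
    ious = []
    for c in sorted(classes):
        inter = sum(p == c and t == c for p, t in pairs)
        union = sum(p == c or t == c for p, t in pairs)
        ious.append(inter / union)
    return sum(ious) / len(ious)


# Toy 2x3 masks with two classes (0 and 1); one pixel is misclassified.
pred_mask = [[0, 0, 1],
             [0, 1, 1]]
true_mask = [[0, 0, 1],
             [0, 0, 1]]
acc = pixel_accuracy(pred_mask, true_mask)  # 5 of 6 pixels correct
miou = mean_iou(pred_mask, true_mask)       # (3/4 + 2/3) / 2
print(round(acc, 3), round(miou, 3))
```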

Inference time

The following results refer to DETRResnet50SemanticSegmentationF16. The compute units for MacBook Pro (M1 Max) were manually selected to "CPU and Neural Engine".

| Device | OS | Inference time (ms) | Dominant compute unit |
| --- | --- | --- | --- |
| iPhone 15 Pro Max | 17.5 | 40 | Neural Engine |
| MacBook Pro (M1 Max) | 14.5 | 43 | Neural Engine |
| iPhone 12 Pro Max | 18.0 | 52 | Neural Engine |
| MacBook Pro (M3 Max) | 15.0 | 29 | Neural Engine |

Download

Install huggingface-hub

pip install huggingface-hub

To download one of the .mlpackage folders to the models directory:

huggingface-cli download \
  --local-dir models --local-dir-use-symlinks False \
  coreml-projects/detr-resnet50-semantic-segmentation \
  --include "detr-resnet50-semantic-400-float16.mlpackage/*"

To download everything, skip the --include argument.
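
The same download can also be scripted from Python. The sketch below assumes `snapshot_download` from the `huggingface_hub` package with its `allow_patterns` argument, which plays the role of `--include` above; the wrapper functions `include_pattern` and `download_model` are hypothetical names introduced only for this example.

```python
REPO_ID = "coreml-projects/detr-resnet50-semantic-segmentation"


def include_pattern(package_name):
    """Glob that matches every file inside one .mlpackage folder."""
    return f"{package_name}/*"


def download_model(package_name=None, local_dir="models"):
    """Download one .mlpackage folder, or the whole repo if no name is given."""
    # Imported lazily so the helper above works without huggingface-hub installed.
    from huggingface_hub import snapshot_download

    patterns = [include_pattern(package_name)] if package_name else None
    return snapshot_download(
        repo_id=REPO_ID,
        local_dir=local_dir,
        allow_patterns=patterns,
    )


# Example call (requires network access and huggingface-hub):
# download_model("detr-resnet50-semantic-400-float16.mlpackage")
```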