
Read the MobileSAM paper this weekend 📖 Sharing some insights!
The idea 💡: The SAM model consists of three parts: a heavy image encoder, a prompt encoder (the prompt can be text, a bounding box, a mask or a point) and a mask decoder.
To make SAM smaller without compromising performance, the authors looked into three types of distillation.
The first is distilling the decoder outputs directly (the more naive approach) into a completely randomly initialized small ViT and a randomly initialized mask decoder.
However, when the ViT and the decoder are both in a bad state, this doesn't work well.

image_1

The second type of distillation is called semi-coupled: the authors randomly initialized only the ViT image encoder and kept the mask decoder. It is called semi-coupled because the image encoder distillation still depends on the mask decoder (see below 👇)

image_2

The last type, decoupled distillation, is the most intuitive IMO. The authors "decoupled" the image encoder altogether, froze the mask decoder, and didn't distill based on the generated masks at all. This makes sense, since the bottleneck here is the encoder itself and, most of the time, distillation works well on encoders.

image_3
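To make the decoupled idea concrete, here's a minimal toy sketch in plain numpy. It is not the paper's implementation (the real teacher is SAM's ViT-H and the student is a small ViT trained on image embeddings); both encoders here are hypothetical stand-in linear maps, just to show the core loop: match the student's image embeddings to the frozen teacher's with an MSE loss, with the mask decoder never involved.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the encoders. In MobileSAM the teacher is
# SAM's heavy ViT image encoder and the student is a small ViT; here both
# are random linear maps that produce embeddings of the same shape.
def encode(images, W):
    return images @ W

def distill_step(images, W_teacher, W_student, lr=0.1):
    """One step of decoupled distillation: regress the student's image
    embeddings onto the frozen teacher's embeddings (MSE loss).
    No prompt encoder, no mask decoder, no generated masks."""
    target = encode(images, W_teacher)   # teacher is frozen: no gradient
    pred = encode(images, W_student)
    diff = pred - target
    loss = np.mean(diff ** 2)
    # Gradient of the MSE w.r.t. W_student for a linear student.
    grad = 2.0 * images.T @ diff / diff.size
    return loss, W_student - lr * grad

images = rng.normal(size=(8, 16))        # toy batch of "flattened images"
W_teacher = rng.normal(size=(16, 4))     # frozen teacher weights
W_student = rng.normal(size=(16, 4))     # trainable student weights

losses = []
for _ in range(200):
    loss, W_student = distill_step(images, W_teacher, W_student)
    losses.append(loss)
```

The loss steadily drops as the student embeddings converge to the teacher's, which is the whole trick: once the small encoder reproduces the teacher's embeddings, the original (frozen) mask decoder can be reused as-is.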

Finally, they found that decoupled distillation performs better than coupled distillation in terms of mean IoU and requires much less compute! ♥️

image_4

Wanted to leave some links here if you'd like to try it yourself 👇

image_5

Resources:
Faster Segment Anything: Towards Lightweight SAM for Mobile Applications by Chaoning Zhang, Dongshen Han, Yu Qiao, Jung Uk Kim, Sung-Ho Bae, Seungkyu Lee, Choong Seon Hong (2023) GitHub

Original tweet (December 24, 2023)