Read the MobileSAM paper this weekend 📖 Sharing some insights!
The idea 💡: The SAM model consists of three parts: a heavy image encoder, a prompt encoder (the prompt can be text, a bounding box, a mask, or a point), and a mask decoder.
To make SAM smaller without compromising performance, the authors looked into three types of distillation.
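If you want to poke at those three parts yourself, the SAM port in 🤗 transformers exposes them as submodules (a quick sketch; the attribute names below are from the transformers implementation, not the original SAM repo):

```python
from transformers import SamModel

# Load the original (heavy) SAM checkpoint from the Hub
model = SamModel.from_pretrained("facebook/sam-vit-huge")

print(model.vision_encoder)  # heavy ViT image encoder (the bottleneck)
print(model.prompt_encoder)  # encodes points / boxes / masks
print(model.mask_decoder)    # lightweight mask decoder
```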
The first is coupled distillation: distilling the decoder outputs directly (the naive approach), starting from a completely randomly initialized small ViT and a randomly initialized mask decoder.
However, when the ViT and the decoder both start in a bad state, this doesn't work well.
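As a rough sketch of what that coupled objective looks like (`teacher`, `student`, and the plain MSE on mask logits are my stand-ins, not the paper's exact pipeline):

```python
import torch
import torch.nn.functional as F

def coupled_distillation_step(image, prompts, teacher, student, optimizer):
    # Teacher SAM produces the target mask logits; no gradients needed
    with torch.no_grad():
        teacher_masks = teacher(image, prompts)

    # Student = small ViT encoder + its own mask decoder, both trained from scratch
    student_masks = student(image, prompts)
    loss = F.mse_loss(student_masks, teacher_masks)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```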
The second type of distillation is called semi-coupled: here the authors randomly initialized only the ViT image encoder and kept SAM's mask decoder. It is called semi-coupled because the image encoder distillation still depends on the mask decoder (see below 👇)
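A minimal sketch of the semi-coupled variant, assuming hypothetical `teacher`, `student_encoder`, and `frozen_decoder` callables; the key difference is that only the encoder's parameters are optimized, yet the loss still flows through the decoder:

```python
import torch
import torch.nn.functional as F

def semi_coupled_step(image, prompts, teacher, student_encoder,
                      frozen_decoder, optimizer):
    with torch.no_grad():
        teacher_masks = teacher(image, prompts)

    # Only the student encoder is trained; SAM's decoder is kept frozen
    embedding = student_encoder(image)
    student_masks = frozen_decoder(embedding, prompts)
    loss = F.mse_loss(student_masks, teacher_masks)

    optimizer.zero_grad()
    loss.backward()   # gradients reach only the student encoder
    optimizer.step()  # optimizer holds encoder parameters only
    return loss.item()
```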
The last type, decoupled distillation, is the most intuitive IMO. The authors "decoupled" the image encoder altogether: they froze the mask decoder and distilled the image encoder directly on image embeddings, not on generated masks. This makes sense, as the bottleneck here is the encoder itself, and distillation tends to work well on feature representations.
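In code, the decoupled objective reduces to a simple regression between image embeddings, something like this sketch (the plain MSE loss and the embedding shape are assumptions based on the paper's description; `teacher_encoder` and `student_encoder` are hypothetical callables):

```python
import torch
import torch.nn.functional as F

def decoupled_distillation_step(image, teacher_encoder, student_encoder, optimizer):
    # Target: the heavy SAM ViT's image embedding (no gradients)
    with torch.no_grad():
        target = teacher_encoder(image)   # e.g. shape (B, 256, 64, 64) for SAM

    # Student (e.g. TinyViT) regresses the same embedding directly;
    # the mask decoder is frozen and never enters the loss
    pred = student_encoder(image)
    loss = F.mse_loss(pred, target)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```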
Finally, they found that decoupled distillation outperforms coupled distillation in terms of mean IoU while requiring much less compute! ♥️
Wanted to leave some links here if you'd like to try yourself 👇
- MobileSAM demo
- Model repository
If you'd like to experiment with TinyViT, the timm library has a bunch of checkpoints available (see the snippet below 👇).
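For instance (the exact checkpoint tag below is an assumption; list the available ones with `timm.list_models` and pick whichever variant you like):

```python
import timm
import torch

# Discover the TinyViT checkpoints timm ships
print(timm.list_models("tiny_vit*", pretrained=True))

# Load one of them and run a dummy forward pass
model = timm.create_model("tiny_vit_5m_224.dist_in22k_ft_in1k", pretrained=True)
model.eval()

x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    logits = model(x)
print(logits.shape)  # torch.Size([1, 1000])
```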
Resources:
- Faster Segment Anything: Towards Lightweight SAM for Mobile Applications by Chaoning Zhang, Dongshen Han, Yu Qiao, Jung Uk Kim, Sung-Ho Bae, Seungkyu Lee, Choong Seon Hong (2023) GitHub
- Original tweet (December 24, 2023)