
Just read the ViTMAE paper, sharing some highlights 🧶 ViTMAE is a simple yet effective self-supervised pre-training technique, where the authors combine a vision transformer with a masked autoencoder.
The image is first masked (75 percent of its patches!) and the model then learns features by trying to reconstruct the original image!

image_1

Importantly, the encoder never sees the masked patches: only the visible patches are fed to it (and that is the only thing the encoder sees!).
Next, mask tokens are inserted where the masked patches were (a bit like BERT, if you will), and the mask tokens together with the encoded patches are fed to the decoder.
The decoder then tries to reconstruct the original image.
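The masking and restore steps above can be sketched in a few lines of NumPy. This is a minimal illustration of the idea (random per-patch shuffle, keep 25%, reinsert mask tokens in original order), not the authors' actual implementation; the function and variable names here are my own.

```python
import numpy as np

def random_masking(patches, mask_ratio=0.75, seed=0):
    """MAE-style random masking: keep a random subset of patch tokens.

    patches: (num_patches, dim) array of patch embeddings.
    Returns the visible patches, a binary mask (1 = masked),
    and the indices needed to restore the original patch order.
    """
    rng = np.random.default_rng(seed)
    n, _ = patches.shape
    n_keep = int(n * (1 - mask_ratio))

    noise = rng.random(n)              # one random score per patch
    ids_shuffle = np.argsort(noise)    # first n_keep of these are kept
    ids_restore = np.argsort(ids_shuffle)

    visible = patches[ids_shuffle[:n_keep]]  # only these reach the encoder

    mask = np.ones(n)                  # 1 = masked, 0 = visible
    mask[:n_keep] = 0
    mask = mask[ids_restore]           # back to original patch order
    return visible, mask, ids_restore

# 196 patches (a 14x14 grid for a 224px image with 16px patches)
patches = np.arange(196 * 8, dtype=float).reshape(196, 8)
visible, mask, ids_restore = random_masking(patches)
print(visible.shape)    # (49, 8) -> only 25% of patches go to the encoder

# Decoder input: encoded visible patches (here left un-encoded for
# simplicity) plus a shared mask token at every masked position,
# unshuffled back to the original patch order.
mask_token = np.zeros((1, 8))
combined = np.concatenate([visible, np.repeat(mask_token, 196 - 49, axis=0)])
dec_input = combined[ids_restore]
print(dec_input.shape)  # (196, 8)
```

The `ids_restore` trick is what lets the decoder know where each visible patch originally sat, so positional information survives the shuffle.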

image_2

As a result, the authors found that a high masking ratio works well both for fine-tuning on downstream tasks and for linear probing 🤯🤯

image_3

If you want to try the model or fine-tune it, all the pre-trained ViTMAE models released by Meta are available on Hugging Face.
We've built a demo for you to see the intermediate outputs and the reconstruction produced by ViTMAE.
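A minimal sketch of loading one of Meta's pre-trained checkpoints with the `transformers` library, assuming the `facebook/vit-mae-base` checkpoint (a random image stands in for a real photo):

```python
import numpy as np
import torch
from PIL import Image
from transformers import AutoImageProcessor, ViTMAEForPreTraining

# Load Meta's pre-trained base model (masking ratio defaults to 0.75)
processor = AutoImageProcessor.from_pretrained("facebook/vit-mae-base")
model = ViTMAEForPreTraining.from_pretrained("facebook/vit-mae-base")

# Any RGB image works; here a random 224x224 one for illustration
image = Image.fromarray(np.random.randint(0, 255, (224, 224, 3), dtype=np.uint8))
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# logits hold the per-patch pixel reconstructions: (1, 196, 16*16*3)
print(outputs.logits.shape)
# mask marks which patches were hidden from the encoder (1 = masked)
print(outputs.mask.sum().item() / outputs.mask.numel())  # 0.75
```

From `outputs.logits` and `outputs.mask` you can stitch the reconstructed patches back into an image, which is essentially what the demo visualizes.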

Also there's a nice notebook by @NielsRogge.

image_4

Resources:
Masked Autoencoders Are Scalable Vision Learners by Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick (2021)
GitHub
Hugging Face documentation

Original tweet (December 29, 2023)