Just read the ViTMAE paper, sharing some highlights 🧶 ViTMAE is a simple yet effective self-supervised pre-training technique, where the authors combined a vision transformer with a masked autoencoder.
The images are first masked (75 percent of the image!) and the model then learns the features by trying to reconstruct the original image!
![image_1](image_1.jpg)
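If you want to poke at this yourself, here's a minimal sketch using 🤗 Transformers; the `facebook/vit-mae-base` checkpoint and the sample image URL are my own assumptions, and any ViTMAE checkpoint on the Hub should work the same way.

```python
import torch
import requests
from PIL import Image
from transformers import AutoImageProcessor, ViTMAEForPreTraining

checkpoint = "facebook/vit-mae-base"  # assumed checkpoint name
processor = AutoImageProcessor.from_pretrained(checkpoint)
model = ViTMAEForPreTraining.from_pretrained(checkpoint)

print(model.config.mask_ratio)  # 0.75 by default: 75% of the patches are hidden

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # any RGB image works
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

print(outputs.loss)  # pixel reconstruction loss over the masked patches
```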
The encoder is not given a masked-out image; rather, only the visible patches are fed to it (and that is the only thing the encoder sees!).
Next, mask tokens are added where the masked patches were (a bit like BERT, if you will), and the mask tokens together with the encoded patches are fed to the decoder.
The decoder then tries to reconstruct the original image.
![image_2](image_2.jpg)
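A rough way to see this asymmetry in code (the shapes below assume the default 224×224 input with 16×16 patches, i.e. 196 patches, of which 49 stay visible at the 75% masking ratio):

```python
import torch
import requests
from PIL import Image
from transformers import AutoImageProcessor, ViTMAEForPreTraining

processor = AutoImageProcessor.from_pretrained("facebook/vit-mae-base")
model = ViTMAEForPreTraining.from_pretrained("facebook/vit-mae-base")

image = Image.open(requests.get(
    "http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    # the encoder (model.vit) only ever sees the visible patches plus the CLS token:
    # (batch, 1 + 49 visible patches, hidden) -> torch.Size([1, 50, 768])
    print(model.vit(**inputs).last_hidden_state.shape)

    # the full model adds mask tokens and decodes raw pixels for every patch:
    # (batch, 196 patches, 16*16*3 pixel values) -> torch.Size([1, 196, 768])
    outputs = model(**inputs)
    print(outputs.logits.shape)

    # binary mask of which patches were hidden (1 = masked): 147 of 196
    print(outputs.mask.sum())
```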
As a result, the authors found that a high masking ratio works well both for fine-tuning on downstream tasks and for linear probing 🤯🤯
![image_3](image_3.jpg)
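In the Transformers implementation the masking ratio is just a config value, so if you pre-train from scratch you can experiment with it yourself; a tiny sketch (the config below keeps the ViT-Base-style defaults):

```python
from transformers import ViTMAEConfig, ViTMAEForPreTraining

# the paper's sweet spot is a high ratio (0.75); lower ratios make the task too easy
config = ViTMAEConfig(mask_ratio=0.75)
model = ViTMAEForPreTraining(config)  # randomly initialized, ready for pre-training
print(model.config.mask_ratio)
```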
If you want to try the model or fine-tune it, all the pre-trained ViTMAE models released by Meta are available on [Hugging Face](https://t.co/didvTL9Zkm).
We've built a [demo](https://t.co/PkuACJiKrB) for you to see the intermediate outputs and reconstruction by ViTMAE.
Also there's a nice [notebook](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/ViTMAE/ViT_MAE_visualization_demo.ipynb) by [@NielsRogge](https://twitter.com/NielsRogge).
![image_4](image_4.jpg)
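For fine-tuning, the decoder is thrown away and the pre-trained encoder is reused as a regular ViT backbone. A minimal sketch, assuming the encoder weights of `facebook/vit-mae-base` load into `ViTForImageClassification` (the decoder weights should simply be dropped on load and the classification head freshly initialized; `num_labels=10` is just a made-up downstream setting):

```python
from transformers import AutoImageProcessor, ViTForImageClassification

# only the MAE-pre-trained encoder is reused; the decoder weights in the
# checkpoint are ignored and a new classification head is initialized
model = ViTForImageClassification.from_pretrained(
    "facebook/vit-mae-base",  # assumed checkpoint name
    num_labels=10,            # hypothetical 10-class downstream dataset
)
processor = AutoImageProcessor.from_pretrained("facebook/vit-mae-base")
# from here, fine-tune as usual (e.g. with the Trainer API)
```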
> [!TIP]
Resources:
[Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377v3)
by Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick (2021)
[GitHub](https://github.com/facebookresearch/mae)
[Hugging Face documentation](https://huggingface.co/docs/transformers/model_doc/vit_mae)
> [!NOTE]
[Original tweet](https://twitter.com/mervenoyann/status/1740688304784183664) (December 29, 2023)