SimMIM: A Simple Framework for Masked Image Modeling

This repository primarily hosts the SimMIM pre-trained Swin-V2 models used in the "On Data Scaling in Masked Image Modeling" study. If you have any questions about SimMIM or the Data Scaling study, please file an issue in this repository or contact xie.zda@outlook.com directly. Please note that I am no longer responsible for the SimMIM and Swin-Transformer repositories managed by Microsoft.

SimMIM Pretrained Swin-V2 Models

You can download the checkpoints through the direct links in the table below, or fetch them from Python with the huggingface_hub library.
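For the Python route, the following is a minimal sketch rather than an official recipe. It wraps `hf_hub_download` from huggingface_hub in a small helper; the helper name `download_checkpoint` is hypothetical, and the repository id and checkpoint file name are not given in this README, so they are left as placeholders to be copied from the links in the table.

```python
from pathlib import Path
from typing import Optional


def download_checkpoint(repo_id: str, filename: str,
                        cache_dir: Optional[str] = None) -> Path:
    """Download one file from a Hugging Face Hub repo and return its local path.

    `repo_id` and `filename` must be taken from the "huggingface" links in the
    table; this README does not spell them out, so no defaults are assumed.
    """
    # Imported lazily so that defining the helper does not require the package.
    from huggingface_hub import hf_hub_download

    local_path = hf_hub_download(repo_id=repo_id, filename=filename,
                                 cache_dir=cache_dir)
    return Path(local_path)


# Usage (placeholders, not real identifiers):
#   path = download_checkpoint("<user>/<repo>", "<checkpoint-file>")
#   print(path)
```

The returned path points into the local Hugging Face cache, so repeated calls for the same file should not download it again.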

Model size only includes the backbone weights and excludes weights in the decoders/classification heads.
Batch size for all models is set to 2048.
Validation loss is calculated on the ImageNet-1K validation set.
Fine-tuned acc@1 refers to the top-1 accuracy on the ImageNet-1K validation set after fine-tuning.

name	model size	pre-train dataset	pre-train iterations	validation loss	fine-tuned acc@1	pre-trained model	fine-tuned model
SwinV2-Small	49M	ImageNet-1K 10%	125k	0.4820	82.69	huggingface	huggingface
SwinV2-Small	49M	ImageNet-1K 10%	250k	0.4961	83.11	huggingface	huggingface
SwinV2-Small	49M	ImageNet-1K 10%	500k	0.5115	83.17	huggingface	huggingface
SwinV2-Small	49M	ImageNet-1K 20%	125k	0.4751	83.05	huggingface	huggingface
SwinV2-Small	49M	ImageNet-1K 20%	250k	0.4722	83.56	huggingface	huggingface
SwinV2-Small	49M	ImageNet-1K 20%	500k	0.4734	83.75	huggingface	huggingface
SwinV2-Small	49M	ImageNet-1K 50%	125k	0.4732	83.04	huggingface	huggingface
SwinV2-Small	49M	ImageNet-1K 50%	250k	0.4681	83.67	huggingface	huggingface
SwinV2-Small	49M	ImageNet-1K 50%	500k	0.4646	83.96	huggingface	huggingface
SwinV2-Small	49M	ImageNet-1K	125k	0.4728	82.92	huggingface	huggingface
SwinV2-Small	49M	ImageNet-1K	250k	0.4674	83.66	huggingface	huggingface
SwinV2-Small	49M	ImageNet-1K	500k	0.4641	84.08	huggingface	huggingface
SwinV2-Base	87M	ImageNet-1K 10%	125k	0.4822	83.33	huggingface	huggingface
SwinV2-Base	87M	ImageNet-1K 10%	250k	0.4997	83.60	huggingface	huggingface
SwinV2-Base	87M	ImageNet-1K 10%	500k	0.5112	83.41	huggingface	huggingface
SwinV2-Base	87M	ImageNet-1K 20%	125k	0.4703	83.86	huggingface	huggingface
SwinV2-Base	87M	ImageNet-1K 20%	250k	0.4679	84.37	huggingface	huggingface
SwinV2-Base	87M	ImageNet-1K 20%	500k	0.4711	84.61	huggingface	huggingface
SwinV2-Base	87M	ImageNet-1K 50%	125k	0.4683	84.04	huggingface	huggingface
SwinV2-Base	87M	ImageNet-1K 50%	250k	0.4633	84.57	huggingface	huggingface
SwinV2-Base	87M	ImageNet-1K 50%	500k	0.4598	84.95	huggingface	huggingface
SwinV2-Base	87M	ImageNet-1K	125k	0.4680	84.13	huggingface	huggingface
SwinV2-Base	87M	ImageNet-1K	250k	0.4626	84.65	huggingface	huggingface
SwinV2-Base	87M	ImageNet-1K	500k	0.4588	85.04	huggingface	huggingface
SwinV2-Base	87M	ImageNet-22K	125k	0.4695	84.11	huggingface	huggingface
SwinV2-Base	87M	ImageNet-22K	250k	0.4649	84.57	huggingface	huggingface
SwinV2-Base	87M	ImageNet-22K	500k	0.4614	85.11	huggingface	huggingface
SwinV2-Large	195M	ImageNet-1K 10%	125k	0.4995	83.69	huggingface	huggingface
SwinV2-Large	195M	ImageNet-1K 10%	250k	0.5140	83.66	huggingface	huggingface
SwinV2-Large	195M	ImageNet-1K 10%	500k	0.5150	83.50	huggingface	huggingface
SwinV2-Large	195M	ImageNet-1K 20%	125k	0.4675	84.38	huggingface	huggingface
SwinV2-Large	195M	ImageNet-1K 20%	250k	0.4746	84.71	huggingface	huggingface
SwinV2-Large	195M	ImageNet-1K 20%	500k	0.4960	84.59	huggingface	huggingface
SwinV2-Large	195M	ImageNet-1K 50%	125k	0.4622	84.78	huggingface	huggingface
SwinV2-Large	195M	ImageNet-1K 50%	250k	0.4566	85.38	huggingface	huggingface
SwinV2-Large	195M	ImageNet-1K 50%	500k	0.4530	85.80	huggingface	huggingface
SwinV2-Large	195M	ImageNet-1K	125k	0.4611	84.98	huggingface	huggingface
SwinV2-Large	195M	ImageNet-1K	250k	0.4552	85.45	huggingface	huggingface
SwinV2-Large	195M	ImageNet-1K	500k	0.4507	85.91	huggingface	huggingface
SwinV2-Large	195M	ImageNet-22K	125k	0.4649	84.61	huggingface	huggingface
SwinV2-Large	195M	ImageNet-22K	250k	0.4586	85.39	huggingface	huggingface
SwinV2-Large	195M	ImageNet-22K	500k	0.4536	85.81	huggingface	huggingface
SwinV2-Huge	655M	ImageNet-1K 20%	125k	0.4789	84.35	huggingface	huggingface
SwinV2-Huge	655M	ImageNet-1K 20%	250k	0.5038	84.16	huggingface	huggingface
SwinV2-Huge	655M	ImageNet-1K 20%	500k	0.5071	83.44	huggingface	huggingface
SwinV2-Huge	655M	ImageNet-1K 50%	125k	0.4549	85.09	huggingface	huggingface
SwinV2-Huge	655M	ImageNet-1K 50%	250k	0.4511	85.64	huggingface	huggingface
SwinV2-Huge	655M	ImageNet-1K 50%	500k	0.4559	85.69	huggingface	huggingface
SwinV2-Huge	655M	ImageNet-1K	125k	0.4531	85.23	huggingface	huggingface
SwinV2-Huge	655M	ImageNet-1K	250k	0.4464	85.90	huggingface	huggingface
SwinV2-Huge	655M	ImageNet-1K	500k	0.4416	86.34	huggingface	huggingface
SwinV2-Huge	655M	ImageNet-22K	125k	0.4564	85.14	huggingface	huggingface
SwinV2-Huge	655M	ImageNet-22K	250k	0.4499	85.86	huggingface	huggingface
SwinV2-Huge	655M	ImageNet-22K	500k	0.4444	86.27	huggingface	huggingface
SwinV2-giant	1.06B	ImageNet-1K 50%	125k	0.4534	85.44	huggingface	huggingface
SwinV2-giant	1.06B	ImageNet-1K 50%	250k	0.4515	85.76	huggingface	huggingface
SwinV2-giant	1.06B	ImageNet-1K 50%	500k	0.4719	85.51	huggingface	huggingface
SwinV2-giant	1.06B	ImageNet-1K	125k	0.4513	85.57	huggingface	huggingface
SwinV2-giant	1.06B	ImageNet-1K	250k	0.4442	86.12	huggingface	huggingface
SwinV2-giant	1.06B	ImageNet-1K	500k	0.4395	86.46	huggingface	huggingface
SwinV2-giant	1.06B	ImageNet-22K	125k	0.4544	85.39	huggingface	huggingface
SwinV2-giant	1.06B	ImageNet-22K	250k	0.4475	85.96	huggingface	huggingface
SwinV2-giant	1.06B	ImageNet-22K	500k	0.4416	86.53	huggingface	huggingface
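As an illustration of how the table can be read, the sketch below copies the twelve SwinV2-Small rows into a dictionary and compares validation loss between the 125k and 500k checkpoints for each pre-training dataset. The helper names `loss_change` and `best_checkpoint` are hypothetical and only serve this example; the numbers are taken from the table above.

```python
# (pre-train dataset, iterations) -> (validation loss, fine-tuned acc@1),
# copied from the SwinV2-Small rows of the table.
SWINV2_SMALL = {
    ("ImageNet-1K 10%", "125k"): (0.4820, 82.69),
    ("ImageNet-1K 10%", "250k"): (0.4961, 83.11),
    ("ImageNet-1K 10%", "500k"): (0.5115, 83.17),
    ("ImageNet-1K 20%", "125k"): (0.4751, 83.05),
    ("ImageNet-1K 20%", "250k"): (0.4722, 83.56),
    ("ImageNet-1K 20%", "500k"): (0.4734, 83.75),
    ("ImageNet-1K 50%", "125k"): (0.4732, 83.04),
    ("ImageNet-1K 50%", "250k"): (0.4681, 83.67),
    ("ImageNet-1K 50%", "500k"): (0.4646, 83.96),
    ("ImageNet-1K", "125k"): (0.4728, 82.92),
    ("ImageNet-1K", "250k"): (0.4674, 83.66),
    ("ImageNet-1K", "500k"): (0.4641, 84.08),
}


def loss_change(rows, dataset, start="125k", end="500k"):
    """Validation-loss difference (end minus start) for one dataset."""
    return round(rows[(dataset, end)][0] - rows[(dataset, start)][0], 4)


def best_checkpoint(rows):
    """Key of the row with the highest fine-tuned acc@1."""
    return max(rows, key=lambda key: rows[key][1])


for dataset in ("ImageNet-1K 10%", "ImageNet-1K 20%",
                "ImageNet-1K 50%", "ImageNet-1K"):
    print(dataset, loss_change(SWINV2_SMALL, dataset))

print(best_checkpoint(SWINV2_SMALL))
```

For these rows the validation loss rises with longer training on the 10% subset and falls on the 50% subset and the full ImageNet-1K set, and the highest fine-tuned accuracy comes from the full-dataset 500k checkpoint.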

Citations

Citing SimMIM

@inproceedings{xie2021simmim,
  title={SimMIM: A Simple Framework for Masked Image Modeling},
  author={Xie, Zhenda and Zhang, Zheng and Cao, Yue and Lin, Yutong and Bao, Jianmin and Yao, Zhuliang and Dai, Qi and Hu, Han},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2022}
}

Citing "On Data Scaling in Masked Image Modeling"

@article{xie2022data,
  title={On Data Scaling in Masked Image Modeling},
  author={Xie, Zhenda and Zhang, Zheng and Cao, Yue and Lin, Yutong and Wei, Yixuan and Dai, Qi and Hu, Han},
  journal={arXiv preprint arXiv:2206.04664},
  year={2022}
}

Citing Swin V2

@inproceedings{liu2021swinv2,
  title={Swin Transformer V2: Scaling Up Capacity and Resolution},
  author={Liu, Ze and Hu, Han and Lin, Yutong and Yao, Zhuliang and Xie, Zhenda and Wei, Yixuan and Ning, Jia and Cao, Yue and Zhang, Zheng and Dong, Li and Wei, Furu and Guo, Baining},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2022}
}