ViTamin Family
Designing Scalable Vision Models in the Vision-Language Era. The best-performing model is 'jienengchen/ViTamin-XL-384px'.
jienengchen/ViTamin-XL-384px
Feature Extraction • Updated • 93 • 18
Note: ViTamin-XL, with only 436M parameters and trained on the public DataComp-1B dataset, achieves an impressive 82.9% 🔥 zero-shot ImageNet accuracy.
jienengchen/ViTamin-L-336px
Feature Extraction • Updated • 10 • 4
Note: ViTamin-L, with 333M parameters, sets a new SOTA 🔥 across seven benchmarks for open-vocabulary segmentation and also significantly advances the capabilities of large multi-modal models 🌋.
ViTamin: Designing Scalable Vision Models in the Vision-Language Era
Paper • 2404.02132 • Published • 2
jienengchen/ViTamin-XL-336px
Feature Extraction • Updated • 8 • 1
jienengchen/ViTamin-XL-256px
Feature Extraction • Updated • 10
jienengchen/ViTamin-L2-384px
Feature Extraction • Updated • 11
jienengchen/ViTamin-L2-336px
Feature Extraction • Updated • 13
jienengchen/ViTamin-L2-256px
Feature Extraction • Updated • 10
jienengchen/ViTamin-L-384px
Feature Extraction • Updated • 15 • 1
jienengchen/ViTamin-L-256px
Feature Extraction • Updated • 7
jienengchen/ViTamin-L-224px
Feature Extraction • Updated • 12
jienengchen/ViTamin-B-LTT
Feature Extraction • Updated • 10
Note: Achieves 70.8% zero-shot ImageNet accuracy with 88M parameters.
jienengchen/ViTamin-S-LTT
Feature Extraction • Updated • 9
Note: Achieves 63.4% zero-shot ImageNet accuracy with 22M parameters.
jienengchen/ViTamin-B
Feature Extraction • Updated • 13
Note: Achieves 68.9% zero-shot ImageNet accuracy with 88M parameters.
jienengchen/ViTamin-S
Feature Extraction • Updated • 17
Note: Achieves 62.2% zero-shot ImageNet accuracy with 22M parameters.
jienengchen/ViTamin-L2-224px
Feature Extraction • Updated • 9
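
The notes above quote zero-shot ImageNet accuracy and parameter counts for several checkpoints. The sketch below tabulates those figures so they can be compared side by side. It only copies numbers from the notes and measures nothing. The dictionary layout and the helper names `ltt_gain` and `ranked_by_accuracy` are hypothetical, introduced purely for illustration. The page does not say what the `-LTT` suffix means, so the code treats it only as a naming convention.

```python
# Figures copied verbatim from the notes in this collection.
# params_m: parameter count in millions; zero_shot_top1: zero-shot ImageNet accuracy (%).
NOTED_RESULTS = {
    "ViTamin-S": {"params_m": 22, "zero_shot_top1": 62.2},
    "ViTamin-S-LTT": {"params_m": 22, "zero_shot_top1": 63.4},
    "ViTamin-B": {"params_m": 88, "zero_shot_top1": 68.9},
    "ViTamin-B-LTT": {"params_m": 88, "zero_shot_top1": 70.8},
    "ViTamin-XL": {"params_m": 436, "zero_shot_top1": 82.9},
}


def ltt_gain(base_name: str) -> float:
    """Return the accuracy difference, in percentage points, between a
    base model and its '-LTT' counterpart (hypothetical helper)."""
    base = NOTED_RESULTS[base_name]["zero_shot_top1"]
    ltt = NOTED_RESULTS[base_name + "-LTT"]["zero_shot_top1"]
    # Round to one decimal to hide floating-point noise in the subtraction.
    return round(ltt - base, 1)


def ranked_by_accuracy() -> list[str]:
    """Return model names sorted from highest to lowest noted accuracy."""
    return sorted(
        NOTED_RESULTS,
        key=lambda name: NOTED_RESULTS[name]["zero_shot_top1"],
        reverse=True,
    )


if __name__ == "__main__":
    for name in ranked_by_accuracy():
        row = NOTED_RESULTS[name]
        print(f"{name:15s} {row['params_m']:4d}M  {row['zero_shot_top1']:.1f}%")
    for base in ("ViTamin-S", "ViTamin-B"):
        print(f"{base}: LTT variant differs by {ltt_gain(base)} points")
```

ViTamin-L is left out because its note reports a parameter count (333M) but no zero-shot ImageNet figure.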