Vision Transformer
Vision Transformer (ViT) model pre-trained on ImageNet-21k (14 million images, 21,843 classes) at resolution 224x224. It was introduced in the paper An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale.
The weights were converted from the ViT-L_16.npz file hosted in the GCS buckets of the original repository.
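A minimal usage sketch with the Hugging Face transformers library, assuming this checkpoint is published under the hub id `google/vit-large-patch16-224-in21k` (an assumption based on the ViT-L_16 / ImageNet-21k description above); the image URL is an arbitrary example:

```python
from transformers import ViTImageProcessor, ViTModel
from PIL import Image
import requests

# Example image (arbitrary COCO validation image)
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Hub id is an assumption; substitute the checkpoint you intend to use
processor = ViTImageProcessor.from_pretrained("google/vit-large-patch16-224-in21k")
model = ViTModel.from_pretrained("google/vit-large-patch16-224-in21k")

# Resize/normalize the image to 224x224 and run the encoder
inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)

# One embedding per 16x16 patch (14*14 = 196) plus the [CLS] token
last_hidden_state = outputs.last_hidden_state
print(last_hidden_state.shape)
```

At resolution 224x224 with 16x16 patches, the sequence length is 196 patch tokens plus the `[CLS]` token, so the output has shape `(1, 197, hidden_size)`.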