Update README.md
Browse files
README.md
CHANGED
@@ -8,10 +8,6 @@ datasets:
|
|
8 |
---
|
9 |
|
10 |
# Vision Transformer (base-sized model) - Hybrid
|
11 |
-
| ![Pull figure](https://s3.amazonaws.com/moonup/production/uploads/1670350379252-62441d1d9fdefb55a0b7d12c.png) |
|
12 |
-
|:--:|
|
13 |
-
| <b> Figure 1 from the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) describing the model's architecture </b>|
|
14 |
-
|
15 |
|
16 |
The hybrid Vision Transformer (ViT) model was proposed in [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby. It's the first paper that successfully trains a Transformer encoder on ImageNet, attaining very good results compared to familiar convolutional architectures. ViT hybrid is a slight variant of the [plain Vision Transformer](vit), by leveraging a convolutional backbone (specifically, [BiT](bit)) whose features are used as initial "tokens" for the Transformer.
|
17 |
|
|
|
8 |
---
|
9 |
|
10 |
# Vision Transformer (base-sized model) - Hybrid
|
|
|
|
|
|
|
|
|
11 |
|
12 |
The hybrid Vision Transformer (ViT) model was proposed in [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby. It's the first paper that successfully trains a Transformer encoder on ImageNet, attaining very good results compared to familiar convolutional architectures. ViT hybrid is a slight variant of the [plain Vision Transformer](vit), by leveraging a convolutional backbone (specifically, [BiT](bit)) whose features are used as initial "tokens" for the Transformer.
|
13 |
|