Tiny Test Models
I've recently trained a set of tiny test models (https://huggingface.co/collections/timm/timm-tiny-test-models-66f18bd70518277591a86cef) on ImageNet-1k covering several of the most popular architecture families.
It takes ~10 seconds to download all 13 pretrained weights and run one step of inference on each w/ a ratty old CPU (but a fast internet connection). That allows quick verification of model functionality, from pretrained weight download through every API feature of the full-size models. They differ from full-size models in that they have a lower default resolution, typically 1 block per stage, and very narrow widths.
This is all well and good, but would anyone have any interest in these outside of tests? Well, this is where you come in. These are some of the smallest models that are decently trained on ImageNet-1k. They use a recent training recipe adapted from MobileNet-v4 (Conv-Small), a good recipe for squeezing accuracy from small models. The top-1 accuracies are by no means impressive, but the models do fine-tune well on small datasets, and I imagine they could work quite well for some reduced-resource (embedded) applications or as part of reinforcement learning vision policies.
Let me know if you find any good applications for them outside of tests. Here's a summary of the model results. They were trained natively at 160x160, and most models see a small pickup at 192x192 by leveraging the train-test resolution discrepancy.
ImageNet Accuracy
model | img_size | top1 | top5 | param_count (M) | norm |
---|---|---|---|---|---|
test_vit3.r160_in1k | 192 | 58.116 | 81.876 | 0.93 | LN |
test_vit3.r160_in1k | 160 | 56.894 | 80.748 | 0.93 | LN |
test_convnext3.r160_in1k | 192 | 54.558 | 79.356 | 0.47 | LN |
test_convnext2.r160_in1k | 192 | 53.62 | 78.636 | 0.48 | LN |
test_convnext2.r160_in1k | 160 | 53.51 | 78.526 | 0.48 | LN |
test_convnext3.r160_in1k | 160 | 53.328 | 78.318 | 0.47 | LN |
test_convnext.r160_in1k | 192 | 48.532 | 74.944 | 0.27 | LN |
test_nfnet.r160_in1k | 192 | 48.298 | 73.446 | 0.38 | WS |
test_convnext.r160_in1k | 160 | 47.764 | 74.152 | 0.27 | LN |
test_nfnet.r160_in1k | 160 | 47.616 | 72.898 | 0.38 | WS |
test_efficientnet.r160_in1k | 192 | 47.164 | 71.706 | 0.36 | BN |
test_efficientnet_evos.r160_in1k | 192 | 46.924 | 71.53 | 0.36 | EVOS |
test_byobnet.r160_in1k | 192 | 46.688 | 71.668 | 0.46 | BN |
test_efficientnet_evos.r160_in1k | 160 | 46.498 | 71.006 | 0.36 | EVOS |
test_efficientnet.r160_in1k | 160 | 46.454 | 71.014 | 0.36 | BN |
test_byobnet.r160_in1k | 160 | 45.852 | 70.996 | 0.46 | BN |
test_efficientnet_ln.r160_in1k | 192 | 44.538 | 69.974 | 0.36 | LN |
test_efficientnet_gn.r160_in1k | 192 | 44.448 | 69.75 | 0.36 | GN |
test_efficientnet_ln.r160_in1k | 160 | 43.916 | 69.404 | 0.36 | LN |
test_efficientnet_gn.r160_in1k | 160 | 43.88 | 69.162 | 0.36 | GN |
test_vit2.r160_in1k | 192 | 43.454 | 69.798 | 0.46 | LN |
test_resnet.r160_in1k | 192 | 42.376 | 68.744 | 0.47 | BN |
test_vit2.r160_in1k | 160 | 42.232 | 68.982 | 0.46 | LN |
test_vit.r160_in1k | 192 | 41.984 | 68.64 | 0.37 | LN |
test_resnet.r160_in1k | 160 | 41.578 | 67.956 | 0.47 | BN |
test_vit.r160_in1k | 160 | 40.946 | 67.362 | 0.37 | LN |
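To see how large the 192x192 pickup is per model, the top-1 numbers from the table above can be compared directly. This is just arithmetic on the table; the dictionary layout is my own.

```python
# top-1 accuracy (%) copied from the table above: model -> (160x160, 192x192)
TOP1 = {
    "test_vit3": (56.894, 58.116),
    "test_convnext3": (53.328, 54.558),
    "test_convnext2": (53.51, 53.62),
    "test_convnext": (47.764, 48.532),
    "test_nfnet": (47.616, 48.298),
    "test_efficientnet": (46.454, 47.164),
    "test_efficientnet_evos": (46.498, 46.924),
    "test_byobnet": (45.852, 46.688),
    "test_efficientnet_ln": (43.916, 44.538),
    "test_efficientnet_gn": (43.88, 44.448),
    "test_vit2": (42.232, 43.454),
    "test_resnet": (41.578, 42.376),
    "test_vit": (40.946, 41.984),
}

# Gain in top-1 points when evaluating at 192x192 instead of 160x160.
gains = {name: round(hi - lo, 3) for name, (lo, hi) in TOP1.items()}

for name, gain in sorted(gains.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:24s} +{gain:.3f}")
```

In these numbers the gain ranges from about 0.1 (test_convnext2) to about 1.2 points (test_convnext3, test_vit3, test_vit2).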
Throughput @ 160x160 w/ torch.compile, mode='max-autotune', PyTorch 2.4.1, RTX4090
model | infer_samples_per_sec | train_samples_per_sec |
---|---|---|
test_vit | 300560.67 | 87518.73 |
test_vit2 | 254514.84 | 70132.93 |
test_convnext | 216367.11 | 50905.24 |
test_convnext3 | 200783.46 | 49074.48 |
test_byobnet | 199426.55 | 49487.12 |
test_convnext2 | 196727.0 | 48119.64 |
test_efficientnet | 181404.48 | 43546.96 |
test_efficientnet_ln | 173432.33 | 33280.66 |
test_efficientnet_evos | 169177.92 | 39684.92 |
test_vit3 | 163786.54 | 44318.45 |
test_efficientnet_gn | 158421.02 | 44226.92 |
test_resnet | 153289.49 | 28341.52 |
test_nfnet | 80837.46 | 16907.38 |
Throughput @ 160x160 w/ torch.compile, mode='reduce-overhead', PyTorch 2.4.1, RTX4090
model | infer_samples_per_sec | train_samples_per_sec |
---|---|---|
test_vit | 274007.61 | 86652.08 |
test_vit2 | 231651.39 | 68993.91 |
test_byobnet | 197767.6 | 48633.6 |
test_convnext | 184134.55 | 46879.08 |
test_efficientnet | 170239.18 | 42812.1 |
test_efficientnet_ln | 166604.2 | 31946.88 |
test_efficientnet_evos | 163667.41 | 42222.59 |
test_vit3 | 161792.13 | 45354.67 |
test_convnext2 | 160601.75 | 43187.22 |
test_convnext3 | 160494.65 | 44304.95 |
test_efficientnet_gn | 155447.85 | 42003.28 |
test_resnet | 150790.14 | 27286.95 |
test_nfnet | 78314.21 | 15282.57 |
Throughput @ 160x160 w/ torch.compile, mode='default', PyTorch 2.4.1, RTX4090
Output of python benchmark.py --amp --model 'test_*' --fast-norm --torchcompile:
model | infer_samples_per_sec | train_samples_per_sec |
---|---|---|
test_efficientnet | 192256.16 | 30972.05 |
test_efficientnet_ln | 186221.3 | 28402.3 |
test_efficientnet_evos | 180578.68 | 32651.59 |
test_convnext3 | 179679.28 | 34998.59 |
test_byobnet | 177707.5 | 32309.83 |
test_efficientnet_gn | 169962.75 | 31801.23 |
test_convnext2 | 166527.39 | 37168.73 |
test_resnet | 157618.18 | 25159.21 |
test_vit | 146050.34 | 38321.33 |
test_convnext | 138397.51 | 27930.18 |
test_vit2 | 116394.63 | 26856.88 |
test_vit3 | 89157.52 | 21656.06 |
test_nfnet | 71030.73 | 14720.19 |
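A rough way to read the three tables together is to divide the 'max-autotune' inference throughput by the 'default' one for each model. The sketch below does only that. It assumes the runs are otherwise comparable (same flags, batch size, and hardware), which the tables suggest but only the 'default' table states its command.

```python
# Inference throughput (samples/sec) copied from the tables above.
MAX_AUTOTUNE = {
    "test_vit": 300560.67, "test_vit2": 254514.84, "test_convnext": 216367.11,
    "test_convnext3": 200783.46, "test_byobnet": 199426.55,
    "test_convnext2": 196727.0, "test_efficientnet": 181404.48,
    "test_efficientnet_ln": 173432.33, "test_efficientnet_evos": 169177.92,
    "test_vit3": 163786.54, "test_efficientnet_gn": 158421.02,
    "test_resnet": 153289.49, "test_nfnet": 80837.46,
}
DEFAULT = {
    "test_efficientnet": 192256.16, "test_efficientnet_ln": 186221.3,
    "test_efficientnet_evos": 180578.68, "test_convnext3": 179679.28,
    "test_byobnet": 177707.5, "test_efficientnet_gn": 169962.75,
    "test_convnext2": 166527.39, "test_resnet": 157618.18,
    "test_vit": 146050.34, "test_convnext": 138397.51,
    "test_vit2": 116394.63, "test_vit3": 89157.52, "test_nfnet": 71030.73,
}

# Ratio > 1 means 'max-autotune' was faster than 'default' for inference.
speedup = {name: MAX_AUTOTUNE[name] / DEFAULT[name] for name in DEFAULT}

for name, ratio in sorted(speedup.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:24s} {ratio:.2f}x")
```

On these numbers the three ViTs gain the most (roughly 1.8x to 2.2x), while the EfficientNet variants and the ResNet come out slightly below 1x.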
Details
The model names above give some hint as to what they are, but I did explore some 'unique' architecture variations that are worth mentioning for any who might try them.
test_byobnet
A ByobNet (mix of EfficientNet / ResNet / DarkNet blocks)
- stage blocks = 1 * EdgeResidual (FusedMBConv), 1 * DarkBlock, 1 * ResNeXt Basic (group_size=32), 1 * ResNeXt Bottle (group_size=64)
- channels = 32, 64, 128, 256
- se_ratio = .25 (active in all blocks)
- act_layer = ReLU
- norm_layer = BatchNorm
test_convnext
A ConvNeXt
- stage depths = 1, 2, 4, 2
- channels = 24, 32, 48, 64
- DW kernel_size = 7, 7, 7, 7
- act_layer = GELU (tanh approx)
- norm_layer = LayerNorm
test_convnext2
A ConvNeXt
- stage depths = 1, 1, 1, 1
- channels = 32, 64, 96, 128
- DW kernel_size = 7, 7, 7, 7
- act_layer = GELU (tanh approx)
- norm_layer = LayerNorm
test_convnext3
A ConvNeXt w/ SiLU and varying kernel size
- stage depths = 1, 1, 1, 1
- channels = 32, 64, 96, 128
- DW kernel_size = 7, 5, 5, 3
- act_layer = SiLU
- norm_layer = LayerNorm
test_efficientnet
An EfficientNet w/ V2 block mix
- stage blocks = 1 * ConvBnAct, 2 * EdgeResidual (FusedMBConv), 2 * InvertedResidual (MBConv) w/ SE
- channels = 16, 24, 32, 48, 64
- kernel_size = 3x3 for all
- expansion = 4x for all
- stem_size = 24
- act_layer = SiLU
- norm_layer = BatchNorm
test_efficientnet_gn
An EfficientNet w/ V2 block mix and GroupNorm (group_size=8)
- See above but with norm_layer=GroupNorm
test_efficientnet_ln
An EfficientNet w/ V2 block mix and LayerNorm
- See above but with norm_layer=LayerNorm
test_efficientnet_evos
An EfficientNet w/ V2 block mix and EvoNorm-S
- See above but with EvoNormS for norm + act
test_nfnet
A NormFree Net:
- 4-stages, 1 block per stage
- channels = 32, 64, 96, 128
- group_size = 8
- bottle_ratio = 0.25
- se_ratio = 0.25
- act_layer = SiLU
- norm_layer = no norm, Scaled Weight Standardization is part of Convolution
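For readers unfamiliar with the last bullet, here is a simplified PyTorch sketch of a convolution with Scaled Weight Standardization built in. It is not timm's implementation: the class name `ScaledWSConv2d` is made up, and `gamma=1.0` is a placeholder (in NF-Nets the constant depends on the activation function).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ScaledWSConv2d(nn.Conv2d):
    """Conv2d whose weights are standardized on every forward pass (sketch)."""

    def __init__(self, *args, gamma=1.0, eps=1e-6, **kwargs):
        super().__init__(*args, **kwargs)
        # Learnable per-output-channel gain, initialised to 1.
        self.gain = nn.Parameter(torch.ones(self.out_channels, 1, 1, 1))
        fan_in = self.weight[0].numel()
        # gamma is a placeholder here; scale = gamma / sqrt(fan_in).
        self.scale = gamma * fan_in ** -0.5
        self.eps = eps

    def standardized_weight(self):
        # Zero mean / unit variance over each filter's fan-in, then rescale.
        mean = self.weight.mean(dim=(1, 2, 3), keepdim=True)
        var = self.weight.var(dim=(1, 2, 3), keepdim=True, unbiased=False)
        w = (self.weight - mean) / torch.sqrt(var + self.eps)
        return w * (self.gain * self.scale)

    def forward(self, x):
        return F.conv2d(x, self.standardized_weight(), self.bias,
                        self.stride, self.padding, self.dilation, self.groups)


# groups=4 on 32 input channels gives 8 channels per group.
conv = ScaledWSConv2d(32, 64, kernel_size=3, padding=1, groups=4)
y = conv(torch.randn(2, 32, 16, 16))
print(y.shape)
```

Because the standardization is applied to the weights rather than the activations, no separate norm layer is needed after the convolution.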
test_resnet
A ResNet w/ mixed blocks:
- stage blocks = 1 * BasicBlock, 1 * BasicBlock, 1 * BottleNeck, 1 * BasicBlock
- channels = 32, 48, 48, 96
- deep 3x3 stem (aka ResNet-D)
- avg pool in downsample (aka ResNet-D)
- stem_width = 16
- act_layer = ReLU
- norm_layer = BatchNorm
test_vit
A vanilla ViT w/ class token:
- patch_size = 16
- embed_dim = 64
- num_heads = 2
- mlp_ratio = 3
- depth = 6
- act_layer = GELU
- norm_layer = LayerNorm
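As a sanity check, the hyperparameters above can be turned into a parameter count by hand. The sketch assumes a standard ViT layout that the list does not spell out: qkv and MLP biases enabled, a learned position embedding that includes the class token, a final LayerNorm, a 1000-class linear head, and the native 160x160 input size. `num_heads` does not affect the count.

```python
def vit_param_count(img_size=160, patch_size=16, in_chans=3, embed_dim=64,
                    depth=6, mlp_ratio=3, num_classes=1000):
    """Count parameters of a plain class-token ViT (weights + biases)."""
    num_patches = (img_size // patch_size) ** 2
    hidden = embed_dim * mlp_ratio

    # Patch embedding is a conv with kernel = stride = patch_size.
    patch_embed = in_chans * patch_size ** 2 * embed_dim + embed_dim
    cls_token = embed_dim
    pos_embed = (num_patches + 1) * embed_dim  # +1 for the class token

    qkv = embed_dim * 3 * embed_dim + 3 * embed_dim
    proj = embed_dim * embed_dim + embed_dim
    mlp = (embed_dim * hidden + hidden) + (hidden * embed_dim + embed_dim)
    norms = 2 * 2 * embed_dim  # two LayerNorms per block, weight + bias
    block = qkv + proj + mlp + norms

    final_norm = 2 * embed_dim
    head = embed_dim * num_classes + num_classes
    return (patch_embed + cls_token + pos_embed
            + depth * block + final_norm + head)


total = vit_param_count()
print(total, f"{total / 1e6:.2f}M")  # 371240 0.37M
```

Under these assumptions the result agrees with the 0.37 listed for test_vit in the accuracy table, and the classifier head alone accounts for 65,000 of the parameters.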
test_vit2
A ViT w/ global avg pool, 1 reg token, layer-scale (like timm SBB ViTs https://huggingface.co/collections/timm/searching-for-better-vit-baselines-663eb74f64f847d2f35a9c19):
- patch_size = 16
- embed_dim = 64
- num_heads = 2
- mlp_ratio = 3
- depth = 8
- act_layer = GELU
- norm_layer = LayerNorm
test_vit3
A ViT w/ attention-pool, 1 reg token, layer-scale.
- patch_size = 16
- embed_dim = 96
- num_heads = 3
- mlp_ratio = 2
- depth = 9
- act_layer = GELU
- norm_layer = LayerNorm