Tiny Test Models

Community Article Published October 2, 2024

I've recently trained a set of tiny test models (https://huggingface.co/collections/timm/timm-tiny-test-models-66f18bd70518277591a86cef) on ImageNet-1k covering several of the most popular architecture families.

It takes ~10 seconds to download all 13 pretrained weights and run one step of inference on each w/ a ratty old CPU (but a fast internet connection). They allow quick verification of model functionality, from pretrained weight download through every API feature of the full-size models. They differ from full-size models in that they have a lower default resolution, typically 1 block per stage, and very narrow widths.
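A minimal sketch of that kind of smoke test, assuming `torch` and a `timm` release recent enough to include the `test_*` models are installed. The helper name `smoke_test` is hypothetical (it is not part of timm); the model names are the 13 listed in the results table.

```python
# The 13 pretrained test models, named as in the results table.
TEST_MODELS = (
    "test_byobnet.r160_in1k",
    "test_convnext.r160_in1k",
    "test_convnext2.r160_in1k",
    "test_convnext3.r160_in1k",
    "test_efficientnet.r160_in1k",
    "test_efficientnet_evos.r160_in1k",
    "test_efficientnet_gn.r160_in1k",
    "test_efficientnet_ln.r160_in1k",
    "test_nfnet.r160_in1k",
    "test_resnet.r160_in1k",
    "test_vit.r160_in1k",
    "test_vit2.r160_in1k",
    "test_vit3.r160_in1k",
)


def smoke_test(model_names=TEST_MODELS):
    """Download each model and run one forward pass on random input.

    Hypothetical helper, not part of timm. The imports are local so this
    module can be loaded even where torch/timm are not installed.
    """
    import torch
    import timm

    output_shapes = {}
    for name in model_names:
        # pretrained=True fetches the weights from the Hugging Face Hub
        model = timm.create_model(name, pretrained=True).eval()
        # The pretrained config carries the default (C, H, W) input size
        cfg = timm.data.resolve_model_data_config(model)
        x = torch.randn(1, *cfg["input_size"])
        with torch.no_grad():
            output_shapes[name] = tuple(model(x).shape)
    return output_shapes
```

Calling `smoke_test()` needs network access on the first run; since these are ImageNet-1k classifiers, each returned shape should have 1000 as its last dimension.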

This is all well and good, but would anyone have any interest in these outside of tests? Well, this is where you come in. These are some of the smallest models that are decently trained on ImageNet-1k. They use a recent training recipe adapted from MobileNet-v4 (Conv-Small), a good recipe for squeezing accuracy from small models. The top-1 accuracies are by no means impressive, but the models do fine-tune well on small datasets, and I imagine they could work quite well for some reduced-resource (embedded) applications or as part of reinforcement learning vision policies.

Let me know if you find any good applications for them outside of tests. Here's a summary of the model results. The models were trained natively at 160x160, and most see a small pickup when evaluated at 192x192 by leveraging the train-test resolution discrepancy.

ImageNet Accuracy

model	img_size	top1	top5	param_count (M)	norm
test_vit3.r160_in1k	192	58.116	81.876	0.93	LN
test_vit3.r160_in1k	160	56.894	80.748	0.93	LN
test_convnext3.r160_in1k	192	54.558	79.356	0.47	LN
test_convnext2.r160_in1k	192	53.62	78.636	0.48	LN
test_convnext2.r160_in1k	160	53.51	78.526	0.48	LN
test_convnext3.r160_in1k	160	53.328	78.318	0.47	LN
test_convnext.r160_in1k	192	48.532	74.944	0.27	LN
test_nfnet.r160_in1k	192	48.298	73.446	0.38	WS
test_convnext.r160_in1k	160	47.764	74.152	0.27	LN
test_nfnet.r160_in1k	160	47.616	72.898	0.38	WS
test_efficientnet.r160_in1k	192	47.164	71.706	0.36	BN
test_efficientnet_evos.r160_in1k	192	46.924	71.53	0.36	EVOS
test_byobnet.r160_in1k	192	46.688	71.668	0.46	BN
test_efficientnet_evos.r160_in1k	160	46.498	71.006	0.36	EVOS
test_efficientnet.r160_in1k	160	46.454	71.014	0.36	BN
test_byobnet.r160_in1k	160	45.852	70.996	0.46	BN
test_efficientnet_ln.r160_in1k	192	44.538	69.974	0.36	LN
test_efficientnet_gn.r160_in1k	192	44.448	69.75	0.36	GN
test_efficientnet_ln.r160_in1k	160	43.916	69.404	0.36	LN
test_efficientnet_gn.r160_in1k	160	43.88	69.162	0.36	GN
test_vit2.r160_in1k	192	43.454	69.798	0.46	LN
test_resnet.r160_in1k	192	42.376	68.744	0.47	BN
test_vit2.r160_in1k	160	42.232	68.982	0.46	LN
test_vit.r160_in1k	192	41.984	68.64	0.37	LN
test_resnet.r160_in1k	160	41.578	67.956	0.47	BN
test_vit.r160_in1k	160	40.946	67.362	0.37	LN
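To make the 160 to 192 pickup concrete, the sketch below subtracts the two top-1 columns per model. The numbers are copied from the table above; the function name `top1_gain` is just for illustration.

```python
# (top-1 @ 160x160, top-1 @ 192x192), copied from the accuracy table
TOP1 = {
    "test_vit3": (56.894, 58.116),
    "test_convnext3": (53.328, 54.558),
    "test_convnext2": (53.51, 53.62),
    "test_convnext": (47.764, 48.532),
    "test_nfnet": (47.616, 48.298),
    "test_efficientnet": (46.454, 47.164),
    "test_efficientnet_evos": (46.498, 46.924),
    "test_byobnet": (45.852, 46.688),
    "test_efficientnet_ln": (43.916, 44.538),
    "test_efficientnet_gn": (43.88, 44.448),
    "test_vit2": (42.232, 43.454),
    "test_resnet": (41.578, 42.376),
    "test_vit": (40.946, 41.984),
}


def top1_gain(results):
    """Return the top-1 change (in points) from 160x160 to 192x192 eval."""
    return {
        name: round(at_192 - at_160, 3)
        for name, (at_160, at_192) in results.items()
    }


gains = top1_gain(TOP1)
# Largest pickup first
for name, gain in sorted(gains.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:24s} {gain:+.3f}")
```

On these numbers every model gains in top-1 at 192x192, from roughly 0.1 points (test_convnext2) to a bit over 1.2 points (test_convnext3, test_vit3, test_vit2).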

Throughput @ 160x160 w/ torch.compile, mode='max-autotune', PyTorch 2.4.1, RTX4090

model	infer_samples_per_sec	train_samples_per_sec
test_vit	300560.67	87518.73
test_vit2	254514.84	70132.93
test_convnext	216367.11	50905.24
test_convnext3	200783.46	49074.48
test_byobnet	199426.55	49487.12
test_convnext2	196727.0	48119.64
test_efficientnet	181404.48	43546.96
test_efficientnet_ln	173432.33	33280.66
test_efficientnet_evos	169177.92	39684.92
test_vit3	163786.54	44318.45
test_efficientnet_gn	158421.02	44226.92
test_resnet	153289.49	28341.52
test_nfnet	80837.46	16907.38

Throughput @ 160x160 w/ torch.compile, mode='reduce-overhead', PyTorch 2.4.1, RTX4090

model	infer_samples_per_sec	train_samples_per_sec
test_vit	274007.61	86652.08
test_vit2	231651.39	68993.91
test_byobnet	197767.6	48633.6
test_convnext	184134.55	46879.08
test_efficientnet	170239.18	42812.1
test_efficientnet_ln	166604.2	31946.88
test_efficientnet_evos	163667.41	42222.59
test_vit3	161792.13	45354.67
test_convnext2	160601.75	43187.22
test_convnext3	160494.65	44304.95
test_efficientnet_gn	155447.85	42003.28
test_resnet	150790.14	27286.95
test_nfnet	78314.21	15282.57

Throughput @ 160x160 w/ torch.compile, mode='default', PyTorch 2.4.1, RTX4090

Output of python benchmark.py --amp --model 'test_*' --fast-norm --torchcompile:

model	infer_samples_per_sec	train_samples_per_sec
test_efficientnet	192256.16	30972.05
test_efficientnet_ln	186221.3	28402.3
test_efficientnet_evos	180578.68	32651.59
test_convnext3	179679.28	34998.59
test_byobnet	177707.5	32309.83
test_efficientnet_gn	169962.75	31801.23
test_convnext2	166527.39	37168.73
test_resnet	157618.18	25159.21
test_vit	146050.34	38321.33
test_convnext	138397.51	27930.18
test_vit2	116394.63	26856.88
test_vit3	89157.52	21656.06
test_nfnet	71030.73	14720.19
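The three tables are easier to compare as ratios. The sketch below divides inference throughput under each compile mode by the mode='default' number for a few rows copied from above; `speedup_vs_default` is an illustrative name, and the figures are single reported runs, so small differences should not be over-read.

```python
# infer_samples_per_sec copied from the three throughput tables
INFER = {
    "test_vit": {
        "max-autotune": 300560.67, "reduce-overhead": 274007.61, "default": 146050.34,
    },
    "test_vit3": {
        "max-autotune": 163786.54, "reduce-overhead": 161792.13, "default": 89157.52,
    },
    "test_efficientnet": {
        "max-autotune": 181404.48, "reduce-overhead": 170239.18, "default": 192256.16,
    },
    "test_nfnet": {
        "max-autotune": 80837.46, "reduce-overhead": 78314.21, "default": 71030.73,
    },
}


def speedup_vs_default(rows, mode):
    """Ratio of throughput under `mode` to throughput under mode='default'."""
    return {name: r[mode] / r["default"] for name, r in rows.items()}


for mode in ("max-autotune", "reduce-overhead"):
    for name, ratio in speedup_vs_default(INFER, mode).items():
        print(f"{mode:16s} {name:20s} {ratio:.2f}x")
```

For these rows the ViTs benefit most from the non-default modes (test_vit roughly doubles its inference throughput under max-autotune), while test_efficientnet is slightly faster with the default mode.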

Details

The model names above give some hint as to what they are, but I did explore some 'unique' architecture variations that are worth mentioning for anyone who might try them.

test_byobnet

A ByobNet (mix of EfficientNet / ResNet / DarkNet blocks)

stage blocks = 1 * EdgeResidual (FusedMBConv), 1 * DarkBlock, 1 * ResNeXt Basic (group_size=32), 1 * ResNeXt Bottle (group_size=64)
channels = 32, 64, 128, 256
se_ratio = .25 (active in all blocks)
act_layer = ReLU
norm_layer = BatchNorm

test_convnext

A ConvNeXt

stage depths = 1, 2, 4, 2
channels = 24, 32, 48, 64
DW kernel_size = 7, 7, 7, 7
act_layer = GELU (tanh approx)
norm_layer = LayerNorm

test_convnext2

A ConvNeXt

stage depths = 1, 1, 1, 1
channels = 32, 64, 96, 128
DW kernel_size = 7, 7, 7, 7
act_layer = GELU (tanh approx)
norm_layer = LayerNorm

test_convnext3

A ConvNeXt w/ SiLU and varying kernel size

stage depths = 1, 1, 1, 1
channels = 32, 64, 96, 128
DW kernel_size = 7, 5, 5, 3
act_layer = SiLU
norm_layer = LayerNorm

test_efficientnet

An EfficientNet w/ V2 block mix

stage blocks = 1 * ConvBnAct, 2 * EdgeResidual (FusedMBConv), 2 * InvertedResidual (MBConv) w/ SE
channels = 16, 24, 32, 48, 64
kernel_size = 3x3 for all
expansion = 4x for all
stem_size = 24
act_layer = SiLU
norm_layer = BatchNorm

test_efficientnet_gn

An EfficientNet w/ V2 block mix and GroupNorm (group_size=8)

See above but with norm_layer=GroupNorm

test_efficientnet_ln

An EfficientNet w/ V2 block mix and LayerNorm

See above but with norm_layer=LayerNorm

test_efficientnet_evos

An EfficientNet w/ V2 block mix and EvoNorm-S

See above but with EvoNormS for norm + act

test_nfnet

A NormFree Net:

4-stages, 1 block per stage
channels = 32, 64, 96, 128
group_size = 8
bottle_ratio = 0.25
se_ratio = 0.25
act_layer = SiLU
norm_layer = no norm, Scaled Weight Standardization is part of Convolution

test_resnet

A ResNet w/ mixed blocks:

stage blocks = 1 * BasicBlock, 1 * BasicBlock, 1 * BottleNeck, 1 * BasicBlock
channels = 32, 48, 48, 96
deep 3x3 stem (aka ResNet-D)
avg pool in downsample (aka ResNet-D)
stem_width = 16
act_layer = ReLU
norm_layer = BatchNorm

test_vit

A vanilla ViT w/ class token:

patch_size = 16
embed_dim = 64
num_heads = 2
mlp_ratio = 3
depth = 6
act_layer = GELU
norm_layer = LayerNorm
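As a sanity check on how small this is, the parameter count can be worked out by hand from the config above. This is a sketch assuming the usual ViT parameter layout (biased patch-embed conv, QKV, projection and MLP layers, a learned position embedding that covers the class token, no layer-scale), the native 160x160 input, and a 1000-class head; it is not taken from the timm source. Note that num_heads does not affect the count.

```python
def vit_param_count(img_size=160, patch_size=16, in_chans=3, embed_dim=64,
                    depth=6, mlp_ratio=3, num_classes=1000):
    """Count parameters of a vanilla class-token ViT (assumed layout)."""
    num_patches = (img_size // patch_size) ** 2
    hidden = embed_dim * mlp_ratio

    # Patch embedding is a conv with kernel = stride = patch_size
    patch_embed = in_chans * patch_size ** 2 * embed_dim + embed_dim
    cls_token = embed_dim
    pos_embed = (num_patches + 1) * embed_dim  # patches + class token

    layer_norm = 2 * embed_dim  # weight + bias
    attn = (embed_dim * 3 * embed_dim + 3 * embed_dim) \
        + (embed_dim * embed_dim + embed_dim)        # qkv + output proj
    mlp = (embed_dim * hidden + hidden) \
        + (hidden * embed_dim + embed_dim)           # fc1 + fc2
    block = 2 * layer_norm + attn + mlp

    head = embed_dim * num_classes + num_classes
    # One final LayerNorm before the classifier head
    return patch_embed + cls_token + pos_embed + depth * block + layer_norm + head


total = vit_param_count()
print(f"{total:,} params (~{total / 1e6:.2f}M)")
# prints: 371,240 params (~0.37M)
```

That agrees with the 0.37 listed for test_vit in the param_count column of the accuracy table, which suggests that column is in millions.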

test_vit2

A ViT w/ global avg pool, 1 reg token, layer-scale (like timm SBB ViTs https://huggingface.co/collections/timm/searching-for-better-vit-baselines-663eb74f64f847d2f35a9c19):

patch_size = 16
embed_dim = 64
num_heads = 2
mlp_ratio = 3
depth = 8
act_layer = GELU
norm_layer = LayerNorm

test_vit3

A ViT w/ attention-pool, 1 reg token, layer-scale:

patch_size = 16
embed_dim = 96
num_heads = 3
mlp_ratio = 2
depth = 9
act_layer = GELU
norm_layer = LayerNorm
