Anime Tagger caformer_b36.pexelsv0-full
Model Details
- Model Type: Multilabel Image classification / feature backbone
- Model Stats:
  - Params: 152.3M
  - FLOPs / MACs: 132.3G / 66.0G
  - Image size: train = 384 x 384, test = 384 x 384
- Dataset: animetimm/pexels-tagger-v0-w640-ws-full
- Tags Count: 18440
  - Nature (#0) Tags Count: 2384
  - People (#1) Tags Count: 1748
  - Architecture (#2) Tags Count: 2288
  - Animals (#3) Tags Count: 973
  - Emotion (#4) Tags Count: 919
  - Style (#5) Tags Count: 1841
  - Activity (#6) Tags Count: 2081
  - Time (#7) Tags Count: 185
  - Colorlight (#8) Tags Count: 637
  - Detail (#9) Tags Count: 520
  - Food (#10) Tags Count: 1221
  - Transport (#11) Tags Count: 709
  - Culture (#12) Tags Count: 1219
  - Art (#13) Tags Count: 990
  - Technology (#14) Tags Count: 725
Results
| Split | Macro@0.40 (F1/MCC/P/R) | Micro@0.40 (F1/MCC/P/R) | Macro@Best (F1/P/R) |
|---|---|---|---|
| Validation | 0.283 / 0.294 / 0.370 / 0.254 | 0.524 / 0.527 / 0.599 / 0.465 | --- |
| Test | 0.284 / 0.295 / 0.371 / 0.255 | 0.525 / 0.528 / 0.600 / 0.466 | 0.347 / 0.347 / 0.406 |
Macro/Micro@0.40 means the metrics at a fixed threshold of 0.40. Macro@Best means the mean metrics when each tag uses its own tag-level threshold, chosen to give that tag its best F1 score.
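The difference between the macro and micro columns can be illustrated with a minimal sketch. It uses the standard definitions (macro averages per-tag F1, micro pools the counts over all tags) on hand-made toy data. The function names and the handling of tags with no positives are assumptions for illustration, not the evaluation code used for this model.

```python
def f1_from_counts(tp, fp, fn):
    """F1 from raw counts; returns 0.0 when there is nothing to count."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_micro_f1(scores, labels, threshold=0.40):
    """scores/labels hold one row per image and one column per tag."""
    n_tags = len(scores[0])
    counts = [[0, 0, 0] for _ in range(n_tags)]  # [tp, fp, fn] per tag
    for score_row, label_row in zip(scores, labels):
        for j, (score, label) in enumerate(zip(score_row, label_row)):
            predicted = score >= threshold
            if predicted and label:
                counts[j][0] += 1
            elif predicted and not label:
                counts[j][1] += 1
            elif label:
                counts[j][2] += 1
    # Macro: average the per-tag F1 scores, so every tag weighs the same.
    macro = sum(f1_from_counts(*c) for c in counts) / n_tags
    # Micro: pool the counts over all tags first, so frequent tags dominate.
    tp, fp, fn = (sum(c[i] for c in counts) for i in range(3))
    micro = f1_from_counts(tp, fp, fn)
    return macro, micro


# Toy data: tag 0 is frequent and always found, tag 1 is rare and missed.
scores = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.6, 0.35]]
labels = [[1, 0], [1, 0], [1, 0], [1, 1]]
macro, micro = macro_micro_f1(scores, labels, threshold=0.40)
print(f"macro={macro:.3f} micro={micro:.3f}")  # macro=0.500 micro=0.889
```

A rare tag that is never predicted pulls the macro score down far more than the micro score, which is consistent with the macro numbers in the table being lower than the micro ones.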
Thresholds
| Category | Name | Alpha | Threshold | Micro@Thr (F1/P/R) | Macro@0.40 (F1/P/R) | Macro@Best (F1/P/R) |
|---|---|---|---|---|---|---|
| 0 | nature | 1 | 0.34 | 0.601 / 0.617 / 0.587 | 0.255 / 0.360 / 0.229 | 0.329 / 0.321 / 0.394 |
| 1 | people | 1 | 0.31 | 0.532 / 0.532 / 0.532 | 0.304 / 0.396 / 0.273 | 0.360 / 0.374 / 0.406 |
| 2 | architecture | 1 | 0.30 | 0.516 / 0.509 / 0.523 | 0.258 / 0.351 / 0.225 | 0.325 / 0.328 / 0.384 |
| 3 | animals | 1 | 0.37 | 0.624 / 0.633 / 0.615 | 0.395 / 0.444 / 0.384 | 0.460 / 0.440 / 0.548 |
| 4 | emotion | 1 | 0.29 | 0.529 / 0.519 / 0.539 | 0.193 / 0.310 / 0.160 | 0.253 / 0.269 / 0.294 |
| 5 | style | 1 | 0.29 | 0.501 / 0.488 / 0.515 | 0.211 / 0.317 / 0.178 | 0.274 / 0.281 / 0.325 |
| 6 | activity | 1 | 0.31 | 0.544 / 0.537 / 0.551 | 0.326 / 0.393 / 0.300 | 0.388 / 0.375 / 0.457 |
| 7 | time | 1 | 0.29 | 0.550 / 0.537 / 0.563 | 0.201 / 0.337 / 0.172 | 0.272 / 0.294 / 0.341 |
| 8 | colorlight | 1 | 0.24 | 0.429 / 0.413 / 0.447 | 0.185 / 0.297 / 0.155 | 0.253 / 0.259 / 0.303 |
| 9 | detail | 1 | 0.27 | 0.462 / 0.476 / 0.449 | 0.217 / 0.318 / 0.186 | 0.287 / 0.287 / 0.343 |
| 10 | food | 1 | 0.33 | 0.552 / 0.534 / 0.571 | 0.353 / 0.405 / 0.329 | 0.407 / 0.394 / 0.470 |
| 11 | transport | 1 | 0.36 | 0.554 / 0.566 / 0.542 | 0.327 / 0.385 / 0.305 | 0.394 / 0.367 / 0.481 |
| 12 | culture | 1 | 0.29 | 0.501 / 0.491 / 0.511 | 0.327 / 0.421 / 0.292 | 0.384 / 0.406 / 0.424 |
| 13 | art | 1 | 0.27 | 0.452 / 0.442 / 0.463 | 0.292 / 0.371 / 0.261 | 0.354 / 0.352 / 0.411 |
| 14 | technology | 1 | 0.30 | 0.494 / 0.502 / 0.486 | 0.350 / 0.427 / 0.321 | 0.407 / 0.419 / 0.460 |
Micro@Thr means the metrics at the category-level suggested thresholds listed in the table above. Macro@0.40 means the metrics at a fixed threshold of 0.40. Macro@Best means the metrics when each tag uses its own tag-level threshold, chosen to give that tag its best F1 score.
Tag-level thresholds can be found in selected_tags.csv.
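If you prefer the category-level thresholds over the tag-level ones, a minimal sketch of the filtering step could look like the following. The threshold values are copied from the table above. The helper name and the (name, category id, score) input format are assumptions for illustration; how you obtain each tag's category id (for example from a column in selected_tags.csv) is not shown here and should be checked against the actual file.

```python
# Category-level thresholds copied from the Thresholds table (category id -> threshold).
CATEGORY_THRESHOLDS = {
    0: 0.34, 1: 0.31, 2: 0.30, 3: 0.37, 4: 0.29,
    5: 0.29, 6: 0.31, 7: 0.29, 8: 0.24, 9: 0.27,
    10: 0.33, 11: 0.36, 12: 0.29, 13: 0.27, 14: 0.30,
}


def filter_by_category_threshold(scored_tags, thresholds=CATEGORY_THRESHOLDS):
    """Keep tags whose score reaches the threshold of their category.

    scored_tags is an iterable of (name, category_id, score) tuples;
    this input format is a hypothetical choice for the sketch.
    """
    return {
        name: score
        for name, category_id, score in scored_tags
        if score >= thresholds[category_id]
    }


# A few scores taken (rounded) from the sample outputs in this card.
scored_tags = [
    ('trees', 0, 0.849),          # nature
    ('park setting', 0, 0.155),   # nature
    ('confidence', 4, 0.484),     # emotion
    ('confident', 4, 0.252),      # emotion
    ('natural light', 8, 0.518),  # colorlight
    ('bright day', 8, 0.049),     # colorlight
]
kept = filter_by_category_threshold(scored_tags)
print(kept)
```

Category-level thresholds are coarser than tag-level ones: several low-scoring tags that pass their own tag-level threshold in the sample outputs (such as 'bright day') are dropped here.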
How to Use
We provide a sample image for the code samples; it is the sample.webp file in this repository.
Use TIMM And Torch
Install dghs-imgutils, timm, and the other required packages with the following command:
pip install 'dghs-imgutils>=0.17.0' torch huggingface_hub timm pillow pandas
After that, you can load this model with the timm library and use it for training, validation, and testing with the following code:
import json
import pandas as pd
import torch
from huggingface_hub import hf_hub_download
from imgutils.data import load_image
from imgutils.preprocess import create_torchvision_transforms
from timm import create_model
repo_id = 'animetimm/caformer_b36.pexelsv0-full'
model = create_model(f'hf-hub:{repo_id}', pretrained=True)
model.eval()
with open(hf_hub_download(repo_id=repo_id, repo_type='model', filename='preprocess.json'), 'r') as f:
    preprocessor = create_torchvision_transforms(json.load(f)['test'])
# Compose(
#     PadToSize(size=(512, 512), interpolation=bilinear, background_color=white)
#     Resize(size=384, interpolation=bicubic, max_size=None, antialias=True)
#     CenterCrop(size=[384, 384])
#     MaybeToTensor()
#     Normalize(mean=tensor([0.4850, 0.4560, 0.4060]), std=tensor([0.2290, 0.2240, 0.2250]))
# )
image = load_image('https://huggingface.co/animetimm/caformer_b36.pexelsv0-full/resolve/main/sample.webp')
input_ = preprocessor(image).unsqueeze(0)
# input_, shape: torch.Size([1, 3, 384, 384]), dtype: torch.float32
with torch.no_grad():
    output = model(input_)
    prediction = torch.sigmoid(output)[0]
# output, shape: torch.Size([1, 18440]), dtype: torch.float32
# prediction, shape: torch.Size([18440]), dtype: torch.float32
df_tags = pd.read_csv(
    hf_hub_download(repo_id=repo_id, repo_type='model', filename='selected_tags.csv'),
    keep_default_na=False
)
tags = df_tags['name']
mask = prediction.numpy() >= df_tags['best_threshold']
print(dict(zip(tags[mask].tolist(), prediction[mask].tolist())))
# {'outdoors': 0.8146495223045349,
# 'nature': 0.6344019174575806,
# 'fashion': 0.8642972111701965,
# 'urban': 0.5123583674430847,
# 'woman': 0.8229815363883972,
# 'portrait': 0.6284889578819275,
# 'summer': 0.5277450680732727,
# 'modern': 0.36526161432266235,
# 'elegant': 0.6213363409042358,
# 'lifestyle': 0.3345010578632355,
# 'greenery': 0.7012176513671875,
# 'trees': 0.8490742444992065,
# 'daylight': 0.441753625869751,
# 'natural light': 0.5182833671569824,
# 'fashionable': 0.6848584413528442,
# 'stylish': 0.5571858882904053,
# 'fashion photography': 0.3972109258174896,
# 'north america': 0.9506323337554932,
# 'city life': 0.36047741770744324,
# 'park': 0.8430405855178833,
# 'confidence': 0.48355913162231445,
# 'long hair': 0.44347313046455383,
# 'fashion model': 0.29179278016090393,
# 'urban setting': 0.257258802652359,
# 'professional': 0.3694036304950714,
# 'street style': 0.3732738196849823,
# 'confident': 0.2515304684638977,
# 'sunny day': 0.25001078844070435,
# 'urban fashion': 0.39027780294418335,
# 'outdoor portrait': 0.5426234006881714,
# 'outdoor photography': 0.09817072004079819,
# 'mexico': 0.9254768490791321,
# 'modern fashion': 0.19535645842552185,
# 'fashion shoot': 0.18118932843208313,
# 'blonde': 0.3437434434890747,
# 'portrait photography': 0.20591281354427338,
# 'urban park': 0.6445800065994263,
# 'blonde hair': 0.3630153238773346,
# 'outdoor fashion': 0.25335976481437683,
# 'canon eos': 0.07552232593297958,
# 'ciudad de mexico': 0.9318879246711731,
# 'mexico city': 0.5673938393592834,
# 'female portrait': 0.2128865271806717,
# 'cdmx': 0.8146464228630066,
# 'city park': 0.7205607295036316,
# 'beautiful woman': 0.8344939947128296,
# 'sexy': 0.8086201548576355,
# 'casual elegance': 0.39626604318618774,
# 'confident pose': 0.14261074364185333,
# 'bright day': 0.04884874075651169,
# 'modern woman': 0.2812161445617676,
# 'fashionable woman': 0.07972031831741333,
# 'stylish outfit': 0.07389499992132187,
# 'blond hair': 0.20116303861141205,
# 'white blouse': 0.5901707410812378,
# 'business casual': 0.28336629271507263,
# 'park setting': 0.1550087332725525,
# 'smiling woman': 0.37095949053764343,
# 'blonde woman': 0.16900084912776947,
# 'stylish attire': 0.03601783514022827,
# 'professional look': 0.16206032037734985,
# 'city landscape': 0.022534040734171867,
# 'blonde girl': 0.9979878664016724,
# 'tree-lined path': 0.14010290801525116,
# 'trousers': 0.11853884160518646,
# 'portrait art': 0.9996036887168884,
# 'mexico mujer': 0.9847078323364258,
# 'blond woman': 0.9997660517692566,
# 'girl sexy': 0.9878467917442322,
# 'sexy woman': 0.9642167687416077,
# 'light brown hair': 0.042917393147945404,
# 'mx': 0.9991982579231262}
Use ONNX Model For Inference
Install dghs-imgutils with the following command:
pip install 'dghs-imgutils>=0.17.0'
Use the multilabel_timm_predict function with the following code:
from imgutils.generic import multilabel_timm_predict
nature, people, architecture, animals, emotion, style, activity, time, colorlight, detail, food, transport, culture, art, technology = multilabel_timm_predict(
    'https://huggingface.co/animetimm/caformer_b36.pexelsv0-full/resolve/main/sample.webp',
    repo_id='animetimm/caformer_b36.pexelsv0-full',
    fmt=('nature', 'people', 'architecture', 'animals', 'emotion', 'style', 'activity', 'time', 'colorlight', 'detail', 'food', 'transport', 'culture', 'art', 'technology'),
)
print(nature)
# {'trees': 0.8490744829177856,
# 'park': 0.8430417776107788,
# 'outdoors': 0.8146489858627319,
# 'greenery': 0.7012186646461487,
# 'nature': 0.634401798248291,
# 'park setting': 0.15501540899276733,
# 'tree-lined path': 0.14011254906654358,
# 'outdoor photography': 0.09817290306091309}
print(people)
# {'blond woman': 0.9997661113739014,
# 'portrait art': 0.999603807926178,
# 'blonde girl': 0.9979879856109619,
# 'girl sexy': 0.9878476858139038,
# 'mexico mujer': 0.9847087860107422,
# 'sexy woman': 0.9642203450202942,
# 'beautiful woman': 0.8344992399215698,
# 'woman': 0.8229814767837524,
# 'sexy': 0.808624804019928,
# 'portrait': 0.6284893155097961,
# 'outdoor portrait': 0.5426267385482788,
# 'long hair': 0.4434756636619568,
# 'smiling woman': 0.3709729015827179,
# 'professional': 0.36940687894821167,
# 'blonde hair': 0.36302047967910767,
# 'blonde': 0.3437475562095642,
# 'fashion model': 0.2917952537536621,
# 'modern woman': 0.28122496604919434,
# 'female portrait': 0.21289172768592834,
# 'portrait photography': 0.2059168815612793,
# 'blond hair': 0.20117083191871643,
# 'blonde woman': 0.16900965571403503,
# 'confident pose': 0.14261576533317566,
# 'fashionable woman': 0.07972347736358643,
# 'light brown hair': 0.042922526597976685}
print(architecture)
# {'ciudad de mexico': 0.931889533996582,
# 'cdmx': 0.814650297164917,
# 'city park': 0.7205668687820435,
# 'urban park': 0.644584596157074,
# 'mexico city': 0.5673993825912476,
# 'urban': 0.5123592615127563,
# 'city life': 0.36047929525375366,
# 'urban setting': 0.25726157426834106,
# 'city landscape': 0.022535890340805054}
print(animals)
# {}
print(emotion)
# {'confidence': 0.48356157541275024, 'confident': 0.2515334188938141}
print(style)
# {'fashion': 0.8642969727516174,
# 'fashionable': 0.6848594546318054,
# 'elegant': 0.6213374733924866,
# 'white blouse': 0.5901826024055481,
# 'stylish': 0.5571873188018799,
# 'fashion photography': 0.39721226692199707,
# 'casual elegance': 0.3962761163711548,
# 'urban fashion': 0.39028140902519226,
# 'street style': 0.37327611446380615,
# 'modern': 0.3652629852294922,
# 'business casual': 0.28337597846984863,
# 'outdoor fashion': 0.25336480140686035,
# 'modern fashion': 0.19535967707633972,
# 'fashion shoot': 0.18119221925735474,
# 'professional look': 0.16206982731819153,
# 'trousers': 0.1185469925403595,
# 'stylish outfit': 0.07389819622039795,
# 'stylish attire': 0.036020517349243164}
print(activity)
# {'lifestyle': 0.3345027565956116}
print(time)
# {'summer': 0.5277461409568787,
# 'daylight': 0.4417559504508972,
# 'sunny day': 0.25001394748687744}
print(colorlight)
# {'natural light': 0.5182850956916809, 'bright day': 0.04885092377662659}
print(detail)
# {}
print(food)
# {}
print(transport)
# {}
print(culture)
# {'mx': 0.9991983771324158,
# 'north america': 0.9506326913833618,
# 'mexico': 0.9254778623580933}
print(art)
# {}
print(technology)
# {'canon eos': 0.07552513480186462}
For further information, see the documentation of the multilabel_timm_predict function.
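When a single ranked list is more convenient than fifteen separate dictionaries, the per-category results can be merged. The sketch below assumes each category result is a plain dict mapping tag name to score, as the printed outputs above suggest. The helper name flatten_categories and the hand-built results dict (with rounded scores) are hypothetical and only illustrate the idea.

```python
def flatten_categories(results):
    """Merge {category: {tag: score}} into [(tag, category, score)], highest score first."""
    rows = [
        (tag, category, score)
        for category, tags in results.items()
        for tag, score in tags.items()
    ]
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Hand-built stand-in for a few of the dictionaries returned per category.
results = {
    'nature': {'trees': 0.849, 'park': 0.843},
    'animals': {},  # empty categories simply contribute no rows
    'emotion': {'confidence': 0.484},
}
ranked = flatten_categories(results)
for tag, category, score in ranked:
    print(f'{score:.3f}  {category:10s} {tag}')
```

With the real return values, the same dict could be built as, for example, {'nature': nature, 'people': people, ...} from the variables unpacked above.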
Citation
@misc{caformer_b36_pexelsv0_full,
title = {Anime Tagger caformer_b36.pexelsv0-full},
author = {narugo1992 and Deep Generative anime Hobbyist Syndicate (animetimm)},
year = {2025},
howpublished = {\url{https://huggingface.co/animetimm/caformer_b36.pexelsv0-full}},
note = {A large-scale anime-style image classification model based on caformer_b36 architecture for multi-label tagging with 18440 tags, trained on anime dataset pexelsv0-full (\url{https://huggingface.co/datasets/animetimm/pexels-tagger-v0-w640-ws-full}). Model parameters: 152.3M, FLOPs: 132.3G, input resolution: 384×384.},
license = {gpl-3.0}
}
Model tree for animetimm/caformer_b36.pexelsv0-full
Base model
timm/caformer_b36.sail_in22k_ft_in1k_384