Hello, I have a big question.
I see you created many different versions of v3. Are they actually any different, and how should I use each version for its intended purpose?
I read a paper with a network architecture I like, implement said architecture, then train it from scratch.
The v3 models currently out are implementations of these papers:
ViT: https://arxiv.org/abs/2010.11929
ConvNext: https://arxiv.org/abs/2201.03545
SwinV2: https://arxiv.org/abs/2111.09883
They all do the same thing (take an image as input, output tag probabilities), but get there differently.
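In case it helps, here is roughly what that shared interface looks like in code. This is only a sketch: the ONNX filenames, the 448px input size, the NCHW layout, the [0, 1] scaling, and the single-output assumption are all mine, not something stated here, so check the actual model cards before copying it.

```python
# Minimal sketch: the three v3 models are interchangeable at inference time.
# Assumptions (not from this thread): the models are exported as ONNX files
# with hypothetical names, they take one square RGB image in NCHW layout
# scaled to [0, 1], and they return a single output with one score per tag.
import numpy as np
import onnxruntime as ort
from PIL import Image

MODELS = {
    "vit": "vit_v3.onnx",          # hypothetical filenames
    "convnext": "convnext_v3.onnx",
    "swinv2": "swinv2_v3.onnx",
}

def load_image(path: str, size: int = 448) -> np.ndarray:
    """Resize and normalize an image into a float32 batch of one (NCHW)."""
    img = Image.open(path).convert("RGB").resize((size, size))
    arr = np.asarray(img, dtype=np.float32) / 255.0  # assumed [0, 1] scaling
    return arr.transpose(2, 0, 1)[None]              # HWC -> NCHW, add batch dim

def tag_probabilities(model_path: str, image: np.ndarray) -> np.ndarray:
    session = ort.InferenceSession(model_path)
    input_name = session.get_inputs()[0].name
    # Every architecture exposes the same contract: image in, tag scores out.
    (probs,) = session.run(None, {input_name: image})  # assumes one output tensor
    return probs[0]

if __name__ == "__main__":
    image = load_image("example.jpg")
    for name, path in MODELS.items():
        probs = tag_probabilities(path, image)
        print(name, probs.shape)  # same shape for all three: one score per tag
```

Swapping architectures is then just a matter of pointing at a different file; nothing else in the pipeline has to change.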
Does this answer the question?
So you just know they're different, but you don't know which one is better than the other, or which one specializes in which art style?
Yeah, I have no idea whether SwinV2 works better on some specific images and ConvNext/ViT work better on others.
Going out on a limb, ConvNext might be better suited to handle rotated images, while ViT might work better at character recognition, given how transformers can model long-range dependencies and a character might be defined by a few details scattered across the entire image, but I've never run any deep tests in this sense.
Post the results if you happen to run any such test!
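If anyone wants to try the rotation idea, here is a rough outline of what such a test could look like. It assumes it lives in the same file as the sketch above (reusing `tag_probabilities` and `MODELS`), the test image names are placeholders, and the metric (cosine drift between upright and rotated predictions) is just one reasonable choice, not an established benchmark.

```python
# Rough sketch of the rotation test mentioned above: rotate each test image,
# rerun every model, and see whose tag probabilities drift the least.
# Assumes tag_probabilities() and MODELS from the earlier sketch are in scope;
# the image list below is made up.
import numpy as np
from PIL import Image

def rotated_input(path: str, angle: int, size: int = 448) -> np.ndarray:
    """Rotate, resize, and normalize an image into an NCHW batch of one."""
    img = Image.open(path).convert("RGB").rotate(angle, expand=True)
    img = img.resize((size, size))
    arr = np.asarray(img, dtype=np.float32) / 255.0
    return arr.transpose(2, 0, 1)[None]

def rotation_drift(model_path: str, image_paths: list[str]) -> float:
    """Mean cosine distance between upright and rotated predictions."""
    drifts = []
    for path in image_paths:
        base = tag_probabilities(model_path, rotated_input(path, 0))
        for angle in (90, 180, 270):
            rot = tag_probabilities(model_path, rotated_input(path, angle))
            cos = np.dot(base, rot) / (np.linalg.norm(base) * np.linalg.norm(rot))
            drifts.append(1.0 - cos)
    return float(np.mean(drifts))

test_images = ["img_001.jpg", "img_002.jpg"]  # placeholder file names
for name, path in MODELS.items():
    print(f"{name}: mean rotation drift = {rotation_drift(path, test_images):.4f}")
```

A lower drift would suggest that architecture is less sensitive to rotation; the character-recognition side of the hypothesis would need a labeled character set and per-tag accuracy instead.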