Why I didn't include quants, and why I used this implementation
A lot of people asked me this on Discord and Reddit, so here it is in a nutshell:
Vision support leaves a lot to be desired. There are numerous issues with various popular frontends. While some work, or partially work, the provided code delivers the best results in terms of accuracy, compatibility, and bulk inference (you can run inference on 999 images with a single command, though it will obviously take some time).
The code also offers ease of use and flexibility: just put your prompt in 'prompts.txt' and you're set. Want to run several prompts? Simply add another line; each line is treated as a separate prompt.
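For illustration only, here is a minimal sketch of that prompt-file convention: one prompt per line in `prompts.txt`, with every prompt applied to every image in a folder for bulk inference. The folder name `images/` and the commented-out `run_vision_model()` call are assumptions for the example, not the actual script or API.

```python
from pathlib import Path

# Read one prompt per line from prompts.txt; each non-empty line is a separate prompt.
prompts = [
    line.strip()
    for line in Path("prompts.txt").read_text(encoding="utf-8").splitlines()
    if line.strip()
]

# Collect every image in an assumed images/ folder for bulk inference.
images = sorted(Path("images").glob("*.jpg")) + sorted(Path("images").glob("*.png"))

for image_path in images:
    for prompt in prompts:
        # run_vision_model() is a placeholder for whatever inference call the
        # actual script makes; it is not a real API.
        # result = run_vision_model(image=image_path, prompt=prompt)
        print(f"[{image_path.name}] prompt: {prompt!r}")
```

With, say, three lines in `prompts.txt` and 999 images in the folder, this loop runs every prompt against every image in a single pass.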
Do I think this implementation is awesome? Hell no, I really dislike it. But it's what works best. Even though vision models have existed for years, support has been consistently poor across the board. Now, in 2025, we're likely to see drastic improvements.
I'll be honest: I've been waiting for true NSFW vision support for years now, for moderation and real-world applications. I tried testing other models and they just shut down, so to me they're useless; the real world is not always a safe playpen for toddlers. So this is very much appreciated, and I'm going to give it a try. Thank you for working on this.