Tagging errors
I was thumbing through the dataset and, just from casual examination, found a significant number of tagging errors. Testing the checkpoint, I'm seeing lots of concept-coherence errors.
I suggest stripping the tags from the dataset and rerunning it through a CLIP Interrogator and/or DeepBooru Interrogator to get more accurate tags, as it seems data hygiene on the initial scrape left a lot to be desired. Upon retraining the model, I think the outputs will be vastly improved simply by fixing the tagging errors.
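For what it's worth, the re-tagging pass could be as simple as something like this rough sketch, assuming the `clip-interrogator` package and writing the generated tags into sidecar .txt files (the folder path and file extensions are placeholders; a DeepBooru tagger would slot in the same way):

```python
# Rough sketch: re-tag a folder of images with the CLIP Interrogator and write
# the generated caption into a .txt file next to each image.
# Assumes the `clip-interrogator` package (Config/Interrogator API); the
# dataset folder and file extension are just placeholders.
from pathlib import Path

from PIL import Image
from clip_interrogator import Config, Interrogator

ci = Interrogator(Config(clip_model_name="ViT-L-14/openai"))

dataset_dir = Path("dataset")  # hypothetical dataset folder
for image_path in sorted(dataset_dir.glob("*.png")):
    image = Image.open(image_path).convert("RGB")
    caption = ci.interrogate(image)  # slow on CPU; a GPU is strongly recommended
    image_path.with_suffix(".txt").write_text(caption, encoding="utf-8")
    print(f"{image_path.name}: {caption}")
```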
@Vikingunleashed I agree, the dataset needs to be improved by removing some tags and adding others, and that's something I'm constantly doing. I'm using Hydrus to build the dataset and the tags are getting better over time: I do some manual cleanup every now and then, and I'm also making use of the Hydrus PTR to pull more and better tags for the images in the dataset. Right now the issue is with Hydrus DeepDanbooru, an extension/script I'm using to add some extra tags; the model it uses isn't that good, so I'm trying to find a solution, either by training a better model or by using an alternative to DeepDanbooru. I was considering the CLIP Interrogator, but I'd need to write a script that pulls the images through the Hydrus client API and then sends the tags back after interrogating each image (a rough sketch of what I mean is below). We can actually use as many tools to add tags as we want, as long as the tags are good. I was also considering translating every tag into different languages so the model works better across languages; it has already proven it can work with more than just English, as it now partially handles Spanish, Chinese, and Japanese.
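Something along these lines is what I have in mind for the Hydrus round trip. The endpoint names and payloads are from memory of the Hydrus client API docs and may not match every client version exactly, and `interrogate()` is just a stand-in for whatever tagger (CLIP Interrogator, DeepDanbooru, ...) gets plugged in:

```python
# Very rough sketch of a Hydrus client API round trip: search for files,
# download each one, interrogate it, then push the resulting tags back.
import io
import json

import requests
from PIL import Image

API_URL = "http://127.0.0.1:45869"  # default Hydrus client API port
HEADERS = {"Hydrus-Client-API-Access-Key": "YOUR_ACCESS_KEY"}  # placeholder key


def interrogate(image: Image.Image) -> list[str]:
    """Placeholder for the actual tagger (CLIP Interrogator, DeepDanbooru, ...)."""
    raise NotImplementedError


# 1. Find the files we want to (re)tag.
search = requests.get(
    f"{API_URL}/get_files/search_files",
    params={"tags": json.dumps(["system:everything"])},
    headers=HEADERS,
).json()

for file_id in search["file_ids"]:
    # 2. Pull the image out of Hydrus.
    raw = requests.get(
        f"{API_URL}/get_files/file",
        params={"file_id": file_id},
        headers=HEADERS,
    ).content
    tags = interrogate(Image.open(io.BytesIO(raw)).convert("RGB"))

    # 3. Send the new tags back to a local tag service.
    # NOTE: the exact body of /add_tags/add_tags differs between client
    # versions (service names vs. service keys); check the API docs.
    requests.post(
        f"{API_URL}/add_tags/add_tags",
        json={"file_ids": [file_id], "service_names_to_tags": {"my tags": tags}},
        headers=HEADERS,
    )
```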
@ZeroCool94
Cool, glad you're already on it. That's a lot of work to do by hand, kudos for that.
I did not know Hydrus was a thing and wow that is useful. Definitely gonna use that for sorting my own collections and images. Many thanks!
On a slight tangent, I was just going through the LAION-5B dataset used for Stable Diffusion 2.x. The tagging and the dataset are surprisingly good; I was expecting lots of mistags and such. There are a lot of junk images too, but I was thinking that if you built your model off 2.0 full EMA instead of 1.5, you'd probably get better results overall. Combined with your tag fixes, your model would be pretty damn good.
I was planning to do another version of the model based on Stable Diffusion 2.x, but decided to first make a fine-tune using 1.5 as the base. The idea was mostly to test the dataset and see what we could get by training a model on lots of different tags and content from the INE subreddit and other places, and also to see how well Hydrus and other tools can automate parts of the dataset creation. With Hydrus we can practically subscribe to any site and download from it to build a dataset; even if there isn't a downloader for a site, we can easily make one, and then export the data from there. We can even automate the export, since Hydrus can do it on its own, with options like running it at a fixed interval.

What I plan to do is have Hydrus subscribe to a site and constantly download new data, get tags from the original site plus the Hydrus PTR, then run some scripts to add more tags and do translation. After that, have it export the data automatically every hour or some other fixed interval, and finally have the training script restart automatically every epoch so it can see the new data, or just have it reload the dataset from disk without restarting (roughly like the sketch below). All of that should practically automate the whole training process; we would just have to focus on cleaning the data or changing the training parameters when needed.
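The "no restart, just reload" part could look something like this. The export path, extensions, and `train_step()` are hypothetical stand-ins, not the actual training script:

```python
# Minimal sketch: rebuild the list of training files from the Hydrus
# auto-export folder at the start of every epoch, so images exported since
# the last epoch are picked up without restarting the script.
import random
from pathlib import Path

EXPORT_DIR = Path("/data/hydrus_export")  # hypothetical auto-export target
NUM_EPOCHS = 10
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".webp"}


def train_step(image_path: Path, tag_path: Path) -> None:
    """Placeholder for the real training step (load image + tags, forward, backward)."""


for epoch in range(NUM_EPOCHS):
    # Re-scan the folder each epoch instead of restarting the whole script.
    images = sorted(p for p in EXPORT_DIR.glob("*") if p.suffix.lower() in IMAGE_EXTS)
    random.shuffle(images)
    for image_path in images:
        train_step(image_path, image_path.with_suffix(".txt"))
    print(f"epoch {epoch}: trained on {len(images)} files")
```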
On the LAION-5B topic: you would think it's good, but if you look closer and deeper into it, you'll see it's practically useless as it is right now. The reason you need so many images from it to train is not actually the images themselves but the captions. LAION has only a single caption per image, taken from the page title or the article where the image was found; it's practically a broad scraper, and 90% of the captions are either unrelated to the image or only a word or two of the caption is relevant. In the INE dataset, by contrast, every image has at least 20 tags. Of those, 4 are usually taken from the subreddit when the image is downloaded: one is the title, which would be the exact equivalent of the caption in the LAION dataset, and one is the uploader or author of the post the image was taken from. Even those alone are better than LAION captions, since the titles are usually the topic of the image, or at least related to it, and they often end with the artist's name, something like "by some artist", so we already have some useful information about the image.

On top of that, you can think of each tag on an image as roughly the equivalent of one image in the LAION dataset, so one image in our dataset with 20 tags would be worth about 20 LAION images. Even if some tags are wrong, as long as some are right they partially compensate for the wrong ones; think of them as the bad captions in the LAION dataset that make it into training anyway. Also, the LAION dataset has a fundamental problem: because it uses the filename to carry the caption, it can only have one caption per image. We instead keep the tags in a text file next to each image, with the same filename as the image, so we can have as many tags as we want (see the sketch below). Another thing is that we are using fewer images with more tags, and as we keep improving the tags the model will keep getting better. That also means we can shrink the dataset as the tag quality goes up, if we need to, which then improves training speed since there's less data to process.
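For anyone curious, the sidecar layout is just "image.png + image.txt with comma-separated tags", which a loader can read like this. The class name and transform are illustrative, not the model's actual data loader:

```python
# Sketch of a loader for the sidecar-tag-file layout: each image has a .txt
# file with the same stem holding any number of comma-separated tags, unlike
# LAION-style single captions baked into the filename.
from pathlib import Path

from PIL import Image
from torch.utils.data import Dataset

IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".webp"}


class SidecarTagDataset(Dataset):
    """image.png + image.txt pairs, with as many tags per image as we like."""

    def __init__(self, root: str, transform=None):
        self.paths = sorted(
            p for p in Path(root).glob("*") if p.suffix.lower() in IMAGE_EXTS
        )
        self.transform = transform

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        path = self.paths[idx]
        image = Image.open(path).convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        # The sidecar file holds comma-separated tags; join them into a prompt.
        raw = path.with_suffix(".txt").read_text(encoding="utf-8")
        tags = [t.strip() for t in raw.split(",") if t.strip()]
        return image, ", ".join(tags)
```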