What kind of content goes into training?
Is your aim more general, or are there any specific biases toward certain types of content?
I want to preserve the natural distribution (and thus the breadth of content) that goes into the model, so for the most part the smaller datasets are a random subsample of booru content, with the only filtering being a minimum score.
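For anyone wanting to picture that kind of selection, here is a minimal sketch of a score-filtered uniform subsample. It is not the actual pipeline: the function name `sample_posts`, the dict layout with a `score` key, and the default threshold are all assumptions made for illustration.

```python
import random


def sample_posts(posts, k, min_score=0, seed=0):
    """Uniformly sample up to k posts whose score is at least min_score.

    `posts` is assumed to be a list of dicts with a "score" key
    (hypothetical layout, not a real booru API response).
    """
    # The only filter: drop posts below the minimum score.
    eligible = [p for p in posts if p["score"] >= min_score]
    # A seeded generator keeps the subsample reproducible.
    rng = random.Random(seed)
    # Sample without replacement; cap k at the number of eligible posts.
    return rng.sample(eligible, min(k, len(eligible)))
```

Because every eligible post has the same chance of being picked, the subsample should roughly mirror the tag and content distribution of the filtered source rather than any hand-picked preference.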
Right now I'm working on training across all of e6 (and ideally a sizeable portion of danbooru, so as not to wipe out the anime knowledge that base noob has).
“random subsample... only filtering minimum score”: is this absolute? No personal bias towards exclusions? That's cool if so, especially with expanding Noob with more e6, which naturally means less humanoid bias. Even a minute reduction there is a godsend compared to the available alternatives.
Aha, glad to hear, and yes, there's no intentional bias. Once I have access to compute again, I'll be gathering random danbooru content as well (less of it than e6, which is already 5M).
I used https://huggingface.co/datasets/VelvetToroyashi/BigE6/tree/main in some earlier training, which is just a mix of top and random posts within each rating category (safe, questionable, explicit).
Later training used the aforementioned random subsample of e6 (150k samples). Most of my experimentals (and the V10 preview) are a mix of 43k low-post artists across e6 and danbooru, plus all new content (June 2024 - August 2025) with score > 10, I believe.
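As a rough illustration of how a mix like that could be assembled, here is a sketch that keeps a post if it comes from a low-post artist or if it is recent and scores above a threshold. Everything specific in it is an assumption: the function name `select_mixed`, the `artist`, `created`, and `score` fields, the cutoff of 50 posts for what counts as "low-post", and treating the date window as whole months inclusive. The score threshold of 10 follows the "I believe" figure above and may not be exact.

```python
from collections import Counter
from datetime import date


def select_mixed(
    posts,
    max_artist_posts=50,            # hypothetical cutoff for a "low-post" artist
    new_start=date(2024, 6, 1),     # assumed start of the "new content" window
    new_end=date(2025, 8, 31),      # assumed end of the "new content" window
    new_min_score=10,               # recent posts must score strictly above this
):
    """Keep posts from low-post artists, plus recent posts above a score bar."""
    # Count how many posts each artist has in the pool.
    counts = Counter(p["artist"] for p in posts)
    selected = []
    for p in posts:
        low_post_artist = counts[p["artist"]] <= max_artist_posts
        is_new = (
            new_start <= p["created"] <= new_end
            and p["score"] > new_min_score
        )
        # A post matching both criteria is still added only once.
        if low_post_artist or is_new:
            selected.append(p)
    return selected
```

Taking the union in a single pass avoids duplicating posts that satisfy both criteria, which would otherwise overweight recent work by small artists.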
Further training (V11) will be all of e6 and a random danbooru subset (maybe with the same strategy of incorporating new content into the model).