TL;DR: Render each character individually and merge later to avoid SD models freaking out on multiple subjects.

I got this group photo from a frisbee game and was inspired to turn it into a digital painting with SD to share with friends. In the beginning I took the naïve approach of simply running img2img with multi-ControlNet to keep the character poses/faces consistent, and I quickly learned that SD really doesn't excel at rendering multiple people at once: no matter what I tried, it always added more people to the photo (or barely did anything when the denoising strength was set too low). When it came to subjects with interactions (e.g. a hand on a shoulder, an arm around a shoulder), it almost never figured out the correct anatomy. On top of that, trying to inpaint anything with precision is a headache given the complex context of multiple subjects, preferably in different styles, packed so closely together. So I came up with this approach of rendering the characters individually, and I want to share everything I have learned; I personally think it works much better than the naïve approach.

The first step is to get as much information about each subject from the group photo as possible. I used the MediaPipe face preprocessor to extract the face annotations, and the OpenPose editor (yes, manually; that way you can extrapolate or alter each subject's posture) for the characters' poses (Img. 3). I also used ControlNet hed to provide extra detail and consolidate each subject's silhouette/hands (Img. 6). Notice that in Img. 6 the hed for each character is significantly cleaned up (Photoshop black brush) compared to what came out of the preprocessor, since we only want the important lines, and we obviously don't want information from the other subjects. The faces of all subjects are also erased from the hed, since I found it was somehow working against me. It is worth noting that the face, pose and hed annotations (i.e. the relative position of each subject in Img. 5 and Img. 6) must be framed with identical spatial positions so they don't contradict each other in multi-ControlNet.

The second step is to start generating the characters with multi-ControlNet. As stated, I am using face, pose and hed, but I believe hed could be swapped for canny (not tested). The keyword here is "white background" or "greenscreen", which makes cutting the characters out later much easier. I chose a white background for presentation purposes, but you might want to use a greenscreen to save yourself the pain of removing fuzzy edges in post (not tested). What I learned from generating the subjects is that although most models work superbly at creating simple backgrounds, they stop doing so once I introduce ControlNet. They tend to drift towards paint splashes, two-tone backgrounds, etc. (look at Img. 7 for my choice of subject generations; orange/cyan/black splashes occur even when I heavily negative-prompted them). I mainly look for quality and resemblance to the original photo, and remove all unwanted content in Photoshop. The characters that do not interact with other characters (Img. 4, second row) are very easy to generate. For the pairs that do interact (Img. 4, first row), I only rarely found success in generating both subjects at the same time, so I ultimately continued the paradigm of handling subjects separately: given that the annotations for each of the pair are framed spatially consistently, you can generate them separately, merge them in post, and inpaint the interaction part (Img. 8; I manually drew the "hand" to provide inpainting guidance). Also, when picking the results, try to pick characters with a similar lighting direction if possible.

With all your characters upscaled and fine-tuned, the third step is to bring them into an external editor, remove their respective backgrounds, and paste them onto a background image (generated separately). It is vital to keep the edges as fuzz-free as possible (explained in the next step).

Now, the fourth and final step. We have the finished characters on an appropriate background, but they are levitating with no shadows. Here comes the MAGICAL SECRET INGREDIENT: create a mask of the characters in the external editor, with the character regions masked black and the background masked white. Go to img2img -> inpaint upload, and upload the image/mask (e.g. Img. 9). With this step, SD can recondition the background and work out where shadows should be painted given the context of the characters present, while keeping the characters themselves intact. I found that a mask blur of 1 or 2 works best at smoothing out the edges without ruining intricate hands/hair strands, and a denoising strength of no more than 0.5, since it loves adding random characters at higher values. Given the low denoising strength, you can feed the output image back in as the input and redo this step to enhance the shadows and make them more prominent. The reason you must remove as many fuzzy edges as possible in step 3 is that fuzzy edges suggest to SD that the character is glowing or in a foggy environment, and it will introduce fog around the subjects, which is bad for the image.
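I did the step-1 cleanup by hand with the Photoshop black brush, but to make the framing requirement concrete, here is a minimal sketch of the same idea in code, assuming Pillow and NumPy are available; the canvas size, file names and box coordinates are placeholders, not values from my actual project. Every per-subject map stays on the same full-size canvas, and everything that does not belong to that subject is blacked out, so the face, pose and hed inputs for one character never disagree about where that character sits.

```python
import numpy as np
from PIL import Image

CANVAS = (1536, 1024)   # width, height of the shared annotation frame (placeholder)

def isolate_subject(annotation_path, keep_box, erase_boxes=(), out_path="subject_hed.png"):
    """Black out everything outside keep_box (x0, y0, x1, y1), plus any extra
    regions to erase (e.g. the subject's own face in the hed map)."""
    annotation = Image.open(annotation_path).convert("L").resize(CANVAS)
    pixels = np.array(annotation)
    keep = np.zeros_like(pixels)
    x0, y0, x1, y1 = keep_box
    keep[y0:y1, x0:x1] = 1                      # keep only this subject's region
    pixels = pixels * keep
    for ex0, ey0, ex1, ey1 in erase_boxes:      # wipe faces / stray lines ("black brush")
        pixels[ey0:ey1, ex0:ex1] = 0
    Image.fromarray(pixels).save(out_path)      # still CANVAS-sized, so the framing
                                                # matches the face and pose maps exactly

# e.g. isolate_subject("hed_full.png", keep_box=(100, 80, 620, 1000),
#                      erase_boxes=[(250, 120, 420, 300)], out_path="subject1_hed.png")
```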
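And here is a minimal sketch of how steps 3 and 4 could be scripted instead of done in an external editor, again assuming Pillow and NumPy; the file names, paste positions and white threshold are placeholders. It cuts each white-background render out with a hard, binary alpha (the "no fuzzy edges" rule from step 3), pastes the characters onto the separately generated background, and derives the inpaint-upload mask with the characters in black and the background in white.

```python
import numpy as np
from PIL import Image

WHITE_THRESHOLD = 245   # pixels at least this bright on all channels count as "background"

def cut_out(render_path):
    """Return an RGBA character with a binary (fuzz-free) alpha channel."""
    rgb = np.array(Image.open(render_path).convert("RGB"))
    is_background = np.all(rgb >= WHITE_THRESHOLD, axis=-1)     # near-white -> background
    alpha = np.where(is_background, 0, 255).astype(np.uint8)    # hard edge: 0 or 255 only
    return Image.fromarray(np.dstack([rgb, alpha]), mode="RGBA")

background = Image.open("background.png").convert("RGBA")       # generated separately
composite = background.copy()
inpaint_mask = Image.new("L", background.size, 255)             # white = repaint freely

for path, position in [("character_a.png", (40, 200)), ("character_b.png", (700, 180))]:
    character = cut_out(path)
    composite.alpha_composite(character, dest=position)
    # mask the character region black so inpaint upload leaves it untouched
    inpaint_mask.paste(Image.new("L", character.size, 0), position,
                       mask=character.getchannel("A"))

composite.convert("RGB").save("composite.png")      # the img2img input image
inpaint_mask.save("inpaint_mask.png")               # the mask for img2img -> inpaint upload
```

If you went the greenscreen route instead, the only change would be swapping the near-white test for a near-green one; everything else, including the hard 0/255 alpha, stays the same.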
Voila. For those of you who are still here, my goodness, I admire your patience. I hope you enjoyed my 3-day journey of creating this painting; let me know if you have any suggestions, or share the masterpiece you create if this helps in any way. I have never created something I am this proud of, and I am most thankful for the endless possibilities SD and ControlNet are enabling for us now; this is just the beginning. Happy generating.