arxiv:2412.01339

Negative Token Merging: Image-based Adversarial Feature Guidance

Published on Dec 2 · Submitted by jsingh on Dec 6


Authors: Jaskirat Singh, Weijia Shi, Ranjay Krishna, Yejin Choi, Pang Wei Koh, Stephen Gould, Liang Zheng

Abstract

Text-based adversarial guidance using a negative prompt has emerged as a widely adopted approach to push the output features away from undesired concepts. While useful, performing adversarial guidance using text alone can be insufficient to capture complex visual concepts and avoid undesired visual elements like copyrighted characters. In this paper, for the first time we explore an alternate modality in this direction by performing adversarial guidance directly using visual features from a reference image or other images in a batch. In particular, we introduce negative token merging (NegToMe), a simple but effective training-free approach which performs adversarial guidance by selectively pushing apart matching semantic features (between reference and output generation) during the reverse diffusion process. When used w.r.t. other images in the same batch, we observe that NegToMe significantly increases output diversity (racial, gender, visual) without sacrificing output image quality. Similarly, when used w.r.t. a reference copyrighted asset, NegToMe helps reduce visual similarity with copyrighted content by 34.57%. NegToMe is simple to implement in just a few lines of code, adds only marginally (<4%) to inference time, and generalizes to different diffusion architectures like Flux, which do not natively support the use of a separate negative prompt. Code is available at https://negtome.github.io


Community

jsingh

Paper author Paper submitter 18 days ago

TLDR: Text-based adversarial guidance using a negative prompt has emerged as a widely adopted approach for avoiding the generation of undesired concepts. However, capturing complex visual concepts using text alone is often infeasible or simply insufficient (e.g., for removing copyrighted characters). Furthermore, a negative prompt might not be natively supported at all by state-of-the-art guidance-distilled models like Flux.

We propose negative token merging (NegToMe), which performs adversarial guidance directly using images instead of text alone. The key idea is simple: even if describing the undesired concepts in text is not effective or feasible, we can directly use the visual features from a reference image to adversarially guide the generation process.
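The key idea can be illustrated with a minimal NumPy sketch. It assumes token features are plain `(tokens, dim)` arrays, matches each output token to its most similar reference token by cosine similarity, and linearly extrapolates matched tokens away from their match. The function name `negtome_sketch`, the default `alpha` and `threshold` values, and the exact update rule are illustrative assumptions, not the authors' implementation (see the linked code for that).

```python
import numpy as np

def _unit(x, eps=1e-8):
    """Normalize rows to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def negtome_sketch(src, ref, alpha=0.9, threshold=0.65):
    """Push each source token away from its best-matching reference token.

    src: (N, D) token features of the image being generated.
    ref: (M, D) token features of the reference image.
    alpha, threshold: hypothetical defaults chosen for illustration only.
    """
    src = np.asarray(src, dtype=float)
    ref = np.asarray(ref, dtype=float)
    sim = _unit(src) @ _unit(ref).T          # (N, M) cosine similarities
    best = sim.argmax(axis=1)                # closest reference token per source token
    best_sim = sim.max(axis=1)
    matched = ref[best]                      # (N, D) matched reference features
    pushed = src + alpha * (src - matched)   # linear extrapolation away from the match
    mask = (best_sim > threshold)[:, None]   # only modify sufficiently similar tokens
    return np.where(mask, pushed, src)
```

In a real diffusion model such a step would presumably run on attention outputs inside each transformer block at every denoising step; here it is shown on standalone arrays to keep the example self-contained.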

Applications: By simply adjusting the reference image, NegToMe allows for a range of custom applications such as:

Increase Output Diversity: using other images in the same batch as reference improves output diversity (by guiding the visual features of each image away from the others during reverse diffusion).
Copyright Mitigation: using a copyright retrieval database as reference reduces similarity to copyrighted content by 34.57%.
Increase Output Quality: using a blurry reference image improves output aesthetics and details by guiding the outputs away from low-quality features.
Object Feature Interpolation or Extrapolation: guiding the outputs towards or away from the provided reference image.
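As a rough illustration of how the choice of reference changes the application, the sketch below uses the other images in a batch as the reference for each image (the diversity case). It also treats a negative `alpha` as guidance towards the reference (the interpolation case); that sign convention, the helper names, and the parameter values are assumptions made for this example rather than details taken from the paper.

```python
import numpy as np

def _push(src, ref, alpha, threshold, eps=1e-8):
    """Move matched src tokens away from (alpha > 0) or towards (alpha < 0) ref."""
    src_n = src / (np.linalg.norm(src, axis=-1, keepdims=True) + eps)
    ref_n = ref / (np.linalg.norm(ref, axis=-1, keepdims=True) + eps)
    sim = src_n @ ref_n.T                       # (N, M) cosine similarities
    best = sim.argmax(axis=1)                   # best reference match per token
    mask = (sim.max(axis=1) > threshold)[:, None]
    return np.where(mask, src + alpha * (src - ref[best]), src)

def batch_negtome_sketch(batch, alpha=0.9, threshold=0.65):
    """batch: (B, N, D). Each image is guided relative to all other images."""
    batch = np.asarray(batch, dtype=float)
    out = np.empty_like(batch)
    for i in range(len(batch)):
        # Tokens of every other image in the batch, flattened to (M, D).
        others = np.delete(batch, i, axis=0).reshape(-1, batch.shape[-1])
        out[i] = _push(batch[i], others, alpha, threshold)
    return out
```

Swapping `others` for the features of a retrieved copyrighted asset or a blurry image would, under the same assumptions, correspond to the copyright-mitigation and quality use cases in the list above.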

Paper: https://arxiv.org/pdf/2412.01339
Project Page and Demo 🤗: https://negtome.github.io/
Code: https://github.com/1jsingh/negtome

oguzhanercan

18 days ago

Unfortunately, your method weakens prompt-generation alignment. I want realistic photos, so my prompt starts with "Realistic, real life photo of person, ultra realistic facial details." Your model, however, is biased towards generating unrealistic, artistic, or sketch-like images, whereas the base model handles the prompt better.

jsingh

Paper author Paper submitter 18 days ago


Thanks for your interest!

NegToMe actually uses the same base model and just adds a small NegToMe module after the attention calculation.
You can also easily control the degree of such artistic changes by adjusting the threshold or alpha (as in the demo).

Can you please try the online demo? Here is the result we got for the same prompt (fixed seed 0): "Realistic, real life photo of person, ultra realistic facial details."

NegToMe gets much better diversity (ethnic, age, background, pose) while still maintaining high realism!

oguzhanercan

18 days ago

By base model I meant the model without your method. I was using your online demo; here are the results when I increase alpha (0.6 -> 1.3) and decrease it (0.6 -> 0.2).

alpha: 1.3

alpha: 0.2

The problem still appears (for alpha 0.2, in the first and second photos), but not as much as before. Here is my full prompt. Thanks for the feedback about alpha.

Realistic, real life photo of person. The person in the image is a Male.The individual depicted in the image has short, brown hair. His eyes are dark brown, with a subtle crease visible in the eye's inner corner, suggesting a slight droop. The eyebrows are dark brown and neatly groomed, with a slight arch at the outer edges. The nose is straight and proportional to the rest of the face. The individual's skin tone is light brown, with a smooth texture and a subtle sheen. The facial structure appears to be relatively symmetrical, with a straight jawline and a gentle curve to the cheekbones. The individual's age is mid-to-late 30s or early 40s.

jsingh

Paper author Paper submitter 18 days ago

It seems you might be using a very high alpha. Please think of alpha as an easy-to-use parameter for controlling diversity. At high values of alpha you are essentially prioritizing diversity too much (leading to diverse styles, poses, etc.), as in the example you shared.

Here is the result we got for the same prompt (fixed seed 0):

The second image is actually more realistic while also improving diversity!

If you like, you can also use a higher threshold to preserve more details.
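To make the roles of the two demo controls concrete, here is a small sketch on random features. It assumes the same matching-and-extrapolation picture as described earlier in the thread: `threshold` decides which tokens are touched at all, and `alpha` scales how far the touched tokens move. The function `control_stats`, the synthetic data, and the specific values are hypothetical and only meant to show the trend.

```python
import numpy as np

def control_stats(src, ref, alpha, threshold, eps=1e-8):
    """Return (fraction of tokens modified, mean shift per token)."""
    src = np.asarray(src, dtype=float)
    ref = np.asarray(ref, dtype=float)
    src_n = src / (np.linalg.norm(src, axis=-1, keepdims=True) + eps)
    ref_n = ref / (np.linalg.norm(ref, axis=-1, keepdims=True) + eps)
    sim = src_n @ ref_n.T
    best = sim.argmax(axis=1)
    mask = sim.max(axis=1) > threshold        # higher threshold -> fewer tokens touched
    shift = alpha * np.linalg.norm(src - ref[best], axis=1) * mask
    return float(mask.mean()), float(shift.mean())

rng = np.random.default_rng(0)
ref = rng.normal(size=(64, 16))
src = ref + 0.5 * rng.normal(size=(64, 16))   # noisy copies, so good matches exist

for alpha, threshold in [(0.2, 0.65), (1.3, 0.65), (1.3, 0.9)]:
    frac, shift = control_stats(src, ref, alpha, threshold)
    print(f"alpha={alpha}, threshold={threshold}: modified={frac:.2f}, mean shift={shift:.3f}")
```

Under these assumptions, raising alpha at a fixed threshold scales the shift proportionally, while raising the threshold can only leave the same number of tokens or fewer modified, which is one way to read "preserve more details".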

P.S. Please also note that the provided demo is for diversity. Since you are over-specifying all facial details (gender, hair, skin color, face shape, cheekbones, skin shine, etc.), the remaining diversity is limited to subtle variations in face structure.

For instance, here is a result for a generic realistic photo prompt without over-specification: "a realistic photo of a man in suit posing, high resolution, real life photo"

NegToMe gets much better diversity (age, pose, scale, suit and tie color, background etc) while still maintaining high realism!