What libraries can I use for Image-to-Image?

The diffusers, transformers, and transformers.js libraries are compatible with Image-to-Image.

What models can I use for Image-to-Image?

The fal/AuraSR-v2, black-forest-labs/FLUX.1-Kontext-dev, yisol/IDM-VTON, kontext-community/relighting-kontext-dev-lora-v3, black-forest-labs/FLUX.1-Fill-dev, and black-forest-labs/FLUX.1-Depth-dev-lora models can be used for Image-to-Image.

What datasets can I use for Image-to-Image?

The VIDIT, huggan/CelebA-faces, and Spawning/PD12M datasets can be used for Image-to-Image.

What metrics can I use for Image-to-Image?

The PSNR, SSIM, and IS metrics can be used for Image-to-Image.

Image-to-Image

Image-to-image is the task of transforming an input image through a variety of possible manipulations and enhancements, such as super-resolution, image inpainting, colorization, and more.

About Image-to-Image

Image-to-image pipelines can also be used in text-to-image tasks, to provide visual guidance to the text-guided generation process.

Use Cases

Image inpainting

Image inpainting is widely used in photo editing to remove unwanted objects, such as poles, wires, or sensor dust.

Image colorization

Old or black-and-white images can be brought to life using an image colorization model.

Super Resolution

Super-resolution models increase the resolution of an image, allowing for higher-quality viewing and printing.

Inference

You can use the image-to-image pipelines in the 🧨diffusers library to run image-to-image models. The example below uses AutoPipelineForImage2Image with the SDXL refiner checkpoint.

import torch
from diffusers import AutoPipelineForImage2Image
from diffusers.utils import make_image_grid, load_image

pipeline = AutoPipelineForImage2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
)

# Offload idle model components to the CPU to reduce GPU memory usage.
# SDXL is fairly heavy, so this helps on smaller GPUs, usually with only a small impact on speed.
pipeline.enable_model_cpu_offload()

# prepare image
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/img2img-sdxl-init.png"
init_image = load_image(url)

prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"

# pass prompt and image to pipeline
image = pipeline(prompt, image=init_image, strength=0.5).images[0]
make_image_grid([init_image, image], rows=1, cols=2)

You can use huggingface.js to run inference with image-to-image models on the Hugging Face Hub.

import { InferenceClient } from "@huggingface/inference";

const inference = new InferenceClient(HF_TOKEN);
await inference.imageToImage({
    data: await (await fetch("image")).blob(),
    model: "timbrooks/instruct-pix2pix",
    parameters: {
        prompt: "Deblur this image",
    },
});

Use Cases for Text-Guided Image Generation

Style Transfer

One of the most popular use cases of image-to-image is style transfer. With style transfer models:

a regular photo can be transformed into a variety of artistic styles or genres, such as a watercolor painting, a comic book illustration and more.
new images can be generated using a text prompt, in the style of a reference input image.

See the 🧨diffusers example below for style transfer with AutoPipelineForText2Image.

from diffusers import AutoPipelineForText2Image
from diffusers.utils import load_image
import torch

# load pipeline
pipeline = AutoPipelineForText2Image.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16).to("cuda")
pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter_sdxl.bin")

# set the adapter and scales - this is a component that lets us add the style control from an image to the text-to-image model
scale = {
    "down": {"block_2": [0.0, 1.0]},
    "up": {"block_0": [0.0, 1.0, 0.0]},
}
pipeline.set_ip_adapter_scale(scale)

style_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg")

generator = torch.Generator(device="cpu").manual_seed(26)
image = pipeline(
    prompt="a cat, masterpiece, best quality, high quality",
    ip_adapter_image=style_image,
    negative_prompt="text, watermark, lowres, low quality, worst quality, deformed, glitch, low contrast, noisy, saturation, blurry",
    guidance_scale=5,
    num_inference_steps=30,
    generator=generator,
).images[0]
image

ControlNet

Controlling the outputs of diffusion models only with a text prompt is a challenging problem. ControlNet is a neural network model that provides image-based control to diffusion models. Control images can be edges or other landmarks extracted from a source image.
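To make the idea of a control image concrete, here is a minimal sketch of turning a grayscale image into a binary edge map. It is only an illustration: ControlNet checkpoints are commonly conditioned on edges from a proper detector such as Canny, whereas this sketch uses a plain central-difference gradient with a threshold. The function name `edge_map` and the `threshold` default are assumptions made for this example, not part of any library.

```python
def edge_map(gray, threshold=50):
    """Return a binary edge image (0 or 255) from a grayscale image.

    `gray` is a list of rows of intensities in [0, 255]. This is a simple
    gradient-magnitude threshold, a stand-in for a real edge detector.
    """
    height, width = len(gray), len(gray[0])
    edges = [[0] * width for _ in range(height)]
    # Skip the one-pixel border, where central differences are undefined.
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = gray[y][x + 1] - gray[y][x - 1]  # horizontal change
            gy = gray[y + 1][x] - gray[y - 1][x]  # vertical change
            if (gx * gx + gy * gy) ** 0.5 >= threshold:
                edges[y][x] = 255
    return edges


# Toy 5x5 image: dark on the left two columns, bright on the right three.
step_image = [[0, 0, 255, 255, 255] for _ in range(5)]
for row in edge_map(step_image):
    print(row)
```

The resulting edge image would then be passed to a ControlNet pipeline as the conditioning image alongside the text prompt.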

Pix2Pix

Pix2Pix is a popular model for image-to-image translation tasks. It is based on a conditional GAN (generative adversarial network) in which the generator receives a 2D image as input instead of a noise vector. More information is available in the associated paper [1] and its GitHub repository.

The images below show some examples from the Pix2Pix paper. The model can be applied to various use cases. It handles relatively simple tasks, such as converting a grayscale image to a colored version. More importantly, it can generate realistic pictures from rough sketches (as in the purse example) or from painting-like images (as in the street and facade examples).
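For readers who want the conditional-GAN idea in symbols, the training objective from the Pix2Pix paper [1] can be written as below. Here x is the input image, y is the target image, G is the generator, D is the discriminator, and λ weights the L1 reconstruction term. The noise z appears in the formal objective, but the paper supplies it only through dropout, which is consistent with describing the image as the generator's main input.

```latex
\mathcal{L}_{cGAN}(G, D) =
    \mathbb{E}_{x,y}\big[\log D(x, y)\big]
  + \mathbb{E}_{x,z}\big[\log\big(1 - D(x, G(x, z))\big)\big]

\mathcal{L}_{L1}(G) = \mathbb{E}_{x,y,z}\big[\lVert y - G(x, z) \rVert_1\big]

G^{*} = \arg\min_{G}\max_{D}\; \mathcal{L}_{cGAN}(G, D) + \lambda\, \mathcal{L}_{L1}(G)
```

The discriminator sees the input image together with either the real target or the generated output, so it judges whether the pair is plausible rather than the output alone.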

Useful Resources

Image-to-image guide with diffusers
Image inpainting: inpainting with 🧨diffusers, demo
Colorization: demo
Super resolution: image upscaling with 🧨diffusers, demo
Style transfer and layout control with diffusers 🧨
Train your ControlNet with diffusers 🧨
Ultra fast ControlNet with 🧨 Diffusers
List of ControlNets trained in the community JAX Diffusers sprint

References

[1] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, "Image-to-Image Translation with Conditional Adversarial Networks," 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 5967-5976, doi: 10.1109/CVPR.2017.632.

This page was made possible thanks to the efforts of Paul Gafton and Osman Alenbey.

Compatible libraries

Diffusers

Transformers

Transformers.js

Models for Image-to-Image

fal/AuraSR-v2

Image-to-Image • Updated Aug 7, 2024 • 770 • 321

Note An image-to-image model to improve image resolution.

black-forest-labs/FLUX.1-Kontext-dev

Image-to-Image • Updated Jan 1 • 117k • 2.54k

Note Powerful image editing model.

yisol/IDM-VTON

Image-to-Image • Updated Apr 22, 2024 • 12.3k • 696

Note Virtual try-on model.

kontext-community/relighting-kontext-dev-lora-v3

Image-to-Image • Updated Jul 4, 2025 • 727 • 75

Note Image re-lighting model.

black-forest-labs/FLUX.1-Fill-dev

Updated Jun 27, 2025 • 149k • 990

Note Strong model for inpainting and outpainting.

black-forest-labs/FLUX.1-Depth-dev-lora

Updated Jun 27, 2025 • 7.14k • 236

Note Strong model for image editing using depth maps.

Datasets for Image-to-Image

Spawning/PD12M

Viewer • Updated Jan 9, 2025 • 12.4M • 997 • 169

Note 12M image-caption pairs.

Spaces using Image-to-Image

black-forest-labs/FLUX.1-Kontext-Dev

Note Image editing application.

lllyasviel/iclight-v2-vary

Note Image relighting application.

jasperai/Flux.1-dev-Controlnet-Upscaler

Note An application for image upscaling.

Metrics for Image-to-Image

PSNR: Peak Signal-to-Noise Ratio (PSNR) measures reconstruction fidelity as the ratio between the maximum possible pixel intensity and the mean squared error between the output and a reference image. It is only a rough proxy for human perception. It is measured in dB, and a higher value indicates higher fidelity.
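As a rough illustration of how PSNR is computed, the sketch below applies the standard definition, 10 * log10(MAX^2 / MSE), to two small grayscale images stored as nested lists. The function name `psnr`, the `max_value` parameter, and the toy pixel values are assumptions made for this example. In practice you would more likely use an array library or an existing metrics implementation.

```python
import math


def psnr(reference, output, max_value=255.0):
    """PSNR in dB between two same-sized grayscale images (lists of rows)."""
    ref_flat = [p for row in reference for p in row]
    out_flat = [p for row in output for p in row]
    if len(ref_flat) != len(out_flat) or not ref_flat:
        raise ValueError("images must be non-empty and have the same size")
    # Mean squared error over all pixels.
    mse = sum((r - o) ** 2 for r, o in zip(ref_flat, out_flat)) / len(ref_flat)
    if mse == 0:
        return math.inf  # identical images: no noise at all
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy 3x3 example: the "noisy" image differs from the reference in four pixels.
reference = [[52, 55, 61], [59, 79, 61], [76, 62, 59]]
noisy = [[54, 55, 60], [59, 75, 61], [76, 64, 59]]
print(f"PSNR: {psnr(reference, noisy):.2f} dB")
```

Identical images give an infinite PSNR, which is why the zero-error case is handled separately.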

SSIM: Structural Similarity Index (SSIM) is a perceptual metric which compares the luminance, contrast and structure of two images. The values of SSIM range between -1 and 1, and higher values indicate closer resemblance to the original image.
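The sketch below shows how the luminance, contrast, and structure terms combine into a single SSIM value. It is a simplification: the usual SSIM is computed over small local windows and then averaged, while this version treats the whole image as one window. The function name `global_ssim` is an assumption for this example. The stabilizing constants use the commonly cited defaults of 0.01 and 0.03 times the dynamic range.

```python
def global_ssim(x, y, max_value=255.0):
    """Single-window SSIM between two same-sized grayscale images (lists of rows)."""
    xs = [float(p) for row in x for p in row]
    ys = [float(p) for row in y for p in row]
    if len(xs) != len(ys) or not xs:
        raise ValueError("images must be non-empty and have the same size")
    n = len(xs)
    mu_x, mu_y = sum(xs) / n, sum(ys) / n
    var_x = sum((a - mu_x) ** 2 for a in xs) / n
    var_y = sum((b - mu_y) ** 2 for b in ys) / n
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(xs, ys)) / n
    # Small constants that keep the ratio stable when means or variances are near zero.
    c1 = (0.01 * max_value) ** 2
    c2 = (0.03 * max_value) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


checker = [[0, 255], [255, 0]]
inverted = [[255, 0], [0, 255]]
print(global_ssim(checker, checker))   # identical images
print(global_ssim(checker, inverted))  # structurally opposite images
```

Identical images score 1, while an image and its inverse score close to -1, which matches the stated range.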

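IS: Inception Score (IS) evaluates a sample of generated images using the label distributions predicted by an image classification model. The score is higher when each image receives a confident prediction and the predictions across the sample are diverse.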
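A minimal sketch of the computation follows, assuming the classifier's per-image class probabilities are already available. It uses the usual definition, the exponential of the average KL divergence between each image's label distribution and the marginal label distribution over the sample. The function name `inception_score` and the toy probabilities are assumptions for this example. Real evaluations use the softmax outputs of an Inception network over many images.

```python
import math


def inception_score(class_probs):
    """Inception Score from per-image class probabilities (list of distributions)."""
    n = len(class_probs)
    num_classes = len(class_probs[0])
    # Marginal label distribution p(y), averaged over all images.
    marginal = [sum(p[k] for p in class_probs) / n for k in range(num_classes)]
    total_kl = 0.0
    for p in class_probs:
        # KL(p(y|x) || p(y)); terms with zero probability contribute nothing.
        total_kl += sum(
            pk * math.log(pk / marginal[k]) for k, pk in enumerate(p) if pk > 0
        )
    return math.exp(total_kl / n)


# Confident and diverse predictions versus uninformative ones (toy values).
confident_diverse = [[1.0, 0.0], [0.0, 1.0]]
uninformative = [[0.5, 0.5], [0.5, 0.5]]
print(inception_score(confident_diverse))
print(inception_score(uninformative))
```

With two classes the score ranges from 1 (uninformative predictions) up to 2 (confident and evenly spread predictions). In general the upper bound is the number of classes.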