Roopansh committed on
Commit f4e0db4 • 1 Parent(s): 347f4d6

New Update
.gitattributes CHANGED
@@ -32,7 +32,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.xz filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
- *tfevents* filter=lfs diff=lfs merge=lfs -text
- ckpt/** filter=lfs diff=lfs merge=lfs -text
- assets/teaser.png filter=lfs diff=lfs merge=lfs -text
- assets/teaser2.png filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text

README-2.md DELETED
@@ -1,162 +0,0 @@
-
- <div align="center">
- <h1>IDM-VTON: Improving Diffusion Models for Authentic Virtual Try-on in the Wild</h1>
-
- <a href='https://idm-vton.github.io'><img src='https://img.shields.io/badge/Project-Page-green'></a>
- <a href='https://arxiv.org/abs/2403.05139'><img src='https://img.shields.io/badge/Paper-Arxiv-red'></a>
- <a href='https://huggingface.co/spaces/yisol/IDM-VTON'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Demo-blue'></a>
- <a href='https://huggingface.co/yisol/IDM-VTON'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-blue'></a>
-
- </div>
-
- This is the official implementation of the paper ["Improving Diffusion Models for Authentic Virtual Try-on in the Wild"](https://arxiv.org/abs/2403.05139).
-
- Star ⭐ us if you like it!
-
- ---
-
- <!-- ![teaser2](assets/teaser2.png)&nbsp;
- ![teaser](assets/teaser.png)&nbsp; -->
-
- ## TODO LIST
-
- - [x] demo model
- - [x] inference code
- - [ ] training code
-
- ## Requirements
-
- ```
- git clone https://github.com/yisol/IDM-VTON.git
- cd IDM-VTON
-
- conda env create -f environment.yaml
- conda activate idm
- ```
-
- ## Data preparation
-
- ### VITON-HD
- You can download the VITON-HD dataset from [VITON-HD](https://github.com/shadow2496/VITON-HD).
-
- After downloading the VITON-HD dataset, move vitonhd_test_tagged.json into the test folder.
-
- The structure of the dataset directory should be as follows.
-
- ```
- train
- |-- ...
-
- test
- |-- image
- |-- image-densepose
- |-- agnostic-mask
- |-- cloth
- |-- vitonhd_test_tagged.json
- ```
-
- ### DressCode
- You can download the DressCode dataset from [DressCode](https://github.com/aimagelab/dress-code).
-
- We provide pre-computed densepose images and captions for garments [here](https://kaistackr-my.sharepoint.com/:u:/g/personal/cpis7_kaist_ac_kr/EaIPRG-aiRRIopz9i002FOwBDa-0-BHUKVZ7Ia5yAVVG3A?e=YxkAip).
-
- We used [detectron2](https://github.com/facebookresearch/detectron2) to obtain the densepose images; refer [here](https://github.com/sangyun884/HR-VITON/issues/45) for more details.
-
- After downloading the DressCode dataset, place the image-densepose directories and caption text files as follows.
-
- ```
- DressCode
- |-- dresses
-     |-- images
-     |-- image-densepose
-     |-- dc_caption.txt
-     |-- ...
- |-- lower_body
-     |-- images
-     |-- image-densepose
-     |-- dc_caption.txt
-     |-- ...
- |-- upper_body
-     |-- images
-     |-- image-densepose
-     |-- dc_caption.txt
-     |-- ...
- ```
-
- ## Inference
-
- ### VITON-HD
-
- Run inference with the Python script and arguments:
-
- ```
- accelerate launch inference.py \
-     --width 768 --height 1024 --num_inference_steps 30 \
-     --output_dir "result" \
-     --unpaired \
-     --data_dir "DATA_DIR" \
-     --seed 42 \
-     --test_batch_size 2 \
-     --guidance_scale 2.0
- ```
-
- or simply run the script file:
-
- ```
- sh inference.sh
- ```
-
- ### DressCode
-
- For the DressCode dataset, specify the category you want to generate images for via the category argument:
-
- ```
- accelerate launch inference_dc.py \
-     --width 768 --height 1024 --num_inference_steps 30 \
-     --output_dir "result" \
-     --unpaired \
-     --data_dir "DATA_DIR" \
-     --seed 42 \
-     --test_batch_size 2 \
-     --guidance_scale 2.0 \
-     --category "upper_body"
- ```
-
- or simply run the script file:
-
- ```
- sh inference.sh
- ```
-
- ## Acknowledgements
-
- For the [demo](https://huggingface.co/spaces/yisol/IDM-VTON), GPUs are supported by [ZeroGPU](https://huggingface.co/zero-gpu-explorers), and the mask generation code is based on [OOTDiffusion](https://github.com/levihsu/OOTDiffusion) and [DCI-VTON](https://github.com/bcmi/DCI-VTON-Virtual-Try-On).
-
- Parts of our code are based on [IP-Adapter](https://github.com/tencent-ailab/IP-Adapter).
-
- ## Citation
- ```
- @article{choi2024improving,
-   title={Improving Diffusion Models for Virtual Try-on},
-   author={Choi, Yisol and Kwak, Sangkyung and Lee, Kyungmin and Choi, Hyungwon and Shin, Jinwoo},
-   journal={arXiv preprint arXiv:2403.05139},
-   year={2024}
- }
- ```
-
- ## License
- The code and checkpoints in this repository are released under the [CC BY-NC-SA 4.0 license](https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode).
 
README.md CHANGED
@@ -1,11 +1,14 @@
  ---
- title: AILUSION VTON DEMO V1
- colorForm: yellow
- colorTo: green
+ title: AILUSION VTON DEMO
+ emoji: 👕👔👚
+ colorFrom: yellow
+ colorTo: red
  sdk: gradio
- sdk_version: 4.28.2
+ sdk_version: 4.24.0
  app_file: app.py
  pinned: false
+ license: cc-by-nc-sa-4.0
+ short_description: High-fidelity Virtual Try-on
  ---

- AILUSION V1 DEMO Virtual Try ON
+ Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

app.py CHANGED
@@ -40,7 +40,7 @@ def pil_to_binary_mask(pil_image, threshold=0):
      return output_mask


- base_path = 'Roopansh/Ailusion-VTON-DEMO-v1.1'
+ base_path = 'yisol/IDM-VTON'
  example_path = os.path.join(os.path.dirname(__file__), 'example')

  unet = UNet2DConditionModel.from_pretrained(
@@ -88,8 +88,6 @@
      base_path,
      subfolder="unet_encoder",
      torch_dtype=torch.float16,
-     load_in_8bit=True,
-     max_memory=f'{int(torch.cuda.mem_get_info()[0]/1024**3)-2}GB'
  )

  parsing_model = Parsing(0)
@@ -122,9 +120,8 @@
      torch_dtype=torch.float16,
  )
  pipe.unet_encoder = UNet_Encoder
- pipe.to("cuda")

- @spaces.GPU(duration=120)
+ @spaces.GPU
  def start_tryon(dict,garm_img,garment_des,is_checked,is_checked_crop,denoise_steps,seed):
      device = "cuda"

@@ -263,7 +260,7 @@ for ex_human in human_list_path:

  image_blocks = gr.Blocks().queue()
  with image_blocks as demo:
-     gr.Markdown("## AILUSION VTON 👕👔👚")
+     gr.Markdown("## IDM-VTON 👕👔👚")
      gr.Markdown("Virtual Try-on with your image and garment image. Check out the [source codes](https://github.com/yisol/IDM-VTON) and the [model](https://huggingface.co/yisol/IDM-VTON)")
      with gr.Row():
          with gr.Column():
@@ -313,4 +310,3 @@ with image_blocks as demo:


  image_blocks.launch()
-
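Taken together, the app.py hunks above switch the checkpoint source to 'yisol/IDM-VTON', drop the 8-bit/max-memory loading arguments, remove the module-level pipe.to("cuda"), and replace @spaces.GPU(duration=120) with the plain @spaces.GPU decorator. Below is a minimal, self-contained sketch of the ZeroGPU usage pattern this appears to move toward: build everything on CPU at import time and claim a GPU only inside the decorated handler. The toy `model` and `run` names are illustrative stand-ins, not the app's actual objects.

```
import spaces
import torch

# Stand-in for the try-on pipeline: constructed once at import time, kept on CPU.
model = torch.nn.Linear(4, 4).half()

@spaces.GPU  # no duration argument: fall back to ZeroGPU's default time slice
def run(x: torch.Tensor) -> torch.Tensor:
    device = "cuda"
    model.to(device)              # the GPU is only held while this call runs
    with torch.no_grad():
        return model(x.half().to(device)).float().cpu()
```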
apply_net.py CHANGED
@@ -356,4 +356,4 @@ if __name__ == "__main__":
      main()


- # python ./apply_net.py show ./configs/densepose_rcnn_R_50_FPN_s1x.yaml https://dl.fbaipublicfiles.com/densepose/densepose_rcnn_R_50_FPN_s1x/165712039/model_final_162be9.pkl /home/alin0222/Dresscode/dresses/humanonly dp_segm -v --opts MODEL.DEVICE cuda
+ # python ./apply_net.py show ./configs/densepose_rcnn_R_50_FPN_s1x.yaml https://dl.fbaipublicfiles.com/densepose/densepose_rcnn_R_50_FPN_s1x/165712039/model_final_162be9.pkl /home/alin0222/Dresscode/dresses/humanonly dp_segm -v --opts MODEL.DEVICE cuda
assets/teaser.png DELETED

Git LFS Details

  • SHA256: e0ff5c96023ddf67864dc49acde2fab6a0c982fd77aa4979d9a2e77f45ad0b82
  • Pointer size: 132 Bytes
  • Size of remote file: 7.06 MB
assets/teaser2.png DELETED

Git LFS Details

  • SHA256: 4a2c3522cb7805407f437f1639418166477f334cbef739e06947b5dfc68a1968
  • Pointer size: 132 Bytes
  • Size of remote file: 9.02 MB
environment.yaml DELETED
@@ -1,25 +0,0 @@
- name: idm
- channels:
-   - pytorch
-   - nvidia
-   - defaults
- dependencies:
-   - python=3.10.0=h12debd9_5
-   - pytorch=2.0.1=py3.10_cuda11.8_cudnn8.7.0_0
-   - pytorch-cuda=11.8=h7e8668a_5
-   - torchaudio=2.0.2=py310_cu118
-   - torchtriton=2.0.0=py310
-   - torchvision=0.15.2=py310_cu118
-   - pip=23.3.1=py310h06a4308_0
-
-   - pip:
-     - accelerate==0.25.0
-     - torchmetrics==1.2.1
-     - tqdm==4.66.1
-     - transformers==4.36.2
-     - diffusers==0.25.0
-     - einops==0.7.0
-     - bitsandbytes==0.39.0
-     - scipy==1.11.1
-     - opencv-python
-     - spaces

humanparsing/parsing_atr.onnx DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:04c7d1d070d0e0ae943d86b18cb5aaaea9e278d97462e9cfb270cbbe4cd977f4
- size 266859305

humanparsing/parsing_lip.onnx DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:8436e1dae96e2601c373d1ace29c8f0978b16357d9038c17a8ba756cca376dbc
- size 266863411

image_encoder/config.json DELETED
@@ -1,23 +0,0 @@
- {
-   "_name_or_path": "./image_encoder",
-   "architectures": [
-     "CLIPVisionModelWithProjection"
-   ],
-   "attention_dropout": 0.0,
-   "dropout": 0.0,
-   "hidden_act": "gelu",
-   "hidden_size": 1280,
-   "image_size": 224,
-   "initializer_factor": 1.0,
-   "initializer_range": 0.02,
-   "intermediate_size": 5120,
-   "layer_norm_eps": 1e-05,
-   "model_type": "clip_vision_model",
-   "num_attention_heads": 16,
-   "num_channels": 3,
-   "num_hidden_layers": 32,
-   "patch_size": 14,
-   "projection_dim": 1024,
-   "torch_dtype": "float16",
-   "transformers_version": "4.28.0.dev0"
- }

image_encoder/model.safetensors DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:6ca9667da1ca9e0b0f75e46bb030f7e011f44f86cbfb8d5a36590fcd7507b030
- size 2528373448

inference.py DELETED
@@ -1,425 +0,0 @@
1
- # coding=utf-8
2
- # Copyright 2023 The HuggingFace Inc. team. All rights reserved.
3
- #
4
- # Licensed under the Apache License, Version 2.0 (the "License");
5
- # you may not use this file except in compliance with the License.
6
- # You may obtain a copy of the License at
7
- #
8
- # http://www.apache.org/licenses/LICENSE-2.0
9
- #
10
- # Unless required by applicable law or agreed to in writing, software
11
- # distributed under the License is distributed on an "AS IS" BASIS,
12
- # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13
- # See the License for the specific language governing permissions and
14
- from typing import Any, Callable, Dict, List, Optional, Tuple, Union, Literal
15
- from ip_adapter.ip_adapter import Resampler
16
-
17
- import argparse
18
- import logging
19
- import os
20
- import torch.utils.data as data
21
- import torchvision
22
- import json
23
- import accelerate
24
- import numpy as np
25
- import torch
26
- from PIL import Image
27
- import torch.nn.functional as F
28
- import transformers
29
- from accelerate import Accelerator
30
- from accelerate.logging import get_logger
31
- from accelerate.utils import ProjectConfiguration, set_seed
32
- from packaging import version
33
- from torchvision import transforms
34
- import diffusers
35
- from diffusers import AutoencoderKL, DDPMScheduler, StableDiffusionPipeline, StableDiffusionXLControlNetInpaintPipeline
36
- from transformers import AutoTokenizer, PretrainedConfig,CLIPImageProcessor, CLIPVisionModelWithProjection,CLIPTextModelWithProjection, CLIPTextModel, CLIPTokenizer
37
-
38
- from diffusers.utils.import_utils import is_xformers_available
39
-
40
- from src.unet_hacked_tryon import UNet2DConditionModel
41
- from src.unet_hacked_garmnet import UNet2DConditionModel as UNet2DConditionModel_ref
42
- from src.tryon_pipeline import StableDiffusionXLInpaintPipeline as TryonPipeline
43
-
44
-
45
-
46
- logger = get_logger(__name__, log_level="INFO")
47
-
48
-
49
-
50
- def parse_args():
51
- parser = argparse.ArgumentParser(description="Simple example of a training script.")
52
- parser.add_argument("--pretrained_model_name_or_path",type=str,default= "yisol/IDM-VTON",required=False,)
53
- parser.add_argument("--width",type=int,default=768,)
54
- parser.add_argument("--height",type=int,default=1024,)
55
- parser.add_argument("--num_inference_steps",type=int,default=30,)
56
- parser.add_argument("--output_dir",type=str,default="result",)
57
- parser.add_argument("--unpaired",action="store_true",)
58
- parser.add_argument("--data_dir",type=str,default="/home/omnious/workspace/yisol/Dataset/zalando")
59
- parser.add_argument("--seed", type=int, default=42,)
60
- parser.add_argument("--test_batch_size", type=int, default=2,)
61
- parser.add_argument("--guidance_scale",type=float,default=2.0,)
62
- parser.add_argument("--mixed_precision",type=str,default=None,choices=["no", "fp16", "bf16"],)
63
- parser.add_argument("--enable_xformers_memory_efficient_attention", action="store_true", help="Whether or not to use xformers.")
64
- args = parser.parse_args()
65
-
66
-
67
- return args
68
-
69
- def pil_to_tensor(images):
70
- images = np.array(images).astype(np.float32) / 255.0
71
- images = torch.from_numpy(images.transpose(2, 0, 1))
72
- return images
73
-
74
-
75
- class VitonHDTestDataset(data.Dataset):
76
- def __init__(
77
- self,
78
- dataroot_path: str,
79
- phase: Literal["train", "test"],
80
- order: Literal["paired", "unpaired"] = "paired",
81
- size: Tuple[int, int] = (512, 384),
82
- ):
83
- super(VitonHDTestDataset, self).__init__()
84
- self.dataroot = dataroot_path
85
- self.phase = phase
86
- self.height = size[0]
87
- self.width = size[1]
88
- self.size = size
89
- self.transform = transforms.Compose(
90
- [
91
- transforms.ToTensor(),
92
- transforms.Normalize([0.5], [0.5]),
93
- ]
94
- )
95
- self.toTensor = transforms.ToTensor()
96
-
97
- with open(
98
- os.path.join(dataroot_path, phase, "vitonhd_" + phase + "_tagged.json"), "r"
99
- ) as file1:
100
- data1 = json.load(file1)
101
-
102
- annotation_list = [
103
- "sleeveLength",
104
- "neckLine",
105
- "item",
106
- ]
107
-
108
- self.annotation_pair = {}
109
- for k, v in data1.items():
110
- for elem in v:
111
- annotation_str = ""
112
- for template in annotation_list:
113
- for tag in elem["tag_info"]:
114
- if (
115
- tag["tag_name"] == template
116
- and tag["tag_category"] is not None
117
- ):
118
- annotation_str += tag["tag_category"]
119
- annotation_str += " "
120
- self.annotation_pair[elem["file_name"]] = annotation_str
121
-
122
- self.order = order
123
- self.toTensor = transforms.ToTensor()
124
-
125
- im_names = []
126
- c_names = []
127
- dataroot_names = []
128
-
129
-
130
- if phase == "train":
131
- filename = os.path.join(dataroot_path, f"{phase}_pairs.txt")
132
- else:
133
- filename = os.path.join(dataroot_path, f"{phase}_pairs.txt")
134
-
135
- with open(filename, "r") as f:
136
- for line in f.readlines():
137
- if phase == "train":
138
- im_name, _ = line.strip().split()
139
- c_name = im_name
140
- else:
141
- if order == "paired":
142
- im_name, _ = line.strip().split()
143
- c_name = im_name
144
- else:
145
- im_name, c_name = line.strip().split()
146
-
147
- im_names.append(im_name)
148
- c_names.append(c_name)
149
- dataroot_names.append(dataroot_path)
150
-
151
- self.im_names = im_names
152
- self.c_names = c_names
153
- self.dataroot_names = dataroot_names
154
- self.clip_processor = CLIPImageProcessor()
155
- def __getitem__(self, index):
156
- c_name = self.c_names[index]
157
- im_name = self.im_names[index]
158
- if c_name in self.annotation_pair:
159
- cloth_annotation = self.annotation_pair[c_name]
160
- else:
161
- cloth_annotation = "shirts"
162
- cloth = Image.open(os.path.join(self.dataroot, self.phase, "cloth", c_name))
163
-
164
- im_pil_big = Image.open(
165
- os.path.join(self.dataroot, self.phase, "image", im_name)
166
- ).resize((self.width,self.height))
167
- image = self.transform(im_pil_big)
168
-
169
- mask = Image.open(os.path.join(self.dataroot, self.phase, "agnostic-mask", im_name.replace('.jpg','_mask.png'))).resize((self.width,self.height))
170
- mask = self.toTensor(mask)
171
- mask = mask[:1]
172
- mask = 1-mask
173
- im_mask = image * mask
174
-
175
- pose_img = Image.open(
176
- os.path.join(self.dataroot, self.phase, "image-densepose", im_name)
177
- )
178
- pose_img = self.transform(pose_img) # [-1,1]
179
-
180
- result = {}
181
- result["c_name"] = c_name
182
- result["im_name"] = im_name
183
- result["image"] = image
184
- result["cloth_pure"] = self.transform(cloth)
185
- result["cloth"] = self.clip_processor(images=cloth, return_tensors="pt").pixel_values
186
- result["inpaint_mask"] =1-mask
187
- result["im_mask"] = im_mask
188
- result["caption_cloth"] = "a photo of " + cloth_annotation
189
- result["caption"] = "model is wearing a " + cloth_annotation
190
- result["pose_img"] = pose_img
191
-
192
- return result
193
-
194
- def __len__(self):
195
- # model images + cloth image
196
- return len(self.im_names)
197
-
198
-
199
-
200
-
201
- def main():
202
- args = parse_args()
203
- accelerator_project_config = ProjectConfiguration(project_dir=args.output_dir)
204
- accelerator = Accelerator(
205
- mixed_precision=args.mixed_precision,
206
- project_config=accelerator_project_config,
207
- )
208
- if accelerator.is_local_main_process:
209
- transformers.utils.logging.set_verbosity_warning()
210
- diffusers.utils.logging.set_verbosity_info()
211
- else:
212
- transformers.utils.logging.set_verbosity_error()
213
- diffusers.utils.logging.set_verbosity_error()
214
- # If passed along, set the training seed now.
215
- if args.seed is not None:
216
- set_seed(args.seed)
217
-
218
- # Handle the repository creation
219
- if accelerator.is_main_process:
220
- if args.output_dir is not None:
221
- os.makedirs(args.output_dir, exist_ok=True)
222
-
223
- weight_dtype = torch.float16
224
- # if accelerator.mixed_precision == "fp16":
225
- # weight_dtype = torch.float16
226
- # args.mixed_precision = accelerator.mixed_precision
227
- # elif accelerator.mixed_precision == "bf16":
228
- # weight_dtype = torch.bfloat16
229
- # args.mixed_precision = accelerator.mixed_precision
230
-
231
- # Load scheduler, tokenizer and models.
232
- noise_scheduler = DDPMScheduler.from_pretrained(args.pretrained_model_name_or_path, subfolder="scheduler")
233
- vae = AutoencoderKL.from_pretrained(
234
- args.pretrained_model_name_or_path,
235
- subfolder="vae",
236
- torch_dtype=torch.float16,
237
- )
238
- unet = UNet2DConditionModel.from_pretrained(
239
- args.pretrained_model_name_or_path,
240
- subfolder="unet",
241
- torch_dtype=torch.float16,
242
- )
243
- image_encoder = CLIPVisionModelWithProjection.from_pretrained(
244
- args.pretrained_model_name_or_path,
245
- subfolder="image_encoder",
246
- torch_dtype=torch.float16,
247
- )
248
- UNet_Encoder = UNet2DConditionModel_ref.from_pretrained(
249
- args.pretrained_model_name_or_path,
250
- subfolder="unet_encoder",
251
- torch_dtype=torch.float16,
252
- )
253
- text_encoder_one = CLIPTextModel.from_pretrained(
254
- args.pretrained_model_name_or_path,
255
- subfolder="text_encoder",
256
- torch_dtype=torch.float16,
257
- )
258
- text_encoder_two = CLIPTextModelWithProjection.from_pretrained(
259
- args.pretrained_model_name_or_path,
260
- subfolder="text_encoder_2",
261
- torch_dtype=torch.float16,
262
- )
263
- tokenizer_one = AutoTokenizer.from_pretrained(
264
- args.pretrained_model_name_or_path,
265
- subfolder="tokenizer",
266
- revision=None,
267
- use_fast=False,
268
- )
269
- tokenizer_two = AutoTokenizer.from_pretrained(
270
- args.pretrained_model_name_or_path,
271
- subfolder="tokenizer_2",
272
- revision=None,
273
- use_fast=False,
274
- )
275
-
276
-
277
- # Freeze vae and text_encoder and set unet to trainable
278
- unet.requires_grad_(False)
279
- vae.requires_grad_(False)
280
- image_encoder.requires_grad_(False)
281
- UNet_Encoder.requires_grad_(False)
282
- text_encoder_one.requires_grad_(False)
283
- text_encoder_two.requires_grad_(False)
284
- UNet_Encoder.to(accelerator.device, weight_dtype)
285
- unet.eval()
286
- UNet_Encoder.eval()
287
-
288
-
289
-
290
- if args.enable_xformers_memory_efficient_attention:
291
- if is_xformers_available():
292
- import xformers
293
-
294
- xformers_version = version.parse(xformers.__version__)
295
- if xformers_version == version.parse("0.0.16"):
296
- logger.warn(
297
- "xFormers 0.0.16 cannot be used for training in some GPUs. If you observe problems during training, please update xFormers to at least 0.0.17. See https://huggingface.co/docs/diffusers/main/en/optimization/xformers for more details."
298
- )
299
- unet.enable_xformers_memory_efficient_attention()
300
- else:
301
- raise ValueError("xformers is not available. Make sure it is installed correctly")
302
-
303
- test_dataset = VitonHDTestDataset(
304
- dataroot_path=args.data_dir,
305
- phase="test",
306
- order="unpaired" if args.unpaired else "paired",
307
- size=(args.height, args.width),
308
- )
309
- test_dataloader = torch.utils.data.DataLoader(
310
- test_dataset,
311
- shuffle=False,
312
- batch_size=args.test_batch_size,
313
- num_workers=4,
314
- )
315
-
316
- pipe = TryonPipeline.from_pretrained(
317
- args.pretrained_model_name_or_path,
318
- unet=unet,
319
- vae=vae,
320
- feature_extractor= CLIPImageProcessor(),
321
- text_encoder = text_encoder_one,
322
- text_encoder_2 = text_encoder_two,
323
- tokenizer = tokenizer_one,
324
- tokenizer_2 = tokenizer_two,
325
- scheduler = noise_scheduler,
326
- image_encoder=image_encoder,
327
- torch_dtype=torch.float16,
328
- ).to(accelerator.device)
329
- pipe.unet_encoder = UNet_Encoder
330
-
331
- # pipe.enable_sequential_cpu_offload()
332
- # pipe.enable_model_cpu_offload()
333
- # pipe.enable_vae_slicing()
334
-
335
-
336
-
337
- with torch.no_grad():
338
- # Extract the images
339
- with torch.cuda.amp.autocast():
340
- with torch.no_grad():
341
- for sample in test_dataloader:
342
- img_emb_list = []
343
- for i in range(sample['cloth'].shape[0]):
344
- img_emb_list.append(sample['cloth'][i])
345
-
346
- prompt = sample["caption"]
347
-
348
- num_prompts = sample['cloth'].shape[0]
349
- negative_prompt = "monochrome, lowres, bad anatomy, worst quality, low quality"
350
-
351
- if not isinstance(prompt, List):
352
- prompt = [prompt] * num_prompts
353
- if not isinstance(negative_prompt, List):
354
- negative_prompt = [negative_prompt] * num_prompts
355
-
356
- image_embeds = torch.cat(img_emb_list,dim=0)
357
-
358
- with torch.inference_mode():
359
- (
360
- prompt_embeds,
361
- negative_prompt_embeds,
362
- pooled_prompt_embeds,
363
- negative_pooled_prompt_embeds,
364
- ) = pipe.encode_prompt(
365
- prompt,
366
- num_images_per_prompt=1,
367
- do_classifier_free_guidance=True,
368
- negative_prompt=negative_prompt,
369
- )
370
-
371
-
372
- prompt = sample["caption_cloth"]
373
- negative_prompt = "monochrome, lowres, bad anatomy, worst quality, low quality"
374
-
375
- if not isinstance(prompt, List):
376
- prompt = [prompt] * num_prompts
377
- if not isinstance(negative_prompt, List):
378
- negative_prompt = [negative_prompt] * num_prompts
379
-
380
-
381
- with torch.inference_mode():
382
- (
383
- prompt_embeds_c,
384
- _,
385
- _,
386
- _,
387
- ) = pipe.encode_prompt(
388
- prompt,
389
- num_images_per_prompt=1,
390
- do_classifier_free_guidance=False,
391
- negative_prompt=negative_prompt,
392
- )
393
-
394
-
395
-
396
- generator = torch.Generator(pipe.device).manual_seed(args.seed) if args.seed is not None else None
397
- images = pipe(
398
- prompt_embeds=prompt_embeds,
399
- negative_prompt_embeds=negative_prompt_embeds,
400
- pooled_prompt_embeds=pooled_prompt_embeds,
401
- negative_pooled_prompt_embeds=negative_pooled_prompt_embeds,
402
- num_inference_steps=args.num_inference_steps,
403
- generator=generator,
404
- strength = 1.0,
405
- pose_img = sample['pose_img'],
406
- text_embeds_cloth=prompt_embeds_c,
407
- cloth = sample["cloth_pure"].to(accelerator.device),
408
- mask_image=sample['inpaint_mask'],
409
- image=(sample['image']+1.0)/2.0,
410
- height=args.height,
411
- width=args.width,
412
- guidance_scale=args.guidance_scale,
413
- ip_adapter_image = image_embeds,
414
- )[0]
415
-
416
-
417
- for i in range(len(images)):
418
- x_sample = pil_to_tensor(images[i])
419
- torchvision.utils.save_image(x_sample,os.path.join(args.output_dir,sample['im_name'][i]))
420
-
421
-
422
-
423
-
424
- if __name__ == "__main__":
425
- main()
 
inference.sh DELETED
@@ -1,34 +0,0 @@
- #VITON-HD
- ##paired setting
- accelerate launch inference.py --pretrained_model_name_or_path "yisol/IDM-VTON" \
-     --width 768 --height 1024 --num_inference_steps 30 \
-     --output_dir "result" --data_dir "/home/omnious/workspace/yisol/Dataset/zalando" \
-     --seed 42 --test_batch_size 2 --guidance_scale 2.0
-
-
- ##unpaired setting
- accelerate launch inference.py --pretrained_model_name_or_path "yisol/IDM-VTON" \
-     --width 768 --height 1024 --num_inference_steps 30 \
-     --output_dir "result" --unpaired --data_dir "/home/omnious/workspace/yisol/Dataset/zalando" \
-     --seed 42 --test_batch_size 2 --guidance_scale 2.0
-
-
-
- #DressCode
- ##upper_body
- accelerate launch inference_dc.py --pretrained_model_name_or_path "yisol/IDM-VTON" \
-     --width 768 --height 1024 --num_inference_steps 30 \
-     --output_dir "result" --unpaired --data_dir "/home/omnious/workspace/yisol/DressCode" \
-     --seed 42 --test_batch_size 2 --guidance_scale 2.0 --category "upper_body"
-
- ##lower_body
- accelerate launch inference_dc.py --pretrained_model_name_or_path "yisol/IDM-VTON" \
-     --width 768 --height 1024 --num_inference_steps 30 \
-     --output_dir "result" --unpaired --data_dir "/home/omnious/workspace/yisol/DressCode" \
-     --seed 42 --test_batch_size 2 --guidance_scale 2.0 --category "lower_body"
-
- ##dresses
- accelerate launch inference_dc.py --pretrained_model_name_or_path "yisol/IDM-VTON" \
-     --width 768 --height 1024 --num_inference_steps 30 \
-     --output_dir "result" --unpaired --data_dir "/home/omnious/workspace/yisol/DressCode" \
-     --seed 42 --test_batch_size 2 --guidance_scale 2.0 --category "dresses"

inference_dc.py DELETED
@@ -1,578 +0,0 @@
1
- # coding=utf-8
2
- # Copyright 2023 The HuggingFace Inc. team. All rights reserved.
3
- #
4
- # Licensed under the Apache License, Version 2.0 (the "License");
5
- # you may not use this file except in compliance with the License.
6
- # You may obtain a copy of the License at
7
- #
8
- # http://www.apache.org/licenses/LICENSE-2.0
9
- #
10
- # Unless required by applicable law or agreed to in writing, software
11
- # distributed under the License is distributed on an "AS IS" BASIS,
12
- # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13
- # See the License for the specific language governing permissions and
14
- from typing import Any, Callable, Dict, List, Optional, Tuple, Union, Literal
15
- from ip_adapter.ip_adapter import Resampler
16
-
17
- import argparse
18
- import logging
19
- import os
20
- import torch.utils.data as data
21
- import torchvision
22
- import json
23
- import accelerate
24
- import numpy as np
25
- import torch
26
- from PIL import Image, ImageDraw
27
- import torch.nn.functional as F
28
- import transformers
29
- from accelerate import Accelerator
30
- from accelerate.logging import get_logger
31
- from accelerate.utils import ProjectConfiguration, set_seed
32
- from packaging import version
33
- from torchvision import transforms
34
- import diffusers
35
- from diffusers import AutoencoderKL, DDPMScheduler, StableDiffusionPipeline, StableDiffusionXLControlNetInpaintPipeline
36
- from transformers import AutoTokenizer, PretrainedConfig,CLIPImageProcessor, CLIPVisionModelWithProjection,CLIPTextModelWithProjection, CLIPTextModel, CLIPTokenizer
37
- import cv2
38
- from diffusers.utils.import_utils import is_xformers_available
39
- from numpy.linalg import lstsq
40
-
41
- from src.unet_hacked_tryon import UNet2DConditionModel
42
- from src.unet_hacked_garmnet import UNet2DConditionModel as UNet2DConditionModel_ref
43
- from src.tryon_pipeline import StableDiffusionXLInpaintPipeline as TryonPipeline
44
-
45
-
46
-
47
- logger = get_logger(__name__, log_level="INFO")
48
-
49
- label_map={
50
- "background": 0,
51
- "hat": 1,
52
- "hair": 2,
53
- "sunglasses": 3,
54
- "upper_clothes": 4,
55
- "skirt": 5,
56
- "pants": 6,
57
- "dress": 7,
58
- "belt": 8,
59
- "left_shoe": 9,
60
- "right_shoe": 10,
61
- "head": 11,
62
- "left_leg": 12,
63
- "right_leg": 13,
64
- "left_arm": 14,
65
- "right_arm": 15,
66
- "bag": 16,
67
- "scarf": 17,
68
- }
69
-
70
- def parse_args():
71
- parser = argparse.ArgumentParser(description="Simple example of a training script.")
72
- parser.add_argument("--pretrained_model_name_or_path",type=str,default= "yisol/IDM-VTON",required=False,)
73
- parser.add_argument("--width",type=int,default=768,)
74
- parser.add_argument("--height",type=int,default=1024,)
75
- parser.add_argument("--num_inference_steps",type=int,default=30,)
76
- parser.add_argument("--output_dir",type=str,default="result",)
77
- parser.add_argument("--category",type=str,default="upper_body",choices=["upper_body", "lower_body", "dresses"])
78
- parser.add_argument("--unpaired",action="store_true",)
79
- parser.add_argument("--data_dir",type=str,default="/home/omnious/workspace/yisol/Dataset/zalando")
80
- parser.add_argument("--seed", type=int, default=42,)
81
- parser.add_argument("--test_batch_size", type=int, default=2,)
82
- parser.add_argument("--guidance_scale",type=float,default=2.0,)
83
- parser.add_argument("--mixed_precision",type=str,default=None,choices=["no", "fp16", "bf16"],)
84
- parser.add_argument("--enable_xformers_memory_efficient_attention", action="store_true", help="Whether or not to use xformers.")
85
- args = parser.parse_args()
86
-
87
-
88
- return args
89
-
90
- def pil_to_tensor(images):
91
- images = np.array(images).astype(np.float32) / 255.0
92
- images = torch.from_numpy(images.transpose(2, 0, 1))
93
- return images
94
-
95
-
96
- class DresscodeTestDataset(data.Dataset):
97
- def __init__(
98
- self,
99
- dataroot_path: str,
100
- phase: Literal["train", "test"],
101
- order: Literal["paired", "unpaired"] = "paired",
102
- category = "upper_body",
103
- size: Tuple[int, int] = (512, 384),
104
- ):
105
- super(DresscodeTestDataset, self).__init__()
106
- self.dataroot = os.path.join(dataroot_path,category)
107
- self.phase = phase
108
- self.height = size[0]
109
- self.width = size[1]
110
- self.size = size
111
- self.transform = transforms.Compose(
112
- [
113
- transforms.ToTensor(),
114
- transforms.Normalize([0.5], [0.5]),
115
- ]
116
- )
117
- self.toTensor = transforms.ToTensor()
118
- self.order = order
119
- self.radius = 5
120
- self.category = category
121
- im_names = []
122
- c_names = []
123
-
124
-
125
- if phase == "train":
126
- filename = os.path.join(dataroot_path,category, f"{phase}_pairs.txt")
127
- else:
128
- filename = os.path.join(dataroot_path,category, f"{phase}_pairs_{order}.txt")
129
-
130
- with open(filename, "r") as f:
131
- for line in f.readlines():
132
- im_name, c_name = line.strip().split()
133
-
134
- im_names.append(im_name)
135
- c_names.append(c_name)
136
-
137
-
138
- file_path = os.path.join(dataroot_path,category,"dc_caption.txt")
139
-
140
- self.annotation_pair = {}
141
- with open(file_path, "r") as file:
142
- for line in file:
143
- parts = line.strip().split(" ")
144
- self.annotation_pair[parts[0]] = ' '.join(parts[1:])
145
-
146
-
147
- self.im_names = im_names
148
- self.c_names = c_names
149
- self.clip_processor = CLIPImageProcessor()
150
- def __getitem__(self, index):
151
- c_name = self.c_names[index]
152
- im_name = self.im_names[index]
153
- if c_name in self.annotation_pair:
154
- cloth_annotation = self.annotation_pair[c_name]
155
- else:
156
- cloth_annotation = self.category
157
- cloth = Image.open(os.path.join(self.dataroot, "images", c_name))
158
-
159
- im_pil_big = Image.open(
160
- os.path.join(self.dataroot, "images", im_name)
161
- ).resize((self.width,self.height))
162
- image = self.transform(im_pil_big)
163
-
164
-
165
-
166
-
167
- skeleton = Image.open(os.path.join(self.dataroot, 'skeletons', im_name.replace("_0", "_5")))
168
- skeleton = skeleton.resize((self.width, self.height))
169
- skeleton = self.transform(skeleton)
170
-
171
- # Label Map
172
- parse_name = im_name.replace('_0.jpg', '_4.png')
173
- im_parse = Image.open(os.path.join(self.dataroot, 'label_maps', parse_name))
174
- im_parse = im_parse.resize((self.width, self.height), Image.NEAREST)
175
- parse_array = np.array(im_parse)
176
-
177
- # Load pose points
178
- pose_name = im_name.replace('_0.jpg', '_2.json')
179
- with open(os.path.join(self.dataroot, 'keypoints', pose_name), 'r') as f:
180
- pose_label = json.load(f)
181
- pose_data = pose_label['keypoints']
182
- pose_data = np.array(pose_data)
183
- pose_data = pose_data.reshape((-1, 4))
184
-
185
- point_num = pose_data.shape[0]
186
- pose_map = torch.zeros(point_num, self.height, self.width)
187
- r = self.radius * (self.height / 512.0)
188
- for i in range(point_num):
189
- one_map = Image.new('L', (self.width, self.height))
190
- draw = ImageDraw.Draw(one_map)
191
- point_x = np.multiply(pose_data[i, 0], self.width / 384.0)
192
- point_y = np.multiply(pose_data[i, 1], self.height / 512.0)
193
- if point_x > 1 and point_y > 1:
194
- draw.rectangle((point_x - r, point_y - r, point_x + r, point_y + r), 'white', 'white')
195
- one_map = self.toTensor(one_map)
196
- pose_map[i] = one_map[0]
197
-
198
- agnostic_mask = self.get_agnostic(parse_array, pose_data, self.category, (self.width,self.height))
199
- # agnostic_mask = transforms.functional.resize(agnostic_mask, (self.height, self.width),
200
- # interpolation=transforms.InterpolationMode.NEAREST)
201
-
202
- mask = 1 - agnostic_mask
203
- im_mask = image * agnostic_mask
204
-
205
- pose_img = Image.open(
206
- os.path.join(self.dataroot, "image-densepose", im_name)
207
- )
208
- pose_img = self.transform(pose_img) # [-1,1]
209
-
210
- result = {}
211
- result["c_name"] = c_name
212
- result["im_name"] = im_name
213
- result["image"] = image
214
- result["cloth_pure"] = self.transform(cloth)
215
- result["cloth"] = self.clip_processor(images=cloth, return_tensors="pt").pixel_values
216
- result["inpaint_mask"] =1-mask
217
- result["im_mask"] = im_mask
218
- result["caption_cloth"] = "a photo of " + cloth_annotation
219
- result["caption"] = "model is wearing a " + cloth_annotation
220
- result["pose_img"] = pose_img
221
-
222
- return result
223
-
224
- def __len__(self):
225
- # model images + cloth image
226
- return len(self.im_names)
227
-
228
-
229
-
230
-
231
- def get_agnostic(self,parse_array, pose_data, category, size):
232
- parse_shape = (parse_array > 0).astype(np.float32)
233
-
234
- parse_head = (parse_array == 1).astype(np.float32) + \
235
- (parse_array == 2).astype(np.float32) + \
236
- (parse_array == 3).astype(np.float32) + \
237
- (parse_array == 11).astype(np.float32)
238
-
239
- parser_mask_fixed = (parse_array == label_map["hair"]).astype(np.float32) + \
240
- (parse_array == label_map["left_shoe"]).astype(np.float32) + \
241
- (parse_array == label_map["right_shoe"]).astype(np.float32) + \
242
- (parse_array == label_map["hat"]).astype(np.float32) + \
243
- (parse_array == label_map["sunglasses"]).astype(np.float32) + \
244
- (parse_array == label_map["scarf"]).astype(np.float32) + \
245
- (parse_array == label_map["bag"]).astype(np.float32)
246
-
247
- parser_mask_changeable = (parse_array == label_map["background"]).astype(np.float32)
248
-
249
- arms = (parse_array == 14).astype(np.float32) + (parse_array == 15).astype(np.float32)
250
-
251
- if category == 'dresses':
252
- label_cat = 7
253
- parse_mask = (parse_array == 7).astype(np.float32) + \
254
- (parse_array == 12).astype(np.float32) + \
255
- (parse_array == 13).astype(np.float32)
256
- parser_mask_changeable += np.logical_and(parse_array, np.logical_not(parser_mask_fixed))
257
-
258
- elif category == 'upper_body':
259
- label_cat = 4
260
- parse_mask = (parse_array == 4).astype(np.float32)
261
-
262
- parser_mask_fixed += (parse_array == label_map["skirt"]).astype(np.float32) + \
263
- (parse_array == label_map["pants"]).astype(np.float32)
264
-
265
- parser_mask_changeable += np.logical_and(parse_array, np.logical_not(parser_mask_fixed))
266
- elif category == 'lower_body':
267
- label_cat = 6
268
- parse_mask = (parse_array == 6).astype(np.float32) + \
269
- (parse_array == 12).astype(np.float32) + \
270
- (parse_array == 13).astype(np.float32)
271
-
272
- parser_mask_fixed += (parse_array == label_map["upper_clothes"]).astype(np.float32) + \
273
- (parse_array == 14).astype(np.float32) + \
274
- (parse_array == 15).astype(np.float32)
275
- parser_mask_changeable += np.logical_and(parse_array, np.logical_not(parser_mask_fixed))
276
-
277
- parse_head = torch.from_numpy(parse_head) # [0,1]
278
- parse_mask = torch.from_numpy(parse_mask) # [0,1]
279
- parser_mask_fixed = torch.from_numpy(parser_mask_fixed)
280
- parser_mask_changeable = torch.from_numpy(parser_mask_changeable)
281
-
282
- # dilation
283
- parse_without_cloth = np.logical_and(parse_shape, np.logical_not(parse_mask))
284
- parse_mask = parse_mask.cpu().numpy()
285
-
286
- width = size[0]
287
- height = size[1]
288
-
289
- im_arms = Image.new('L', (width, height))
290
- arms_draw = ImageDraw.Draw(im_arms)
291
- if category == 'dresses' or category == 'upper_body':
292
- shoulder_right = tuple(np.multiply(pose_data[2, :2], height / 512.0))
293
- shoulder_left = tuple(np.multiply(pose_data[5, :2], height / 512.0))
294
- elbow_right = tuple(np.multiply(pose_data[3, :2], height / 512.0))
295
- elbow_left = tuple(np.multiply(pose_data[6, :2], height / 512.0))
296
- wrist_right = tuple(np.multiply(pose_data[4, :2], height / 512.0))
297
- wrist_left = tuple(np.multiply(pose_data[7, :2], height / 512.0))
298
- if wrist_right[0] <= 1. and wrist_right[1] <= 1.:
299
- if elbow_right[0] <= 1. and elbow_right[1] <= 1.:
300
- arms_draw.line([wrist_left, elbow_left, shoulder_left, shoulder_right], 'white', 30, 'curve')
301
- else:
302
- arms_draw.line([wrist_left, elbow_left, shoulder_left, shoulder_right, elbow_right], 'white', 30,
303
- 'curve')
304
- elif wrist_left[0] <= 1. and wrist_left[1] <= 1.:
305
- if elbow_left[0] <= 1. and elbow_left[1] <= 1.:
306
- arms_draw.line([shoulder_left, shoulder_right, elbow_right, wrist_right], 'white', 30, 'curve')
307
- else:
308
- arms_draw.line([elbow_left, shoulder_left, shoulder_right, elbow_right, wrist_right], 'white', 30,
309
- 'curve')
310
- else:
311
- arms_draw.line([wrist_left, elbow_left, shoulder_left, shoulder_right, elbow_right, wrist_right], 'white',
312
- 30, 'curve')
313
-
314
- if height > 512:
315
- im_arms = cv2.dilate(np.float32(im_arms), np.ones((10, 10), np.uint16), iterations=5)
316
- elif height > 256:
317
- im_arms = cv2.dilate(np.float32(im_arms), np.ones((5, 5), np.uint16), iterations=5)
318
- hands = np.logical_and(np.logical_not(im_arms), arms)
319
- parse_mask += im_arms
320
- parser_mask_fixed += hands
321
-
322
- # delete neck
323
- parse_head_2 = torch.clone(parse_head)
324
- if category == 'dresses' or category == 'upper_body':
325
- points = []
326
- points.append(np.multiply(pose_data[2, :2], height / 512.0))
327
- points.append(np.multiply(pose_data[5, :2], height / 512.0))
328
- x_coords, y_coords = zip(*points)
329
- A = np.vstack([x_coords, np.ones(len(x_coords))]).T
330
- m, c = lstsq(A, y_coords, rcond=None)[0]
331
- for i in range(parse_array.shape[1]):
332
- y = i * m + c
333
- parse_head_2[int(y - 20 * (height / 512.0)):, i] = 0
334
-
335
- parser_mask_fixed = np.logical_or(parser_mask_fixed, np.array(parse_head_2, dtype=np.uint16))
336
- parse_mask += np.logical_or(parse_mask, np.logical_and(np.array(parse_head, dtype=np.uint16),
337
- np.logical_not(np.array(parse_head_2, dtype=np.uint16))))
338
-
339
- if height > 512:
340
- parse_mask = cv2.dilate(parse_mask, np.ones((20, 20), np.uint16), iterations=5)
341
- elif height > 256:
342
- parse_mask = cv2.dilate(parse_mask, np.ones((10, 10), np.uint16), iterations=5)
343
- else:
344
- parse_mask = cv2.dilate(parse_mask, np.ones((5, 5), np.uint16), iterations=5)
345
- parse_mask = np.logical_and(parser_mask_changeable, np.logical_not(parse_mask))
346
- parse_mask_total = np.logical_or(parse_mask, parser_mask_fixed)
347
- agnostic_mask = parse_mask_total.unsqueeze(0)
348
- return agnostic_mask
349
-
350
-
351
-
352
-
353
- def main():
354
- args = parse_args()
355
- accelerator_project_config = ProjectConfiguration(project_dir=args.output_dir)
356
- accelerator = Accelerator(
357
- mixed_precision=args.mixed_precision,
358
- project_config=accelerator_project_config,
359
- )
360
- if accelerator.is_local_main_process:
361
- transformers.utils.logging.set_verbosity_warning()
362
- diffusers.utils.logging.set_verbosity_info()
363
- else:
364
- transformers.utils.logging.set_verbosity_error()
365
- diffusers.utils.logging.set_verbosity_error()
366
- # If passed along, set the training seed now.
367
- if args.seed is not None:
368
- set_seed(args.seed)
369
-
370
- # Handle the repository creation
371
- if accelerator.is_main_process:
372
- if args.output_dir is not None:
373
- os.makedirs(args.output_dir, exist_ok=True)
374
-
375
- weight_dtype = torch.float16
376
- # if accelerator.mixed_precision == "fp16":
377
- # weight_dtype = torch.float16
378
- # args.mixed_precision = accelerator.mixed_precision
379
- # elif accelerator.mixed_precision == "bf16":
380
- # weight_dtype = torch.bfloat16
381
- # args.mixed_precision = accelerator.mixed_precision
382
-
383
- # Load scheduler, tokenizer and models.
384
- noise_scheduler = DDPMScheduler.from_pretrained(args.pretrained_model_name_or_path, subfolder="scheduler")
385
- vae = AutoencoderKL.from_pretrained(
386
- args.pretrained_model_name_or_path,
387
- subfolder="vae",
388
- torch_dtype=torch.float16,
389
- )
390
- unet = UNet2DConditionModel.from_pretrained(
391
- "yisol/IDM-VTON-DC",
392
- subfolder="unet",
393
- torch_dtype=torch.float16,
394
- )
395
- image_encoder = CLIPVisionModelWithProjection.from_pretrained(
396
- args.pretrained_model_name_or_path,
397
- subfolder="image_encoder",
398
- torch_dtype=torch.float16,
399
- )
400
- UNet_Encoder = UNet2DConditionModel_ref.from_pretrained(
401
- args.pretrained_model_name_or_path,
402
- subfolder="unet_encoder",
403
- torch_dtype=torch.float16,
404
- )
405
- text_encoder_one = CLIPTextModel.from_pretrained(
406
- args.pretrained_model_name_or_path,
407
- subfolder="text_encoder",
408
- torch_dtype=torch.float16,
409
- )
410
- text_encoder_two = CLIPTextModelWithProjection.from_pretrained(
411
- args.pretrained_model_name_or_path,
412
- subfolder="text_encoder_2",
413
- torch_dtype=torch.float16,
414
- )
415
- tokenizer_one = AutoTokenizer.from_pretrained(
416
- args.pretrained_model_name_or_path,
417
- subfolder="tokenizer",
418
- revision=None,
419
- use_fast=False,
420
- )
421
- tokenizer_two = AutoTokenizer.from_pretrained(
422
- args.pretrained_model_name_or_path,
423
- subfolder="tokenizer_2",
424
- revision=None,
425
- use_fast=False,
426
- )
427
-
428
-
429
- # Freeze vae and text_encoder and set unet to trainable
430
- unet.requires_grad_(False)
431
- vae.requires_grad_(False)
432
- image_encoder.requires_grad_(False)
433
- UNet_Encoder.requires_grad_(False)
434
- text_encoder_one.requires_grad_(False)
435
- text_encoder_two.requires_grad_(False)
436
- UNet_Encoder.to(accelerator.device, weight_dtype)
437
- unet.eval()
438
- UNet_Encoder.eval()
439
-
440
-
441
-
442
- if args.enable_xformers_memory_efficient_attention:
443
- if is_xformers_available():
444
- import xformers
445
-
446
- xformers_version = version.parse(xformers.__version__)
447
- if xformers_version == version.parse("0.0.16"):
448
- logger.warn(
449
- "xFormers 0.0.16 cannot be used for training in some GPUs. If you observe problems during training, please update xFormers to at least 0.0.17. See https://huggingface.co/docs/diffusers/main/en/optimization/xformers for more details."
450
- )
451
- unet.enable_xformers_memory_efficient_attention()
452
- else:
453
- raise ValueError("xformers is not available. Make sure it is installed correctly")
454
-
455
- test_dataset = DresscodeTestDataset(
456
- dataroot_path=args.data_dir,
457
- phase="test",
458
- order="unpaired" if args.unpaired else "paired",
459
- category = args.category,
460
- size=(args.height, args.width),
461
- )
462
- test_dataloader = torch.utils.data.DataLoader(
463
- test_dataset,
464
- shuffle=False,
465
- batch_size=args.test_batch_size,
466
- num_workers=4,
467
- )
468
-
469
- pipe = TryonPipeline.from_pretrained(
470
- args.pretrained_model_name_or_path,
471
- unet=unet,
472
- vae=vae,
473
- feature_extractor= CLIPImageProcessor(),
474
- text_encoder = text_encoder_one,
475
- text_encoder_2 = text_encoder_two,
476
- tokenizer = tokenizer_one,
477
- tokenizer_2 = tokenizer_two,
478
- scheduler = noise_scheduler,
479
- image_encoder=image_encoder,
480
- torch_dtype=torch.float16,
481
- ).to(accelerator.device)
482
- pipe.unet_encoder = UNet_Encoder
483
-
484
- # pipe.enable_sequential_cpu_offload()
485
- # pipe.enable_model_cpu_offload()
486
- # pipe.enable_vae_slicing()
487
-
488
-
489
-
490
- with torch.no_grad():
491
- # Extract the images
492
- with torch.cuda.amp.autocast():
493
- with torch.no_grad():
494
- for sample in test_dataloader:
495
- img_emb_list = []
496
- for i in range(sample['cloth'].shape[0]):
497
- img_emb_list.append(sample['cloth'][i])
498
-
499
- prompt = sample["caption"]
500
-
501
- num_prompts = sample['cloth'].shape[0]
502
- negative_prompt = "monochrome, lowres, bad anatomy, worst quality, low quality"
503
-
504
- if not isinstance(prompt, List):
505
- prompt = [prompt] * num_prompts
506
- if not isinstance(negative_prompt, List):
507
- negative_prompt = [negative_prompt] * num_prompts
508
-
509
- image_embeds = torch.cat(img_emb_list,dim=0)
510
-
511
- with torch.inference_mode():
512
- (
513
- prompt_embeds,
514
- negative_prompt_embeds,
515
- pooled_prompt_embeds,
516
- negative_pooled_prompt_embeds,
517
- ) = pipe.encode_prompt(
518
- prompt,
519
- num_images_per_prompt=1,
520
- do_classifier_free_guidance=True,
521
- negative_prompt=negative_prompt,
522
- )
523
-
524
-
525
- prompt = sample["caption_cloth"]
526
- negative_prompt = "monochrome, lowres, bad anatomy, worst quality, low quality"
527
-
528
- if not isinstance(prompt, List):
529
- prompt = [prompt] * num_prompts
530
- if not isinstance(negative_prompt, List):
531
- negative_prompt = [negative_prompt] * num_prompts
532
-
533
-
534
- with torch.inference_mode():
535
- (
536
- prompt_embeds_c,
537
- _,
538
- _,
539
- _,
540
- ) = pipe.encode_prompt(
541
- prompt,
542
- num_images_per_prompt=1,
543
- do_classifier_free_guidance=False,
544
- negative_prompt=negative_prompt,
545
- )
546
-
547
-
548
-
549
- generator = torch.Generator(pipe.device).manual_seed(args.seed) if args.seed is not None else None
550
- images = pipe(
551
- prompt_embeds=prompt_embeds,
552
- negative_prompt_embeds=negative_prompt_embeds,
553
- pooled_prompt_embeds=pooled_prompt_embeds,
554
- negative_pooled_prompt_embeds=negative_pooled_prompt_embeds,
555
- num_inference_steps=args.num_inference_steps,
556
- generator=generator,
557
- strength = 1.0,
558
- pose_img = sample['pose_img'],
559
- text_embeds_cloth=prompt_embeds_c,
560
- cloth = sample["cloth_pure"].to(accelerator.device),
561
- mask_image=sample['inpaint_mask'],
562
- image=(sample['image']+1.0)/2.0,
563
- height=args.height,
564
- width=args.width,
565
- guidance_scale=args.guidance_scale,
566
- ip_adapter_image = image_embeds,
567
- )[0]
568
-
569
-
570
- for i in range(len(images)):
571
- x_sample = pil_to_tensor(images[i])
572
- torchvision.utils.save_image(x_sample,os.path.join(args.output_dir,sample['im_name'][i]))
573
-
574
-
575
-
576
-
577
- if __name__ == "__main__":
578
- main()
 
openpose/ckpts/body_pose_model.pth DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:25a948c16078b0f08e236bda51a385d855ef4c153598947c28c0d47ed94bb746
- size 209267595

requirements.txt CHANGED
@@ -2,9 +2,9 @@ transformers==4.36.2
  torch==2.0.1
  torchvision==0.15.2
  torchaudio==2.0.2
- numpy
- scipy
- scikit-image
+ numpy==1.24.4
+ scipy==1.10.1
+ scikit-image==0.21.0
  opencv-python==4.7.0.72
  pillow==9.4.0
  diffusers==0.25.0
@@ -20,4 +20,4 @@ av
  fvcore
  cloudpickle
  omegaconf
- pycocotools
+ pycocotools
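After installing from the updated requirements.txt, the newly added pins can be checked against what pip actually resolved. This is a small illustrative sketch, not part of the repository; the package names and versions are taken only from the hunk above.

```
from importlib.metadata import version

# Version pins introduced by this commit (see the requirements.txt hunk above).
pins = {"numpy": "1.24.4", "scipy": "1.10.1", "scikit-image": "0.21.0"}

for pkg, pinned in pins.items():
    installed = version(pkg)
    status = "OK" if installed == pinned else f"MISMATCH (pinned {pinned})"
    print(f"{pkg}: installed {installed} -> {status}")
```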
scheduler/scheduler_config.json DELETED
@@ -1,19 +0,0 @@
- {
-   "_class_name": "DDPMScheduler",
-   "_diffusers_version": "0.21.0.dev0",
-   "beta_end": 0.012,
-   "beta_schedule": "scaled_linear",
-   "beta_start": 0.00085,
-   "clip_sample": false,
-   "interpolation_type": "linear",
-   "num_train_timesteps": 1000,
-   "prediction_type": "epsilon",
-   "sample_max_value": 1.0,
-   "set_alpha_to_one": false,
-   "skip_prk_steps": true,
-   "steps_offset": 1,
-   "timestep_spacing": "leading",
-   "trained_betas": null,
-   "use_karras_sigmas": false,
-   "rescale_betas_zero_snr": true
- }

text_encoder/config.json DELETED
@@ -1,25 +0,0 @@
- {
-   "_name_or_path": "/home/suraj_huggingface_co/.cache/huggingface/hub/models--stabilityai--stable-diffusion-xl-base-1.0/snapshots/bf714989e22c57ddc1c453bf74dab4521acb81d8/text_encoder",
-   "architectures": [
-     "CLIPTextModel"
-   ],
-   "attention_dropout": 0.0,
-   "bos_token_id": 0,
-   "dropout": 0.0,
-   "eos_token_id": 2,
-   "hidden_act": "quick_gelu",
-   "hidden_size": 768,
-   "initializer_factor": 1.0,
-   "initializer_range": 0.02,
-   "intermediate_size": 3072,
-   "layer_norm_eps": 1e-05,
-   "max_position_embeddings": 77,
-   "model_type": "clip_text_model",
-   "num_attention_heads": 12,
-   "num_hidden_layers": 12,
-   "pad_token_id": 1,
-   "projection_dim": 768,
-   "torch_dtype": "float16",
-   "transformers_version": "4.29.2",
-   "vocab_size": 49408
- }

text_encoder/model.safetensors DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:79f531155c765c22c89e23328793a2e91a1178070af961c57e2eae5f0509b65b
- size 492265879

text_encoder_2/config.json DELETED
@@ -1,25 +0,0 @@
- {
-   "_name_or_path": "/home/suraj_huggingface_co/.cache/huggingface/hub/models--stabilityai--stable-diffusion-xl-base-1.0/snapshots/bf714989e22c57ddc1c453bf74dab4521acb81d8/text_encoder_2",
-   "architectures": [
-     "CLIPTextModelWithProjection"
-   ],
-   "attention_dropout": 0.0,
-   "bos_token_id": 0,
-   "dropout": 0.0,
-   "eos_token_id": 2,
-   "hidden_act": "gelu",
-   "hidden_size": 1280,
-   "initializer_factor": 1.0,
-   "initializer_range": 0.02,
-   "intermediate_size": 5120,
-   "layer_norm_eps": 1e-05,
-   "max_position_embeddings": 77,
-   "model_type": "clip_text_model",
-   "num_attention_heads": 20,
-   "num_hidden_layers": 32,
-   "pad_token_id": 1,
-   "projection_dim": 1280,
-   "torch_dtype": "float16",
-   "transformers_version": "4.29.2",
-   "vocab_size": 49408
- }

text_encoder_2/model.safetensors DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:283bb90f987a133dec11947571aca17692ed32f3fff708441ac8eedcfa4a040e
- size 2778702976

tokenizer/merges.txt DELETED
The diff for this file is too large to render. See raw diff
 
tokenizer/special_tokens_map.json DELETED
@@ -1,24 +0,0 @@
- {
-   "bos_token": {
-     "content": "<|startoftext|>",
-     "lstrip": false,
-     "normalized": true,
-     "rstrip": false,
-     "single_word": false
-   },
-   "eos_token": {
-     "content": "<|endoftext|>",
-     "lstrip": false,
-     "normalized": true,
-     "rstrip": false,
-     "single_word": false
-   },
-   "pad_token": "<|endoftext|>",
-   "unk_token": {
-     "content": "<|endoftext|>",
-     "lstrip": false,
-     "normalized": true,
-     "rstrip": false,
-     "single_word": false
-   }
- }

tokenizer/tokenizer_config.json DELETED
@@ -1,33 +0,0 @@
- {
- "add_prefix_space": false,
- "bos_token": {
- "__type": "AddedToken",
- "content": "<|startoftext|>",
- "lstrip": false,
- "normalized": true,
- "rstrip": false,
- "single_word": false
- },
- "clean_up_tokenization_spaces": true,
- "do_lower_case": true,
- "eos_token": {
- "__type": "AddedToken",
- "content": "<|endoftext|>",
- "lstrip": false,
- "normalized": true,
- "rstrip": false,
- "single_word": false
- },
- "errors": "replace",
- "model_max_length": 77,
- "pad_token": "<|endoftext|>",
- "tokenizer_class": "CLIPTokenizer",
- "unk_token": {
- "__type": "AddedToken",
- "content": "<|endoftext|>",
- "lstrip": false,
- "normalized": true,
- "rstrip": false,
- "single_word": false
- }
- }

tokenizer/vocab.json DELETED
The diff for this file is too large to render. See raw diff
 
tokenizer_2/merges.txt DELETED
The diff for this file is too large to render. See raw diff
 
tokenizer_2/special_tokens_map.json DELETED
@@ -1,24 +0,0 @@
- {
- "bos_token": {
- "content": "<|startoftext|>",
- "lstrip": false,
- "normalized": true,
- "rstrip": false,
- "single_word": false
- },
- "eos_token": {
- "content": "<|endoftext|>",
- "lstrip": false,
- "normalized": true,
- "rstrip": false,
- "single_word": false
- },
- "pad_token": "!",
- "unk_token": {
- "content": "<|endoftext|>",
- "lstrip": false,
- "normalized": true,
- "rstrip": false,
- "single_word": false
- }
- }

tokenizer_2/tokenizer_config.json DELETED
@@ -1,33 +0,0 @@
- {
- "add_prefix_space": false,
- "bos_token": {
- "__type": "AddedToken",
- "content": "<|startoftext|>",
- "lstrip": false,
- "normalized": true,
- "rstrip": false,
- "single_word": false
- },
- "clean_up_tokenization_spaces": true,
- "do_lower_case": true,
- "eos_token": {
- "__type": "AddedToken",
- "content": "<|endoftext|>",
- "lstrip": false,
- "normalized": true,
- "rstrip": false,
- "single_word": false
- },
- "errors": "replace",
- "model_max_length": 77,
- "pad_token": "!",
- "tokenizer_class": "CLIPTokenizer",
- "unk_token": {
- "__type": "AddedToken",
- "content": "<|endoftext|>",
- "lstrip": false,
- "normalized": true,
- "rstrip": false,
- "single_word": false
- }
- }

tokenizer_2/vocab.json DELETED
The diff for this file is too large to render. See raw diff
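Both removed tokenizer folders are standard CLIP BPE tokenizers with `model_max_length` 77; the only notable difference is the pad token (`<|endoftext|>` for `tokenizer`, `!` for `tokenizer_2`). A minimal usage sketch, with the repo id assumed for illustration and under the assumption that a revision still ships these folders:

```
# Minimal sketch; repo id is an assumption for illustration.
from transformers import CLIPTokenizer

repo = "yisol/IDM-VTON"
tokenizer = CLIPTokenizer.from_pretrained(repo, subfolder="tokenizer")
tokenizer_2 = CLIPTokenizer.from_pretrained(repo, subfolder="tokenizer_2")

# Illustrative garment caption; both tokenizers pad/truncate to 77 tokens.
ids = tokenizer(
    "a photo of a red knit sweater",
    padding="max_length",
    max_length=tokenizer.model_max_length,
    truncation=True,
)
```
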
 
unet/config.json DELETED
@@ -1,78 +0,0 @@
- {
- "_class_name": "UNet2DConditionModel",
- "_diffusers_version": "0.25.0",
- "_name_or_path": "valhalla/sdxl-inpaint-ema",
- "act_fn": "silu",
- "addition_embed_type": "text_time",
- "addition_embed_type_num_heads": 64,
- "addition_time_embed_dim": 256,
- "attention_head_dim": [
- 5,
- 10,
- 20
- ],
- "attention_type": "default",
- "block_out_channels": [
- 320,
- 640,
- 1280
- ],
- "center_input_sample": false,
- "class_embed_type": null,
- "class_embeddings_concat": false,
- "conv_in_kernel": 3,
- "conv_out_kernel": 3,
- "cross_attention_dim": 2048,
- "cross_attention_norm": null,
- "decay": 0.9999,
- "down_block_types": [
- "DownBlock2D",
- "CrossAttnDownBlock2D",
- "CrossAttnDownBlock2D"
- ],
- "downsample_padding": 1,
- "dual_cross_attention": false,
- "encoder_hid_dim": 1280,
- "encoder_hid_dim_type": "ip_image_proj",
- "flip_sin_to_cos": true,
- "freq_shift": 0,
- "in_channels": 13,
- "inv_gamma": 1.0,
- "layers_per_block": 2,
- "mid_block_only_cross_attention": null,
- "mid_block_scale_factor": 1,
- "mid_block_type": "UNetMidBlock2DCrossAttn",
- "min_decay": 0.0,
- "norm_eps": 1e-05,
- "norm_num_groups": 32,
- "num_attention_heads": null,
- "num_class_embeds": null,
- "only_cross_attention": false,
- "optimization_step": 37000,
- "out_channels": 4,
- "power": 0.6666666666666666,
- "projection_class_embeddings_input_dim": 2816,
- "resnet_out_scale_factor": 1.0,
- "resnet_skip_time_act": false,
- "resnet_time_scale_shift": "default",
- "sample_size": 128,
- "time_cond_proj_dim": null,
- "time_embedding_act_fn": null,
- "time_embedding_dim": null,
- "time_embedding_type": "positional",
- "timestep_post_act": null,
- "transformer_layers_per_block": [
- 1,
- 2,
- 10
- ],
- "up_block_types": [
- "CrossAttnUpBlock2D",
- "CrossAttnUpBlock2D",
- "UpBlock2D"
- ],
- "upcast_attention": null,
- "update_after_step": 0,
- "use_ema_warmup": false,
- "use_linear_projection": true
- }

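The deleted `unet/config.json` describes the try-on UNet: an SDXL-style inpainting UNet (recorded `_name_or_path` of `valhalla/sdxl-inpaint-ema`) with 13 input channels and an IP-Adapter-style image projection (`encoder_hid_dim_type: "ip_image_proj"`). Below is a hedged sketch of how a 13-channel input of this kind is typically assembled; the exact channel split (noisy latent, mask, masked-image latent, pose latent) is an assumption based on the config and the inpainting setup, not something stated in this diff:

```
# Hedged sketch: assembling a 13-channel conditioning input for an
# inpainting-style UNet. The channel split is an assumption for illustration.
import torch

noisy_latent  = torch.randn(1, 4, 128, 128)  # latent being denoised
mask          = torch.randn(1, 1, 128, 128)  # downscaled inpaint mask
masked_latent = torch.randn(1, 4, 128, 128)  # VAE latent of the masked person image
pose_latent   = torch.randn(1, 4, 128, 128)  # extra conditioning latent (e.g. densepose)

unet_input = torch.cat([noisy_latent, mask, masked_latent, pose_latent], dim=1)
print(unet_input.shape)  # torch.Size([1, 13, 128, 128]) -> matches "in_channels": 13
```
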
unet/diffusion_pytorch_model.bin DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:046b775cb9bbc67635fc3b148bb03bfe00496ce2f9ce8488a82fdb388669a521
- size 11965769774

unet_encoder/config.json DELETED
@@ -1,68 +0,0 @@
- {
- "_class_name": "UNet2DConditionModel",
- "_diffusers_version": "0.19.0.dev0",
- "act_fn": "silu",
- "addition_embed_type_num_heads": 64,
- "addition_time_embed_dim": 256,
- "attention_head_dim": [
- 5,
- 10,
- 20
- ],
- "block_out_channels": [
- 320,
- 640,
- 1280
- ],
- "center_input_sample": false,
- "class_embed_type": null,
- "class_embeddings_concat": false,
- "conv_in_kernel": 3,
- "conv_out_kernel": 3,
- "cross_attention_dim": 2048,
- "cross_attention_norm": null,
- "down_block_types": [
- "DownBlock2D",
- "CrossAttnDownBlock2D",
- "CrossAttnDownBlock2D"
- ],
- "downsample_padding": 1,
- "dual_cross_attention": false,
- "encoder_hid_dim": null,
- "encoder_hid_dim_type": null,
- "flip_sin_to_cos": true,
- "freq_shift": 0,
- "in_channels": 4,
- "layers_per_block": 2,
- "mid_block_only_cross_attention": null,
- "mid_block_scale_factor": 1,
- "mid_block_type": "UNetMidBlock2DCrossAttn",
- "norm_eps": 1e-05,
- "norm_num_groups": 32,
- "num_attention_heads": null,
- "num_class_embeds": null,
- "only_cross_attention": false,
- "out_channels": 4,
- "projection_class_embeddings_input_dim": 2816,
- "resnet_out_scale_factor": 1.0,
- "resnet_skip_time_act": false,
- "resnet_time_scale_shift": "default",
- "sample_size": 128,
- "time_cond_proj_dim": null,
- "time_embedding_act_fn": null,
- "time_embedding_dim": null,
- "time_embedding_type": "positional",
- "timestep_post_act": null,
- "transformer_layers_per_block": [
- 1,
- 2,
- 10
- ],
- "up_block_types": [
- "CrossAttnUpBlock2D",
- "CrossAttnUpBlock2D",
- "UpBlock2D"
- ],
- "upcast_attention": null,
- "use_linear_projection": true
- }

unet_encoder/diffusion_pytorch_model.safetensors DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:357650fbfb3c7b4d94c1f5fd7664da819ad1ff5a839430484b4ec422d03f710a
- size 10270077736

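In contrast to the try-on UNet, the removed `unet_encoder` config is a plain 4-channel SDXL-style UNet with no extra conditioning channels and no image-projection head, consistent with its use as a garment/reference feature encoder. A minimal sketch for comparing the two configs without instantiating the models; the repo id, and the assumption that some revision still ships these folders, are illustrative:

```
# Minimal sketch; repo id is an assumption. This reads the small config files only,
# it does not download or instantiate the multi-GB weights.
from diffusers import UNet2DConditionModel

tryon_cfg   = UNet2DConditionModel.load_config("yisol/IDM-VTON", subfolder="unet")
garment_cfg = UNet2DConditionModel.load_config("yisol/IDM-VTON", subfolder="unet_encoder")

print(tryon_cfg["in_channels"], tryon_cfg["encoder_hid_dim_type"])      # 13, "ip_image_proj"
print(garment_cfg["in_channels"], garment_cfg["encoder_hid_dim_type"])  # 4, None
```
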
util/common.py DELETED
@@ -1,8 +0,0 @@
- import platform, os
-
- def open_folder():
-     open_folder_path = os.path.abspath("outputs")
-     if platform.system() == "Windows":
-         os.startfile(open_folder_path)
-     elif platform.system() == "Linux":
-         os.system(f'xdg-open "{open_folder_path}"')

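The removed helper only covered Windows and Linux. A hedged sketch of a cross-platform replacement; the macOS branch and the switch to `subprocess` are additions for illustration, not part of the original file:

```
# Hedged sketch of a cross-platform variant of the removed open_folder helper.
import os
import platform
import subprocess

def open_folder(path: str = "outputs") -> None:
    folder = os.path.abspath(path)
    system = platform.system()
    if system == "Windows":
        os.startfile(folder)                               # Windows Explorer
    elif system == "Darwin":
        subprocess.run(["open", folder], check=False)      # macOS Finder (added branch)
    else:
        subprocess.run(["xdg-open", folder], check=False)  # Linux file manager
```
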
util/image.py DELETED
@@ -1,37 +0,0 @@
- import os
-
- import numpy as np
- from PIL import Image
-
- def save_output_image(image, base_path="outputs", base_filename="inputimage", seed=0):
-     """Save an image with a unique filename in the specified directory."""
-     if not os.path.exists(base_path):
-         os.makedirs(base_path)
-
-     # Check for existing files and create a new filename
-     index = 0
-     while True:
-         if index == 0:
-             filename = f"{base_filename}_seed_{seed}.png"
-         else:
-             filename = f"{base_filename}_{str(index).zfill(4)}_seed_{seed}.png"
-
-         file_path = os.path.join(base_path, filename)
-         if not os.path.exists(file_path):
-             image.save(file_path)
-             break
-         index += 1
-     return file_path
-
- def pil_to_binary_mask(pil_image, threshold=0):
-     np_image = np.array(pil_image)
-     grayscale_image = Image.fromarray(np_image).convert("L")
-     binary_mask = np.array(grayscale_image) > threshold
-     mask = np.zeros(binary_mask.shape, dtype=np.uint8)
-     for i in range(binary_mask.shape[0]):
-         for j in range(binary_mask.shape[1]):
-             if binary_mask[i,j] == True :
-                 mask[i,j] = 1
-     mask = (mask*255).astype(np.uint8)
-     output_mask = Image.fromarray(mask)
-     return output_mask

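In the removed `pil_to_binary_mask`, the thresholded mask is built with a per-pixel Python loop. A hedged sketch of a vectorized equivalent that produces the same 0/255 mask:

```
# Hedged sketch: a vectorized equivalent of the removed pil_to_binary_mask,
# replacing the per-pixel loop with a single NumPy comparison.
import numpy as np
from PIL import Image

def pil_to_binary_mask(pil_image: Image.Image, threshold: int = 0) -> Image.Image:
    grayscale = np.array(pil_image.convert("L"))          # same grayscale conversion
    mask = (grayscale > threshold).astype(np.uint8) * 255  # 0 or 255 per pixel
    return Image.fromarray(mask)
```
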
util/pipeline.py DELETED
@@ -1,88 +0,0 @@
- import torch
- import gc
- from torch import nn
- from accelerate.hooks import AlignDevicesHook, CpuOffload, remove_hook_from_module
- import bitsandbytes as bnb
-
- def torch_gc():
-
-     if torch.cuda.is_available():
-         with torch.cuda.device('cuda'):
-             torch.cuda.empty_cache()
-             torch.cuda.ipc_collect()
-
-     gc.collect()
-
- def restart_cpu_offload(pipe, load_mode):
-     #if load_mode != '4bit' :
-     #    pipe.disable_xformers_memory_efficient_attention()
-     optionally_disable_offloading(pipe)
-     gc.collect()
-     torch.cuda.empty_cache()
-     pipe.enable_model_cpu_offload()
-     #if load_mode != '4bit' :
-     #    pipe.enable_xformers_memory_efficient_attention()
-
- def optionally_disable_offloading(_pipeline):
-
-     """
-     Optionally removes offloading in case the pipeline has been already sequentially offloaded to CPU.
-
-     Args:
-         _pipeline (`DiffusionPipeline`):
-             The pipeline to disable offloading for.
-
-     Returns:
-         tuple:
-             A tuple indicating if `is_model_cpu_offload` or `is_sequential_cpu_offload` is True.
-     """
-     is_model_cpu_offload = False
-     is_sequential_cpu_offload = False
-     print(
-         fr"Restarting CPU Offloading for {_pipeline.unet_name}..."
-     )
-     if _pipeline is not None:
-         for _, component in _pipeline.components.items():
-             if isinstance(component, nn.Module) and hasattr(component, "_hf_hook"):
-                 if not is_model_cpu_offload:
-                     is_model_cpu_offload = isinstance(component._hf_hook, CpuOffload)
-                 if not is_sequential_cpu_offload:
-                     is_sequential_cpu_offload = isinstance(component._hf_hook, AlignDevicesHook)
-
-
-                 remove_hook_from_module(component, recurse=True)
-
-     return (is_model_cpu_offload, is_sequential_cpu_offload)
-
- def quantize_4bit(module):
-     for name, child in module.named_children():
-         if isinstance(child, torch.nn.Linear):
-             in_features = child.in_features
-             out_features = child.out_features
-             device = child.weight.data.device
-
-             # Create and configure the Linear layer
-             has_bias = True if child.bias is not None else False
-
-             # TODO: Make that configurable
-             # fp16 for compute dtype leads to faster inference
-             # and one should almost always use nf4 as a rule of thumb
-             bnb_4bit_compute_dtype = torch.float16
-             quant_type = "nf4"
-
-             new_layer = bnb.nn.Linear4bit(
-                 in_features,
-                 out_features,
-                 bias=has_bias,
-                 compute_dtype=bnb_4bit_compute_dtype,
-                 quant_type=quant_type,
-             )
-
-             new_layer.load_state_dict(child.state_dict())
-             new_layer = new_layer.to(device)
-
-             # Set the attribute
-             setattr(module, name, new_layer)
-         else:
-             # Recursively apply to child modules
-             quantize_4bit(child)

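For reference, the swap that the removed `quantize_4bit` performs on every `nn.Linear` can be reproduced in isolation. A minimal, self-contained sketch of that single-layer replacement (nf4 weights, float16 compute), shown on a toy layer rather than the project's UNets:

```
# Minimal sketch mirroring the removed helper's per-layer swap; requires a CUDA GPU,
# since bitsandbytes quantizes the weights when they are moved to the device.
import torch
from torch import nn
import bitsandbytes as bnb

linear = nn.Linear(1024, 1024)

quantized = bnb.nn.Linear4bit(
    linear.in_features,
    linear.out_features,
    bias=linear.bias is not None,
    compute_dtype=torch.float16,   # fp16 compute, as in the deleted file
    quant_type="nf4",              # nf4 weight quantization, as in the deleted file
)
quantized.load_state_dict(linear.state_dict())
quantized = quantized.to("cuda")   # weights are quantized on transfer

x = torch.randn(1, 1024, device="cuda", dtype=torch.float16)
y = quantized(x)                   # 4-bit weight matmul with fp16 compute
```
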
utils_mask.py CHANGED
@@ -164,4 +164,4 @@ def get_mask_location(model_type, category, model_parse: Image.Image, keypoint:
     mask = Image.fromarray(inpaint_mask.astype(np.uint8) * 255)
     mask_gray = Image.fromarray(inpaint_mask.astype(np.uint8) * 127)
 
-    return mask, mask_gray
+    return mask, mask_gray
vae/config.json DELETED
@@ -1,32 +0,0 @@
- {
- "_class_name": "AutoencoderKL",
- "_diffusers_version": "0.21.0.dev0",
- "_name_or_path": "madebyollin/sdxl-vae-fp16-fix",
- "act_fn": "silu",
- "block_out_channels": [
- 128,
- 256,
- 512,
- 512
- ],
- "down_block_types": [
- "DownEncoderBlock2D",
- "DownEncoderBlock2D",
- "DownEncoderBlock2D",
- "DownEncoderBlock2D"
- ],
- "force_upcast": false,
- "in_channels": 3,
- "latent_channels": 4,
- "layers_per_block": 2,
- "norm_num_groups": 32,
- "out_channels": 3,
- "sample_size": 512,
- "scaling_factor": 0.13025,
- "up_block_types": [
- "UpDecoderBlock2D",
- "UpDecoderBlock2D",
- "UpDecoderBlock2D",
- "UpDecoderBlock2D"
- ]
- }

vae/diffusion_pytorch_model.safetensors DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:98a14dc6fe8d71c83576f135a87c61a16561c9c080abba418d2cc976ee034f88
- size 334643268

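The deleted VAE config matches `madebyollin/sdxl-vae-fp16-fix` (`scaling_factor` 0.13025, `force_upcast` false), so an equivalent autoencoder can be pulled from that public repo directly. A minimal sketch; attaching it to a pipeline is left out because that depends on how the rest of the checkpoint is loaded:

```
# Minimal sketch: load the fp16-safe SDXL VAE that the removed config points to.
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained(
    "madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16
)
print(vae.config.scaling_factor)  # 0.13025
print(vae.config.force_upcast)    # False
```
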
vitonhd_test_tagged.json DELETED
The diff for this file is too large to render. See raw diff