TryOnDiffusion: The Most Powerful Model for Generating Try-On Images
3 main points
✔️ Virtual try-on is the generation of an image of a person wearing a garment based on an image of the person and the garment
✔️ Virtual try-on can improve the online shopping experience, but traditional methods are effective only when body poses and shapes do not change much
✔️ TryOnDiffusion achieves state-of-the-art performance by utilizing parallel UNets to accommodate a wide variety of poses and body shapes without distorting the patterns and textures of the garment
TryOnDiffusion: A Tale of Two UNets
written by Luyang Zhu, Dawei Yang, Tyler Zhu, Fitsum Reda, William Chan, Chitwan Saharia, Mohammad Norouzi, Ira Kemelmacher-Shlizerman
(Submitted on 14 Jun 2023)
Comments: Published on arxiv.
Subjects: Computer Vision and Pattern Recognition (cs.CV)
code:
The images used in this article are from the paper, the introductory slides, or were created based on them.
Introduction
Virtual try-on takes an image of a person and an image of a garment, with the goal of visualizing how the garment would look on that person. Virtual try-on can enhance the online shopping experience, but most traditional try-on methods work well only when changes in body pose and shape are small. The main challenge is to deform the garment non-rigidly to fit the target body shape without distorting the garment's patterns or textures.
This article introduces TryOnDiffusion, which handles heavy occlusions, large pose changes, and body shape changes while preserving clothing details at a resolution of $1024×1024$. The method takes as input two images, one of the target person and one of a garment worn by another person, and produces as output an image of the target person wearing that garment.
To produce high-quality images at a high resolution of $1024×1024$, TryOnDiffusion employs a cascaded diffusion model. Specifically, it first applies Parallel-UNet-based diffusion at resolutions of $128×128$ and $256×256$. The $256×256$ result is then fed into a super-resolution diffusion model to produce the final $1024×1024$ image.
TryOnDiffusion was found to be significantly superior to other state-of-the-art methods, both quantitatively and qualitatively. In particular, in human evaluation experiments, TryOnDiffusion was rated the best among three recent state-of-the-art techniques 92.72% of the time.
Proposed Method
Input Preprocessing
The input to the entire model consists of four components in addition to the person and garment images, for a total of six components, as shown in the upper left portion of Figure 1. First, 2D pose keypoints are predicted by a learned model for both the person and garment images. Next, the garment image is segmented into a garment-only image. For the person image, a clothing-agnostic RGB image is generated that preserves the person's features while removing the original clothing.
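As a rough illustration of this preprocessing, a minimal sketch is shown below. The `pose_model`, `seg_model`, `clothing_mask_model`, and the container class are hypothetical placeholders for illustration, not the pipeline used in the paper.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class TryOnInputs:
    """Six conditioning components assembled during preprocessing (hypothetical container)."""
    person_image: np.ndarray           # original target-person image
    garment_image: np.ndarray          # image of the garment worn by another person
    person_pose: np.ndarray            # 2D pose keypoints predicted for the person image
    garment_pose: np.ndarray           # 2D pose keypoints predicted for the garment image
    segmented_garment: np.ndarray      # garment-only image obtained by segmentation
    clothing_agnostic_rgb: np.ndarray  # person image with the original clothing removed

def preprocess(person_image, garment_image, pose_model, seg_model, clothing_mask_model):
    # Predict 2D pose keypoints for both images (pose_model is a placeholder callable).
    person_pose = pose_model(person_image)
    garment_pose = pose_model(garment_image)
    # Segment the garment image down to the garment itself.
    segmented_garment = seg_model(garment_image)
    # Crude stand-in for the clothing-agnostic RGB step: mask out the original clothing.
    clothing_mask = clothing_mask_model(person_image)          # 1 where the original clothing is
    clothing_agnostic_rgb = person_image * (1 - clothing_mask)
    return TryOnInputs(person_image, garment_image, person_pose,
                       garment_pose, segmented_garment, clothing_agnostic_rgb)
```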
Try-On Cascaded Diffusion Model
The cascaded diffusion model of the proposed method consists of one base diffusion model and two super-resolution (SR) diffusion models, as shown in the upper part of Figure 1. The base diffusion model is parameterized as a $128×128$ Parallel-UNet (bottom part of Figure 1) and generates the try-on image from the inputs described above.
The $128×128$ to $256×256$ super-resolution diffusion model is parameterized as a $256×256$ Parallel-UNet. In addition to the inputs described above, this model also takes the $128×128$ try-on image produced by the base model.
The super-resolution diffusion model from $256×256$ to $1024×1024$ is parameterized as the Efficient-UNet proposed by Saharia et al. It is a pure super-resolution model and is not conditioned on the six input components described above.
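The three-stage cascade can be summarized in the following sketch; the model objects and their `sample` interface are assumptions for illustration, not the authors' implementation.

```python
def cascaded_try_on(inputs, base_parallel_unet_128, sr_parallel_unet_256, sr_efficient_unet_1024):
    """Run the three-stage cascade sketched in Figure 1 (hypothetical interfaces)."""
    # Stage 1: base Parallel-UNet diffusion generates a 128x128 try-on image,
    # conditioned on all six preprocessed inputs.
    tryon_128 = base_parallel_unet_128.sample(conditioning=inputs, resolution=128)

    # Stage 2: 256x256 Parallel-UNet diffusion, conditioned on the same inputs
    # plus the 128x128 try-on result.
    tryon_256 = sr_parallel_unet_256.sample(conditioning=inputs, low_res=tryon_128, resolution=256)

    # Stage 3: pure super-resolution with Efficient-UNet; no try-on-specific conditioning.
    tryon_1024 = sr_efficient_unet_1024.sample(low_res=tryon_256, resolution=1024)
    return tryon_1024
```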
Parallel-UNet
Parallel-UNet consists of person-UNet and garment-UNet, as shown in the lower part of Figure 1. person-UNet generates the try-on image, taking as input the noise-added image together with the clothing-agnostic person image. garment-UNet extracts appropriate features from the segmented garment-only image and feeds them into person-UNet so that the target garment is faithfully rendered in the try-on image.
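A skeleton of how the two UNets could interact in a single denoising step is sketched below; the class name, argument names, and module signatures are assumptions for illustration, not the paper's code.

```python
import torch
import torch.nn as nn

class ParallelUNetStep(nn.Module):
    """One denoising step of Parallel-UNet (illustrative skeleton)."""
    def __init__(self, person_unet: nn.Module, garment_unet: nn.Module):
        super().__init__()
        self.person_unet = person_unet    # denoises the try-on image
        self.garment_unet = garment_unet  # encodes the segmented garment image

    def forward(self, noisy_tryon, clothing_agnostic_rgb, segmented_garment, poses, t):
        # garment-UNet extracts garment features from the garment-only image.
        garment_feats = self.garment_unet(segmented_garment, t)
        # person-UNet consumes the noisy image plus the clothing-agnostic person image
        # and attends to the garment features (and pose embeddings) via cross-attention.
        return self.person_unet(
            torch.cat([noisy_tryon, clothing_agnostic_rgb], dim=1),
            garment_features=garment_feats, poses=poses, timestep=t)
```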
Concatenation (channel-wise concatenation of feature maps) is conventionally used to feed the features extracted by garment-UNet into person-UNet at each stage of the UNet. However, since channel-wise concatenation may not be able to handle complex transformations such as garment warping, this paper uses cross-attention instead, as shown in the following equation:

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d}}\right)V$

The calculation is the same as in the usual attention mechanism, but here $Q$ is the flattened feature of the person image and $K$ and $V$ are the flattened features of the garment. To ensure that the person's pose and the garment's pattern are correctly reflected in the try-on image, the 2D pose keypoints of the person and garment images produced in the preprocessing step are linearly embedded and fed into person-UNet and garment-UNet through similar cross-attention.
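A minimal PyTorch sketch of this cross-attention, with person features as queries and garment features as keys and values, might look as follows; the module layout and tensor shapes are illustrative assumptions.

```python
import math
import torch
import torch.nn as nn

class GarmentCrossAttention(nn.Module):
    """Cross-attention from person features (queries) to garment features (keys/values)."""
    def __init__(self, dim: int):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)

    def forward(self, person_feats: torch.Tensor, garment_feats: torch.Tensor) -> torch.Tensor:
        # person_feats, garment_feats: (batch, tokens, dim), i.e. flattened spatial features.
        q = self.to_q(person_feats)
        k = self.to_k(garment_feats)
        v = self.to_v(garment_feats)
        # Standard scaled dot-product attention: softmax(QK^T / sqrt(d)) V.
        attn = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1]), dim=-1)
        return attn @ v

# Usage sketch: flattened 32x32 feature maps with 128 channels (illustrative sizes).
attn = GarmentCrossAttention(dim=128)
person_tokens = torch.randn(2, 32 * 32, 128)
garment_tokens = torch.randn(2, 32 * 32, 128)
fused = attn(person_tokens, garment_tokens)  # (2, 1024, 128)
```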
Experiment
Dataset
Four million samples were collected as a paired training dataset. Each sample consisted of two images of the same person wearing the same clothing in different poses. For testing, 6,000 samples that were never seen during training were collected. Each test sample consisted of two images of two different people in different poses wearing different clothing. Both training and test images were cropped and resized to $1024×1024$ based on detected 2D human body poses. The dataset includes men and women in different poses, with different body shapes and skin tones, wearing a variety of clothing with varying texture patterns. The VITON-HD dataset is also used for evaluation.
Comparison with Previous Studies
In this experiment, the proposed method is compared with TryOnGAN, SDAFN, and HR-VITON, which are representative models for virtual try-on image generation. Table 1 shows the comparison results in terms of FID and KID, which measure image quality and naturalness. The proposed method outperforms previous studies on all evaluation metrics on both datasets.
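For reference, FID and KID can be computed with an off-the-shelf library such as torchmetrics (the paper does not state which implementation it used); a minimal sketch with dummy tensors:

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.kid import KernelInceptionDistance

# Images are expected as uint8 tensors of shape (N, 3, H, W) by default.
real_imgs = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)       # ground-truth photos
generated_imgs = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)  # model outputs

fid = FrechetInceptionDistance(feature=2048)
kid = KernelInceptionDistance(subset_size=8)  # subset_size must not exceed the sample count

for metric in (fid, kid):
    metric.update(real_imgs, real=True)
    metric.update(generated_imgs, real=False)

print("FID:", fid.compute().item())
kid_mean, kid_std = kid.compute()
print("KID:", kid_mean.item(), "+/-", kid_std.item())
```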
The qualitative evaluation in Figure 2 shows similar results, indicating that the proposed method produces images that are clearly more natural and sharper than those of previous studies. In particular, it was confirmed that the person's pose and the pattern of their clothing were well reproduced.
In addition to the qualitative and quantitative assessments, two user studies were conducted; the results are presented in Table 2. In the first study, "Random," 2,804 input pairs were selected from the 6,000-sample test set and evaluators chose the best result. In the second study, "Challenging," 2,000 input pairs with more difficult poses were selected and evaluated using the same procedure. In the latter, the proposed method was selected as the best 95.8% of the time.
Comparison of Cross-Attention and Concatenation
As mentioned earlier, cross-attention is used instead of conventional concatenation when feeding the features extracted by garment-UNet into person-UNet at each stage of the UNet. In this section, the effectiveness of cross-attention is evaluated. Figure 3 shows the comparison results, confirming that cross-attention is superior at retaining clothing details even under large changes in body pose and shape.
Summary
In this article, we introduced TryOnDiffusion, which synthesizes a try-on image from an image of a person and an image of a garment. The method utilizes Parallel-UNet and achieves state-of-the-art performance, handling a wide variety of poses and body shapes without distorting the patterns and textures of the garments. However, handling of complex backgrounds and full-body images has not yet been studied. Future research is expected to overcome these limitations, and further development, especially extension to video, is anticipated.