Virtual Try-on Is Almost A Reality! Generative Modeling Frontline!
3 main points
✔️ The University of Hong Kong and Tencent collaborate to develop a new virtual try-on technology
✔️ Proposes a parser-free "teacher-tutor-student" model that requires no segmentation information
✔️ Achieves SoTA performance on multiple datasets
Parser-Free Virtual Try-on via Distilling Appearance Flows
written by Yuying Ge, Yibing Song, Ruimao Zhang, Chongjian Ge, Wei Liu, Ping Luo
(Submitted on 8 Mar 2021 (v1), last revised 9 Mar 2021 (this version, v2))
Comments: Accepted by CVPR2021
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Introduction
Virtual try-on is the task of fitting a garment image onto an image of a human body. It has attracted many researchers because of its potential applications in e-commerce and fashion image editing. Most SoTA methods, such as VTON, CP-VTON, VTNFP, ClothFlow, ACGPN, and CP-VTON+, require segmentation information for the different parts of the body, such as the upper body. However, even a small segmentation error can produce a highly unrealistic try-on image, as shown in the image above, so these models depend on highly accurate parsing (segmentation).
To reduce this dependence on segmentation information, a parser-free network called WUTON was proposed. WUTON distills a parser-based model, treated as a "teacher" network, into a parser-free "student" network that generates the try-on images. However, although WUTON does not take segmentation information as input, it is ultimately trained to imitate the parser-based "teacher", so its accuracy is bounded by that of the parser-based model.
To address these challenges, this paper proposes the Parser-Free Appearance Flow Network (PF-AFN).
Let's review a few related terms.
Virtual try-on: Existing deep-learning-based methods for virtual try-on can be broadly classified into 3D-model-based and 2D-image-based approaches. The 2D-image-based approaches are more widely used because the 3D-model-based ones require additional 3D measurements and computing resources. Most existing 2D-image-based studies mask the clothing region of the human image and reconstruct the image according to the target clothing image, which requires highly accurate parsing. Recently, WUTON proposed a parser-free method, but it still depends on the performance of a parser-based model.
Appearance Flow: An appearance flow is a field of 2D coordinate vectors indicating which pixels in the source image should be used to synthesize the target image. It has been used for visual tracking, image restoration, and super-resolution of face images.
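To make the definition concrete, here is a minimal sketch of warping a source image with an appearance flow via bilinear sampling (a generic illustration with a single-channel image, not the paper's implementation; function and variable names are illustrative):

```python
import numpy as np

def warp_by_flow(source, flow):
    """Warp a source image with an appearance flow.

    source: (H, W) grayscale image (illustrative; RGB works per channel).
    flow:   (H, W, 2) array; flow[y, x] = (sx, sy) gives, for each target
            pixel, the source coordinates to sample from (bilinear).
    """
    H, W = source.shape
    sx = np.clip(flow[..., 0], 0, W - 1)
    sy = np.clip(flow[..., 1], 0, H - 1)
    x0 = np.floor(sx).astype(int); x1 = np.clip(x0 + 1, 0, W - 1)
    y0 = np.floor(sy).astype(int); y1 = np.clip(y0 + 1, 0, H - 1)
    wx = sx - x0   # horizontal interpolation weight
    wy = sy - y0   # vertical interpolation weight
    top = source[y0, x0] * (1 - wx) + source[y0, x1] * wx
    bot = source[y1, x0] * (1 - wx) + source[y1, x1] * wx
    return top * (1 - wy) + bot * wy

# Identity flow: each target pixel samples its own source coordinate,
# so warping leaves the image unchanged.
H, W = 4, 5
ys, xs = np.mgrid[0:H, 0:W]
identity_flow = np.stack([xs, ys], axis=-1).astype(float)
img = np.arange(H * W, dtype=float).reshape(H, W)
assert np.allclose(warp_by_flow(img, identity_flow), img)
```

In practice such sampling is done on the GPU (e.g. with a grid-sampling operator), but the per-pixel "which source pixel do I copy from" semantics are the same.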
Knowledge Distillation: Knowledge distillation was originally introduced for model compression: information specific to a "teacher" network is leveraged to train a "student" network. More recently, it has been shown that knowledge distillation can also transfer knowledge between different tasks, so that knowledge learned by multiple models can be consolidated into a single model.
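The classic form of the idea is the Hinton-style distillation loss, where the student matches the teacher's softened output distribution. This is a generic sketch of that objective (not the flow distillation PF-AFN uses later; names and the temperature value are illustrative):

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T  # temperature T softens the distribution
    z -= z.max()                        # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL divergence between the teacher's and student's softened
    distributions; zero iff the student matches the teacher."""
    p = softmax(teacher_logits, T)      # soft targets from the teacher
    q = softmax(student_logits, T)
    return float(np.sum(p * (np.log(p) - np.log(q))))

logits = [2.0, 0.5, -1.0]
assert abs(distillation_loss(logits, logits)) < 1e-12  # perfect match
assert distillation_loss([0.0, 0.0, 3.0], logits) > 0  # mismatch penalized
```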
In this paper, a parser-free model, the Parser-Free Appearance Flow Network (PF-AFN), which requires no segmentation information, is proposed. Unlike previous models such as WUTON, it is the first to distill knowledge through a three-stage "teacher-tutor-student" structure. The figure above illustrates the difference between PF-AFN and WUTON.
As can be seen from the figure above, the method comprises two networks: PB-AFN, a parser-based network, and PF-AFN, a parser-free network. In training, PB-AFN is first trained, as in existing methods, on an image of a garment and an image of a person wearing that garment; the preserved regions of the person (such as the face, hair, and lower-body clothes), the body segmentation result, and the pose estimation result are concatenated as its person representation. By concatenating the warped clothing image with the preserved parts of the human image and the pose estimation, the generation module can be trained to synthesize a try-on image, supervised by the ground-truth photo.
Next, after training PB-AFN, different clothing images are randomly selected and images of the same person wearing those clothes are generated. This parser-based model is treated as a "tutor" network, and the fake images it generates as "tutor knowledge". In PF-AFN, a warping module predicts the appearance flow between the tutor image and the garment image, and a generation module synthesizes the try-on image from the tutor image and the warped garment. The real image is treated as "teacher knowledge": it corrects the student's mistakes and lets the student learn to reproduce the original real image appropriately. In addition, the tutor network, PB-AFN, distills its appearance-flow knowledge to the student network, PF-AFN.
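The data flow of one such training step can be sketched with stand-in functions. The network internals below are dummies (simple averaging) chosen only so the code runs; what follows the paper is the wiring: the tutor consumes parsing-based person features plus a random garment, the student consumes the tutor's fake image plus the original garment, and the real photo supervises the student as teacher knowledge:

```python
import numpy as np

rng = np.random.default_rng(0)

def tutor_pbafn(person_features, random_garment):
    # Parser-based tutor: person representation + a *different* garment
    # -> fake try-on image ("tutor knowledge"). Dummy stand-in network.
    return (person_features + random_garment) / 2

def student_pfafn(fake_image, original_garment):
    # Parser-free student: fake image + the person's *original* garment
    # -> reconstruction of the real photo. Dummy stand-in network.
    return (fake_image + original_garment) / 2

real_photo       = rng.random((8, 8))  # person wearing original_garment
person_features  = rng.random((8, 8))  # segmentation/pose inputs (tutor only)
original_garment = rng.random((8, 8))
random_garment   = rng.random((8, 8))  # randomly drawn different garment

fake = tutor_pbafn(person_features, random_garment)  # tutor knowledge
pred = student_pfafn(fake, original_garment)         # student output
l1 = np.abs(pred - real_photo).mean()                # supervised by the
                                                     # real photo (teacher)
assert pred.shape == real_photo.shape and l1 >= 0
```

Note that the student never sees the segmentation inputs, which is exactly what makes it parser-free at inference time.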
Appearance Flow Warping Module (AFWM)
Both PB-AFN and PF-AFN include a warping module, which predicts the correspondence between the garment image and the person image in order to warp the garment. As shown in the previous figure, the output of the warping module is an appearance flow, i.e., a field of 2D coordinate vectors. The warping module consists of two pyramid feature extraction networks (PFEN) and a progressive appearance flow estimation network (AFEN). At each pyramid level, AFEN generates appearance flows, which are refined at the next level. The parser-based warping module (PB-AFWM) and the parser-free warping module (PF-AFWM) have exactly the same architecture; only their inputs differ.
Pyramid Feature Extraction Network (PFEN)
As shown in (b) in the previous figure, PFEN consists of two feature pyramid networks (FPNs) and extracts two branches of pyramid features over N levels. The input to the parser-based warping module is the clothing image and the human features, while the input to the parser-free warping module is the clothing image and the generated fake image.
Appearance Flow Estimation Network (AFEN)
AFEN consists of N Flow Networks (FNs) and estimates appearance flows from the pyramid features at the N levels. The pyramid features extracted at the highest level are first fed to FN-1 to estimate the initial appearance flow. Next, the pyramid features at level N-1 are fed to FN-2, and this process is repeated down to the last level; finally, the target garment is warped according to the last output.
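The coarse-to-fine refinement described above can be sketched as follows: each level contributes a residual flow, and the running estimate is upsampled (with its coordinate values scaled accordingly) before the next level refines it. This is a minimal sketch of the progressive scheme, not the paper's FN architecture; the nearest-neighbour upsampling and level count are illustrative:

```python
import numpy as np

def upsample2x(flow):
    """Nearest-neighbour 2x upsampling of an (H, W, 2) flow field; the
    coordinate values are doubled because the grid resolution doubles."""
    return 2.0 * flow.repeat(2, axis=0).repeat(2, axis=1)

def coarse_to_fine(residuals):
    """Combine per-level residual flows from coarsest (first) to finest
    (last), mimicking progressive refinement across N pyramid levels."""
    flow = residuals[0]
    for r in residuals[1:]:
        flow = upsample2x(flow) + r  # refine the upsampled estimate
    return flow

# Three pyramid levels: 2x2 -> 4x4 -> 8x8.
levels = [np.zeros((2 * 2**i, 2 * 2**i, 2)) for i in range(3)]
levels[0][..., 0] = 1.0                 # coarse estimate: 1 px shift in x
final = coarse_to_fine(levels)
assert final.shape == (8, 8, 2)
assert np.allclose(final[..., 0], 4.0)  # 1 px at 1/4 res -> 4 px at full res
```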
Generative Module (GM)
Both PB-AFN and PF-AFN include a generation module for synthesizing try-on images. The parser-based generation module (PB-GM) takes the warped clothing, the human pose estimation, and the preserved regions of the body as input, while the parser-free generation module (PF-GM) takes the warped clothing and the tutor image as input. Both modules employ Res-UNet, which is built upon the UNet architecture.
In the training phase, the parameters of both the generative module and the warping module (AFWM) are optimized with the following loss.
L_l: pixel-wise L1 loss
L_p: perceptual loss
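The equation itself is an image in the original article and is not reproduced here. Assuming the usual weighted sum of the two terms above, with the perceptual loss computed on pretrained VGG feature maps, it has the form (symbol names here are illustrative, not necessarily the paper's notation):

```latex
\mathcal{L} = \lambda_{l}\,\mathcal{L}_{l} + \lambda_{p}\,\mathcal{L}_{p},
\qquad
\mathcal{L}_{p} = \sum_{m} \bigl\lVert \phi_{m}(\hat{s}) - \phi_{m}(s) \bigr\rVert_{1}
```

where \(\hat{s}\) is the synthesized try-on image, \(s\) the real photo, \(\phi_{m}\) the m-th feature map of a pretrained network, and the \(\lambda\) weights are hyperparameters.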
Adjustable Knowledge Distillation
Besides supervising the parser-free student network, PF-AFN, on real images, this paper also distills the appearance flow between the human and clothing images to help the student find their correspondence. The features extracted by PB-AFN generally capture rich semantic information, and its estimated appearance flows are likely to be more accurate, so they can guide PF-AFN. However, as pointed out earlier, if the parsing results are inaccurate, the parser-based PB-AFN will misguide PF-AFN, corrupting both the semantic information and the estimated appearance flow. To avoid this, a completely new adjustable distillation loss is introduced in this paper, defined as follows.
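The definition is an image in the original article and is not reproduced here. As a hedged sketch of the "adjustable" idea described above, the per-level flow distillation term can be gated on whether the tutor actually produces a better try-on than the student, so a badly parsed tutor does not misguide the student (all symbols below are illustrative, not the paper's notation):

```latex
\mathcal{L}_{\mathrm{adj}}
= \mathbb{1}\!\left[\, d(\hat{s}_{\mathrm{tutor}}, s) < d(\hat{s}_{\mathrm{student}}, s) \,\right]
\cdot \sum_{n=1}^{N} \bigl\lVert \hat{f}^{\,n}_{\mathrm{tutor}} - \hat{f}^{\,n}_{\mathrm{student}} \bigr\rVert_{2}
```

where \(s\) is the real image, \(d\) a distance between a generated image and \(s\), and \(\hat{f}^{\,n}\) the appearance flow estimated at pyramid level \(n\); the indicator switches distillation off whenever the tutor is worse than the student.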
Experiments were performed on the VITON, VITON-HD, and MPV datasets.
As can be seen from the figure above, PF-AFN achieves SoTA performance on both the VITON and MPV datasets.
In this paper, a completely new knowledge-distillation scheme called "teacher-tutor-student" is used to generate high-quality try-on images without parsing. Its interesting feature is that the fake images generated by the parser-based tutor network are used as input to the parser-free student network, which is then supervised by the original real images (teacher knowledge). In addition to using the real image as the teacher, the appearance flow between the human image and the clothing image is also distilled to help find their correspondence. Experimental results show that PF-AFN achieves SoTA on multiple datasets.