
Dress&Dance: Video Diffusion Model For Highly Accurate Virtual Fitting And Motion Generation
3 main points
✔️ Dress&Dance is a method for generating high-resolution try-on + dance videos from a single image and reference video
✔️ Integrates clothing, people, and motion with CondNet to achieve faithful clothing reproduction and natural motion generation
✔️ Experiments show that it outperforms open-source and commercial methods, producing high-quality virtual try-on videos
Dress&Dance: Dress up and Dance as You Like It - Technical Preview
written by Jun-Kun Chen, Aayush Bansal, Minh Phuoc Vo, Yu-Xiong Wang
(Submitted on 28 Aug 2025)
Comments: Project Page: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
The images used in this article are from the paper, the introductory slides, or were created based on them.
Overview
This research proposes a new framework, "Dress&Dance," that virtually tries on clothes selected by the user and generates a video with arbitrary dance movements.
Conventional virtual try-on systems are often limited to still images, so users cannot see how a garment looks when worn in motion or how the fabric naturally sways with movement.
In addition, simply combining existing video generation models leads to noticeable problems, such as garment patterns breaking down and failures to follow pose changes.
Dress&Dance takes a single image of the user, an image of the garment to be tried on, and a reference video showing the action as input, and generates a 5-second, 24FPS, 1152 x 720 high-quality video.
Notably, users can try on both upper and lower garments at the same time, and the system is flexible enough to transfer garments worn by others.
In addition, accessories such as bags and shoes are retained, enabling a realistic and consistent try-on experience.
The system achieves a quality far superior to existing open source and commercial systems, and is expected to have innovative applications in online shopping and entertainment in the future.
Proposed Methodology
The core of the proposed method Dress&Dance is CondNet, a new conditioning network that utilizes attention mechanisms.
CondNet enables the unified processing of heterogeneous inputs such as text, images, and video, and improves garment registration and motion fidelity.
Specifically, the user image, garment image, and motion reference video are each tokenized and fed into the cross-attention of the diffusion model, allowing every pixel of the generated video to attend to all of the inputs.
This design enables natural video generation that follows the body's movements while preserving the details and texture of the garment.
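As a rough illustration of this idea, the sketch below merges heterogeneous condition tokens into a single cross-attention call; the class name, token counts, and dimensions are hypothetical simplifications and are not taken from the paper.

```python
import torch
import torch.nn as nn

class CrossAttentionConditioning(nn.Module):
    """Minimal sketch of attention-based conditioning in the spirit of CondNet.

    Assumption: the user image, garment image, and motion reference video are
    already tokenized into feature sequences; the paper's actual tokenizers,
    architecture, and dimensions may differ.
    """
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, video_tokens, user_tokens, garment_tokens, motion_tokens):
        # Concatenate the heterogeneous condition tokens into one key/value
        # sequence so every generated video token can attend to all inputs.
        cond = torch.cat([user_tokens, garment_tokens, motion_tokens], dim=1)
        attended, _ = self.attn(query=video_tokens, key=cond, value=cond)
        return self.norm(video_tokens + attended)  # residual update of the latent

# Toy shapes: batch of 2, a latent video of 512 tokens, conditions of varying length.
x = torch.randn(2, 512, 768)
out = CrossAttentionConditioning()(
    x,
    torch.randn(2, 64, 768),    # user image tokens
    torch.randn(2, 64, 768),    # garment image tokens
    torch.randn(2, 256, 768),   # motion reference video tokens
)
print(out.shape)  # torch.Size([2, 512, 768])
```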
In addition, a two-stage strategy is employed to improve training efficiency.
First, a "warm-up phase" based on curriculum learning teaches the system to estimate garment placement; this is followed by "incremental learning" that gradually raises the resolution, improving both stability and output quality.
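A minimal sketch of what such a staged schedule could look like is given below; the stage names, resolutions, and step counts are illustrative assumptions, not the paper's actual settings.

```python
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    resolution: tuple  # (height, width)
    steps: int

# Hypothetical two-stage curriculum: a low-resolution warm-up focused on
# garment placement, then progressively higher resolutions for stability.
SCHEDULE = [
    Stage("warmup_garment_placement", (288, 192), 50_000),
    Stage("incremental_mid_res",      (576, 384), 30_000),
    Stage("incremental_full_res",     (1152, 720), 20_000),
]

def run_training(schedule, train_step):
    """Iterate the stages in order, passing each stage's target resolution to the step function."""
    for stage in schedule:
        for step in range(stage.steps):
            train_step(stage.resolution, step)

# Dry run of the warm-up stage with a no-op training step.
run_training(SCHEDULE[:1], lambda resolution, step: None)
```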
Furthermore, video quality is further improved by a dedicated refiner module that upsamples the initial 8FPS output to 24FPS.
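For intuition only, the snippet below performs naive linear frame interpolation from 8FPS to roughly 24FPS; the paper's refiner is a learned module, so this stand-in merely illustrates the temporal upsampling it is responsible for.

```python
import torch
import torch.nn.functional as F

def naive_frame_interpolation(frames: torch.Tensor, factor: int = 3) -> torch.Tensor:
    """Upsample a (T, C, H, W) clip in time by an integer factor via linear blending.

    This is NOT the paper's refiner; it is a toy stand-in for the 8FPS -> 24FPS step.
    """
    t, c, h, w = frames.shape
    # Treat time as the interpolation axis: (T, C, H, W) -> (1, C*H*W, T).
    x = frames.permute(1, 2, 3, 0).reshape(1, c * h * w, t)
    x = F.interpolate(x, size=(t - 1) * factor + 1, mode="linear", align_corners=True)
    # Back to (T', C, H, W).
    return x.reshape(c, h, w, -1).permute(3, 0, 1, 2)

clip_8fps = torch.rand(40, 3, 64, 64)                 # 5 s at 8FPS (toy resolution)
clip_24fps = naive_frame_interpolation(clip_8fps, 3)  # roughly 3x as many frames
print(clip_24fps.shape)
```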
Through these innovations, the method effectively leverages a small amount of video data together with a large amount of image data, and its strength lies in generating high-resolution, realistic try-on videos.
Experiments
The authors evaluated the performance of Dress&Dance from multiple perspectives.
First, they used two types of data sets that they constructed independently.
The first was approximately 80,000 garment/video pairs collected from the Internet, and the second was try-on video data recorded by 183 models.
In addition, approximately 4 million garment image pairs were used alongside these datasets to augment training.
Three modes were tested in the experiments: single-garment try-on, a multi-garment mode in which upper and lower garments are tried on simultaneously, and a mode in which garments worn by another person are transferred.
Open-source methods such as TPD and OOTDiffusion and commercial models such as Kling and Ray2 were used as baselines for comparison.
The evaluation combined quantitative metrics such as PSNR, SSIM, and LPIPS with GPT-based subjective assessments of try-on fidelity and motion quality.
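As a reference, the quantitative part of such an evaluation could be computed per clip as sketched below, assuming torchmetrics is installed (LPIPS additionally needs its pretrained backbone); the paper's exact evaluation protocol may differ.

```python
import torch
from torchmetrics.image import PeakSignalNoiseRatio, StructuralSimilarityIndexMeasure
from torchmetrics.image.lpip import LearnedPerceptualImagePatchSimilarity

psnr = PeakSignalNoiseRatio(data_range=1.0)
ssim = StructuralSimilarityIndexMeasure(data_range=1.0)
lpips = LearnedPerceptualImagePatchSimilarity(net_type="alex", normalize=True)

def evaluate_clip(pred: torch.Tensor, target: torch.Tensor) -> dict:
    """pred/target: (T, 3, H, W) videos with values in [0, 1]; frames are scored as a batch."""
    return {
        "psnr": psnr(pred, target).item(),
        "ssim": ssim(pred, target).item(),
        "lpips": lpips(pred, target).item(),  # lower is better
    }

# Toy example with random frames in place of generated and ground-truth clips.
scores = evaluate_clip(torch.rand(8, 3, 256, 256), torch.rand(8, 3, 256, 256))
print(scores)
```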
As a result, Dress&Dance significantly outperformed other methods in terms of fitting fidelity, and performed as well as or better than commercial systems in terms of visual quality and motion reproducibility.
Furthermore, ablation experiments confirmed that CondNet's design and stepwise learning strategy contributed significantly to the final quality improvement.
All in all, this method breaks through the current limitations and opens up the possibility of practical video-based virtual try-on.