
Dress&Dance: Video Diffusion Model For Highly Accurate Virtual Fitting And Motion Generation
3 main points
✔️ Dress&Dance is a method for generating high-resolution try-on + dance videos from a single image and reference video
✔️ Integrates clothing, people, and motion with CondNet to achieve faithful clothing reproduction and natural motion generation
✔️ Experiments show that it outperforms open-source and commercial methods, producing high-quality virtual try-on videos
Dress&Dance: Dress up and Dance as You Like It - Technical Preview
written by Jun-Kun Chen, Aayush Bansal, Minh Phuoc Vo, Yu-Xiong Wang
(Submitted on 28 Aug 2025)
Comments: Project Page: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
The images used in this article are from the paper, the introductory slides, or were created based on them.
Overview
This research proposes a new framework, "Dress&Dance," that virtually tries on clothes selected by the user and generates a video with arbitrary dance movements.
Conventional virtual try-on systems are often limited to still images, so users cannot see how a garment looks when worn in motion or how the fabric naturally sways with movement.
In addition, simply combining existing video generation models leads to noticeable problems, such as garment patterns breaking down and failures to follow pose changes.
Dress&Dance takes a single image of the user, an image of the garment to be tried on, and a reference video showing the action as input, and generates a 5-second, 24FPS, 1152 x 720 high-quality video.
Notably, users can try on both upper and lower garments at the same time, and the system is flexible enough to transfer garments worn by others.
In addition, accessories such as bags and shoes are retained, enabling a realistic and consistent try-on experience.
The system achieves a quality far superior to existing open source and commercial systems, and is expected to have innovative applications in online shopping and entertainment in the future.
Proposed Methodology
The core of the proposed method Dress&Dance is CondNet, a new conditioning network that utilizes attention mechanisms.
CondNet enables the unified processing of heterogeneous inputs such as text, images, and video, and improves garment registration and motion fidelity.
Specifically, the user image, garment image, and motion reference video are each tokenized and fed into the cross-attention of the diffusion model, allowing every pixel of the generated video to attend to all of the inputs.
This design enables natural video generation that follows the body's movements while preserving the details and texture of the garment.
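As a rough illustration of this idea, the sketch below merges heterogeneous condition tokens into a single cross-attention call; the class name, token counts, and dimensions are hypothetical simplifications and are not taken from the paper.

```python
import torch
import torch.nn as nn

class CrossAttentionConditioning(nn.Module):
    """Minimal sketch of attention-based conditioning in the spirit of CondNet.

    Assumption: the user image, garment image, and motion reference video are
    already tokenized into feature sequences; the paper's actual tokenizers,
    architecture, and dimensions may differ.
    """
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, video_tokens, user_tokens, garment_tokens, motion_tokens):
        # Concatenate the heterogeneous condition tokens into one key/value
        # sequence so every generated video token can attend to all inputs.
        cond = torch.cat([user_tokens, garment_tokens, motion_tokens], dim=1)
        attended, _ = self.attn(query=video_tokens, key=cond, value=cond)
        return self.norm(video_tokens + attended)  # residual update of the latent

# Toy shapes: batch of 2, a latent video of 512 tokens, conditions of varying length.
x = torch.randn(2, 512, 768)
out = CrossAttentionConditioning()(
    x,
    torch.randn(2, 64, 768),    # user image tokens
    torch.randn(2, 64, 768),    # garment image tokens
    torch.randn(2, 256, 768),   # motion reference video tokens
)
print(out.shape)  # torch.Size([2, 512, 768])
```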
In addition, a two-stage strategy is employed to improve training efficiency.
First, a "warm-up phase" based on curriculum learning teaches the system to estimate garment placement; this is followed by "incremental learning" that gradually raises the resolution, improving both stability and output quality.
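A minimal sketch of what such a staged schedule could look like is given below; the stage names, resolutions, and step counts are illustrative assumptions, not the paper's actual settings.

```python
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    resolution: tuple  # (height, width)
    steps: int

# Hypothetical two-stage curriculum: a low-resolution warm-up focused on
# garment placement, then progressively higher resolutions for stability.
SCHEDULE = [
    Stage("warmup_garment_placement", (288, 192), 50_000),
    Stage("incremental_mid_res",      (576, 384), 30_000),
    Stage("incremental_full_res",     (1152, 720), 20_000),
]

def run_training(schedule, train_step):
    """Iterate the stages in order, passing each stage's target resolution to the step function."""
    for stage in schedule:
        for step in range(stage.steps):
            train_step(stage.resolution, step)

# Dry run of the warm-up stage with a no-op training step.
run_training(SCHEDULE[:1], lambda resolution, step: None)
```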
Furthermore, video quality is further improved by a dedicated refiner module that upsamples the initial 8FPS output to 24FPS.
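For intuition only, the snippet below performs naive linear frame interpolation from 8FPS to roughly 24FPS; the paper's refiner is a learned module, so this stand-in merely illustrates the temporal upsampling it is responsible for.

```python
import torch
import torch.nn.functional as F

def naive_frame_interpolation(frames: torch.Tensor, factor: int = 3) -> torch.Tensor:
    """Upsample a (T, C, H, W) clip in time by an integer factor via linear blending.

    This is NOT the paper's refiner; it is a toy stand-in for the 8FPS -> 24FPS step.
    """
    t, c, h, w = frames.shape
    # Treat time as the interpolation axis: (T, C, H, W) -> (1, C*H*W, T).
    x = frames.permute(1, 2, 3, 0).reshape(1, c * h * w, t)
    x = F.interpolate(x, size=(t - 1) * factor + 1, mode="linear", align_corners=True)
    # Back to (T', C, H, W).
    return x.reshape(c, h, w, -1).permute(3, 0, 1, 2)

clip_8fps = torch.rand(40, 3, 64, 64)                 # 5 s at 8FPS (toy resolution)
clip_24fps = naive_frame_interpolation(clip_8fps, 3)  # roughly 3x as many frames
print(clip_24fps.shape)
```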
Through these innovations, the method effectively leverages a small amount of video data together with a large amount of image data, and its strength lies in generating high-resolution, realistic try-on videos.
Experiments
The authors evaluated the performance of Dress&Dance from multiple perspectives.
First, they used two types of data sets that they constructed independently.
The first was approximately 80,000 garment/video pairs collected from the Internet, and the second was try-on video data recorded by 183 models.
In addition, approximately 4 million garment image pairs were used alongside these datasets to augment training.
Three modes were tested in the experiments: single-garment try-on, a multi-garment mode in which upper and lower garments are tried on simultaneously, and a mode in which garments worn by another person are transferred.
Open-source methods such as TPD and OOTDiffusion and commercial models such as Kling and Ray2 were used as baselines for comparison.
The evaluation combined quantitative metrics such as PSNR, SSIM, and LPIPS with GPT-based subjective assessments of try-on fidelity and motion quality.
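As a reference, the quantitative part of such an evaluation could be computed per clip as sketched below, assuming torchmetrics is installed (LPIPS additionally needs its pretrained backbone); the paper's exact evaluation protocol may differ.

```python
import torch
from torchmetrics.image import PeakSignalNoiseRatio, StructuralSimilarityIndexMeasure
from torchmetrics.image.lpip import LearnedPerceptualImagePatchSimilarity

psnr = PeakSignalNoiseRatio(data_range=1.0)
ssim = StructuralSimilarityIndexMeasure(data_range=1.0)
lpips = LearnedPerceptualImagePatchSimilarity(net_type="alex", normalize=True)

def evaluate_clip(pred: torch.Tensor, target: torch.Tensor) -> dict:
    """pred/target: (T, 3, H, W) videos with values in [0, 1]; frames are scored as a batch."""
    return {
        "psnr": psnr(pred, target).item(),
        "ssim": ssim(pred, target).item(),
        "lpips": lpips(pred, target).item(),  # lower is better
    }

# Toy example with random frames in place of generated and ground-truth clips.
scores = evaluate_clip(torch.rand(8, 3, 256, 256), torch.rand(8, 3, 256, 256))
print(scores)
```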
As a result, Dress&Dance significantly outperformed other methods in terms of fitting fidelity, and performed as well as or better than commercial systems in terms of visual quality and motion reproducibility.
Furthermore, ablation experiments confirmed that CondNet's design and stepwise learning strategy contributed significantly to the final quality improvement.
All in all, this method breaks through the current limitations and opens up the possibility of practical video-based virtual try-on.