
ROSE: A New Method And Benchmark For Video Object Removal With Side Effects
3 main points
✔️ In addition to removing objects from videos, the proposed method simultaneously eliminates side effects such as shadows, reflections, and light sources
✔️ Synthetic data was created with Unreal Engine, and the model was trained by introducing difference-mask prediction into a diffusion model
✔️ Validated on the new benchmark ROSE-Bench, demonstrating performance and generalizability far superior to conventional methods
ROSE: Remove Objects with Side Effects in Videos
written by Chenxuan Miao, Yutong Feng, Jianshu Zeng, Zixiang Gao, Hantang Liu, Yunfeng Yan, Donglian Qi, Xi Chen, Bin Wang, Hengshuang Zhao
(Submitted on 26 Aug 2025)
Comments: Published on arxiv.
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
The images used in this article are from the paper, the introductory slides, or were created based on them.
Overview
The objective of this research is video object removal that eliminates not only the object itself but also the side effects its presence causes (shadows, reflections, light, transmission, mirror images, etc.).
Conventional video inpainting methods are effective at removing the object itself, but they produce unnatural results because they cannot adequately handle the object's effects on the surrounding environment, such as shadows and reflections.
Behind this problem is the lack of paired video data (with/without objects) that includes these side effects.
Therefore, the authors built an automated rendering pipeline using the Unreal Engine to create a large composite dataset that faithfully reproduces the side effects of objects.
The proposed method, ROSE, is a diffusion-transformer-based video inpainting model, which is unique in that it identifies side effects using the entire video as input.
Furthermore, it introduces an explicit supervisory signal based on difference mask prediction to capture side-effect regions with high accuracy.
In addition, a new benchmark called ROSE-Bench was constructed and comprehensively evaluated in scenarios involving a wide variety of side effects.
Experimental results showed that ROSE significantly outperforms existing methods and has high generalization ability to real-world videos.
Proposed Method
The proposed method, ROSE, is a video inpainting method built on a diffusion transformer.
Conventional methods employ a "mask-and-inpaint" approach in which masked regions are replaced with zero values, so they cannot accurately identify an object's side-effect regions.
ROSE employs the "reference-based erasing" method, in which the entire video is used as input and the interaction between the object and its environment is learned by the model's internal attention mechanism.
This enables natural detection and removal of side effects such as shadows and reflections.
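The difference between the two input schemes can be illustrated with a toy sketch (array shapes and the extra-channel conditioning are assumptions for illustration; the paper's actual conditioning details may differ):

```python
import numpy as np

# Toy video tensor: (frames, height, width, channels), values in [0, 1].
video = np.random.rand(8, 64, 64, 3)
mask = np.zeros((8, 64, 64, 1), dtype=np.float32)  # 1 = object region
mask[:, 20:40, 20:40, :] = 1.0

# Conventional "mask-and-inpaint": masked pixels are zeroed out, so the
# model never sees the object and cannot relate it to shadows or
# reflections that lie *outside* the mask.
masked_input = video * (1.0 - mask)

# "Reference-based erasing" (as described for ROSE): the untouched video
# is kept and the mask is attached as an extra channel, so attention can
# associate the object with its side effects elsewhere in the frame.
reference_input = np.concatenate([video, mask], axis=-1)
```

Keeping the object visible in the conditioning input is what lets the attention layers learn the object-environment interaction.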
In addition, a "mask expansion" method is introduced to simulate various mask accuracies, such as coarse rectangles and point annotations, for real-world use.
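A minimal sketch of such mask expansion, assuming two augmentation modes (bounding box and dilation); the function name and parameters are illustrative, not the paper's implementation:

```python
import numpy as np

def expand_mask(mask, mode="box", dilate_px=6):
    """Simulate coarse user annotations from a precise boolean object mask."""
    if mode == "box":
        # Coarse rectangle: fill the object's bounding box.
        ys, xs = np.nonzero(mask)
        out = np.zeros_like(mask)
        out[ys.min():ys.max() + 1, xs.min():xs.max() + 1] = True
        return out
    if mode == "dilate":
        # Loosely traced outline: grow the mask by dilate_px pixels.
        out = mask.copy()
        for _ in range(dilate_px):
            grown = out.copy()
            grown[1:, :] |= out[:-1, :]   # shift down
            grown[:-1, :] |= out[1:, :]   # shift up
            grown[:, 1:] |= out[:, :-1]   # shift right
            grown[:, :-1] |= out[:, 1:]   # shift left
            out = grown
        return out
    raise ValueError(f"unknown mode: {mode}")
```

Training against such degraded masks forces the model to tolerate the imprecise annotations typical of real-world use.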
Furthermore, "difference masks", obtained as the difference between the original video and the video after object removal, are used during training to explicitly localize side-effect regions.
This allows ROSE to accurately identify and repair not only the object itself, but also its impact on the environment.
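Since the synthetic data provides paired renders with and without the object, the difference-mask supervision can be derived per frame roughly as follows (the threshold value is an assumption):

```python
import numpy as np

def difference_mask(frame_with, frame_without, thresh=0.05):
    """Per-pixel difference between a paired render with and without the
    object. Pixels that change when the object disappears cover the
    object itself plus its shadows/reflections, i.e. the side-effect
    region used as explicit supervision. `thresh` is an assumed value."""
    diff = np.abs(frame_with.astype(np.float32) - frame_without.astype(np.float32))
    # Collapse the channel axis: a pixel counts if any channel changed.
    return (diff.max(axis=-1) > thresh).astype(np.float32)
```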
The final loss function combines a diffusion restoration loss with a mask prediction loss, with the balance between the two tuned so that the model trains stably.
Experiments
In the experiments, the model was first trained on 16,678 synthetic video pairs generated with the Unreal Engine.
These were created at 90 frames and 1080p resolution in diverse scenes, including urban and natural environments, covering side effects such as shadows, reflections, light sources, transmissions, and mirror images.
For evaluation, the newly constructed ROSE-Bench was used.
In addition to synthetic data, it uses the existing video segmentation dataset DAVIS to create realistic evaluation pairs, and also includes unpaired evaluation on real videos.
Representative methods such as DiffuEraser and ProPainter were selected for comparison.
As a result, ROSE significantly outperformed existing methods in quantitative indices such as PSNR, SSIM, and LPIPS, and demonstrated superior performance, especially in challenging side effects such as light sources and mirror images.
It also scored highly on real-world videos under the VBench metrics for background consistency and motion smoothness.
Furthermore, ablation studies have confirmed that reference-based erasing, mask expansion, and difference mask prediction are effective in improving performance.
Overall, ROSE is a state-of-the-art method for simultaneous object removal and side-effect removal, and has shown results that exceed conventional limits.