Removing objects from videos remains difficult in the presence of real-world imperfections such as shadows, abrupt motion, and defective masks. Existing diffusion-based video inpainting models often struggle to maintain temporal stability and visual consistency under these challenges. We propose Stable Video Object Removal (SVOR), a robust framework that achieves shadow-free, flicker-free, and mask-defect-tolerant removal through three key designs: MUSE (Mask Union for Stable Erasure), a windowed union strategy applied during temporal mask downsampling; DA-Seg (Denoising-Aware Segmentation), a lightweight decoupled side-branch segmentation head; and Curriculum Two-Stage Training, which first learns realistic background priors from unpaired data, then refines side-effect suppression with paired supervision. Extensive experiments show that SVOR attains new state-of-the-art results across multiple datasets and degraded-mask benchmarks.
SVOR addresses three orthogonal failure modes of existing video object removal pipelines (imperfect mask guidance, temporal misalignment under abrupt motion, and residual side-effects) through three complementary components: MUSE, DA-Seg, and Curriculum Two-Stage Training.
SVOR achieves state-of-the-art performance on DAVIS, ROSE Bench, and the newly introduced RORD-50 benchmark across all evaluation metrics including PSNR, SSIM, ReMOVE, and GPT-4o perceptual score.
MUSE in our method. Under abrupt motion, standard temporal downsampling can discard target locations entirely, causing missed removals and ghosting. MUSE replaces each compression window with the union of all mask frames within it, ensuring no object location is lost. Applying MUSE at inference already yields cleaner erasure; training with MUSE further eliminates residual flicker.
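The windowed union can be illustrated with a toy example. The following is a minimal sketch, not the authors' implementation: it assumes binary masks stored as nested lists, a hypothetical window size of 4 aligned to the first frame, and a baseline that keeps only the first mask of each window. The function names `downsample_naive` and `downsample_union` are illustrative.

```python
from typing import List

Mask = List[List[int]]  # one binary frame mask, H x W, values 0 or 1


def union_masks(frames: List[Mask]) -> Mask:
    """Pixel-wise logical OR over a non-empty group of frame masks."""
    height, width = len(frames[0]), len(frames[0][0])
    return [
        [int(any(f[y][x] for f in frames)) for x in range(width)]
        for y in range(height)
    ]


def downsample_naive(masks: List[Mask], window: int) -> List[Mask]:
    """Assumed baseline: keep only the first mask of each window."""
    return [masks[i] for i in range(0, len(masks), window)]


def downsample_union(masks: List[Mask], window: int) -> List[Mask]:
    """Replace each window with the union of all masks inside it."""
    return [
        union_masks(masks[i:i + window])
        for i in range(0, len(masks), window)
    ]


# Toy clip: a 1x4 "image" in which the object jumps one pixel per frame.
clip = [
    [[1, 0, 0, 0]],
    [[0, 1, 0, 0]],
    [[0, 0, 1, 0]],
    [[0, 0, 0, 1]],
]

print(downsample_naive(clip, window=4))  # [[[1, 0, 0, 0]]]
print(downsample_union(clip, window=4))  # [[[1, 1, 1, 1]]]
```

In this toy clip the subsampled mask covers only the object's first position, while the union covers every position the object visits inside the window. This is the property the paragraph above relies on under abrupt motion.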
MUSE in existing methods. MUSE can be applied as a training-free preprocessing step to any pipeline that uses temporal mask compression. Simply replacing each mask group with its union consistently reduces artifacts under abrupt motion for gen-omni, minimax, and ROSE, with no measurable degradation when motion is smooth.
@article{hu2026svor,
  title   = {From Ideal to Real: Stable Video Object Removal under Imperfect Conditions},
  author  = {Hu, Jiagao and Chen, Yuxuan and Li, Fuhao and Wang, Zepeng and Wang, Fei and Zhou, Daiguo and Luan, Jian},
  journal = {arXiv preprint arXiv:2603.09283},
  year    = {2026},
}