Abstract

Removing objects from videos remains difficult in the presence of real-world imperfections such as shadows, abrupt motion, and defective masks. Existing diffusion-based video inpainting models often struggle to maintain temporal stability and visual consistency under these challenges. We propose Stable Video Object Removal (SVOR), a robust framework that achieves shadow-free, flicker-free, and mask-defect-tolerant removal through three key designs: MUSE (Mask Union for Stable Erasure), a windowed union strategy applied during temporal mask downsampling; DA-Seg (Denoising-Aware Segmentation), a lightweight decoupled side-branch segmentation head; and Curriculum Two-Stage Training, which first learns realistic background priors from unpaired data, then refines side-effect suppression with paired supervision. Extensive experiments show that SVOR attains new state-of-the-art results across multiple datasets and degraded-mask benchmarks.

SVOR key results

Method

Existing video object removal pipelines exhibit three orthogonal failure modes: imperfect mask guidance, temporal misalignment under abrupt motion, and residual side effects. SVOR addresses them through three complementary components:

MUSE
Mask Union for Stable Erasure. Applies element-wise temporal OR within each compression window to prevent location collapse under abrupt motion. Plug-and-play; no retraining required.
DA-Seg
Denoising-Aware Segmentation head on a decoupled side branch. Uses DA-AdaLN to condition localization on the diffusion timestep, providing stable internal priors under defective masks without affecting backbone generation (see the sketch after this list).
Two-Stage Training
Stage I pretrains on ~49K unpaired background videos to learn realistic completion priors. Stage II refines on synthetic pairs with mask degradation and side-effect-weighted losses.
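To make the timestep conditioning in DA-Seg concrete, here is a minimal PyTorch sketch of a timestep-conditioned adaptive LayerNorm in the spirit of DA-AdaLN. The module name, tensor layout, and the scale/shift regressor are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class DAAdaLN(nn.Module):
    """Hypothetical timestep-conditioned adaptive LayerNorm (sketch).

    A diffusion-timestep embedding regresses a per-channel scale and
    shift that modulate normalized side-branch features, letting the
    segmentation head adapt its localization to the current noise level.
    """

    def __init__(self, dim: int, t_emb_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        # Regress (scale, shift) from the timestep embedding.
        self.to_scale_shift = nn.Sequential(
            nn.SiLU(),
            nn.Linear(t_emb_dim, 2 * dim),
        )

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim); t_emb: (batch, t_emb_dim)
        scale, shift = self.to_scale_shift(t_emb).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
```

Because the modulation lives on the decoupled side branch, the backbone's generation path stays untouched; only the segmentation features are re-scaled per timestep.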

SVOR Framework

Results

SVOR achieves state-of-the-art performance on DAVIS, ROSE Bench, and the newly introduced RORD-50 benchmark across all evaluation metrics, including PSNR, SSIM, ReMOVE, and GPT-4o perceptual score.

Qualitative results under perfect masks and under defective masks.

MUSE for Abrupt Motion Frames

1. MUSE in our method. Under abrupt motion, standard temporal downsampling can discard target locations entirely, causing missed removals and ghosting. MUSE replaces each compression window with the union of all mask frames within it, ensuring no object location is lost. Applying MUSE at inference already yields cleaner erasure; training with MUSE further eliminates residual flicker.
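Below is a minimal sketch of the windowed union, assuming binary masks of shape (T, H, W) and a temporal compression factor that divides T; the function name and the pad-or-trim policy are our assumptions.

```python
import torch

def muse_downsample(masks: torch.Tensor, window: int) -> torch.Tensor:
    """Temporal mask downsampling with windowed union (MUSE sketch).

    masks: (T, H, W) binary masks; window: temporal compression factor.
    Each output frame is the element-wise OR over its window, so a
    fast-moving object's locations are never dropped by compression.
    Assumes T is divisible by window (pad or trim beforehand otherwise).
    """
    t, h, w = masks.shape
    grouped = masks.bool().reshape(t // window, window, h, w)
    return grouped.any(dim=1).to(masks.dtype)  # (T // window, H, W)
```

Compared with strided selection such as masks[::window], the union guarantees that the object's footprint in every frame of the window survives compression.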

MUSE for our method.
2. MUSE in existing methods. MUSE can be applied as a training-free preprocessing step to any pipeline that uses temporal mask compression. Simply replacing each mask group with its union consistently reduces artifacts under abrupt motion across gen-omni, minimax, and ROSE, with no measurable degradation when motion is smooth.
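As a rough sketch of that preprocessing, assuming the same (T, H, W) mask layout as above and a window size matched to the downstream pipeline's temporal compression factor, every mask in a window is replaced with the window's union before the pipeline's own downsampling runs:

```python
import torch

def muse_preprocess(masks: torch.Tensor, window: int) -> torch.Tensor:
    """Training-free MUSE preprocessing (sketch).

    Keeps the original frame count but replaces each mask in a window
    with that window's union, so whatever temporal compression the
    downstream pipeline applies will see the same union mask.
    """
    t, h, w = masks.shape
    grouped = masks.bool().reshape(t // window, window, h, w)
    union = grouped.any(dim=1, keepdim=True)      # (T // window, 1, H, W)
    return union.expand(-1, window, -1, -1).reshape(t, h, w).to(masks.dtype)
```

Since this only edits the input masks, it drops into an existing pipeline without retraining; the only coupling is that the window size must match the pipeline's compression factor.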

MUSE for existing methods.

BibTeX

@article{hu2026svor,
  title     = {From Ideal to Real: Stable Video Object Removal under Imperfect Conditions},
  author    = {Hu, Jiagao and Chen, Yuxuan and Li, Fuhao and Wang, Zepeng and Wang, Fei and Zhou, Daiguo and Luan, Jian},
  journal   = {arXiv preprint arXiv:2603.09283},
  year      = {2026},
}