PROVE: A Perceptual RemOVal cohErence
Benchmark for Visual Media

Fuhao Li*,   Shaofeng You*,   Jiagao Hu*,   Yu Liu,   Yuxuan Chen,   Zepeng Wang,   Fei Wang,   Daiguo Zhou,   Jian Luan

* Equal contribution

MiLM Plus, Xiaomi Inc.

Abstract

Evaluating object removal in images and videos remains challenging because the task is inherently one-to-many, yet existing metrics frequently disagree with human perception. Full-reference metrics reward copy-paste behaviors over genuine erasure; no-reference metrics suffer from systematic biases such as favoring blurry results; and global temporal metrics are insensitive to localized artifacts within edited regions.

To address these limitations, we propose RC (Removal Coherence), a pair of perception-aligned metrics: RC-S, which measures spatial coherence via sliding-window feature comparison between masked and background regions, and RC-T, which measures temporal consistency via distribution tracking within shared restored regions across adjacent frames. To validate RC and support community benchmarking, we further introduce PROVE-Bench, a two-tier real-world benchmark comprising PROVE-M, an 80-video paired dataset with motion augmentation, and PROVE-H, a 100-video challenging subset without ground truth. Together, RC metrics and PROVE-Bench form the PROVE (Perceptual RemOVal cohErence) evaluation framework for visual media. Experiments across diverse image and video benchmarks demonstrate that RC achieves substantially stronger alignment with human judgments than existing evaluation protocols.

Motivation

Existing evaluation metrics for object removal exhibit systematic biases that conflict with human perception.

Figure 1. Illustrative examples of metric bias in object removal evaluation. (a) Full-reference metrics reward copy-paste behavior over genuine erasure. (b) No-reference metrics favor blurry outputs across diffusion steps. (c) Traditional vs. diffusion-based methods show inconsistencies between metric judgments and visual perception.

Figure 2. RC-S captures locally visible side effects and residual artifacts. Human-perceived ranking (1 = best): D > B > A > C. RC-S ranks D(1) > B(2) > A(3) > C(4), consistent with human perception. In contrast, ReMOVE ranks A(1) > D(2) > B(3) > C(4) and CFD ranks A(1) > B(2) > D(3) > C(4); both incorrectly favor the residual-containing result A over the cleanest removal D.

Full-Reference Bias

FR metrics (PSNR, SSIM, LPIPS) assume strict point-to-point correspondence to a single reference, rewarding conservative copy-paste outputs over perceptually realistic restorations.

No-Reference Blind Spots

NR metrics like ReMOVE and CFD frequently assign inflated scores to blurry outputs and incorrectly penalize structurally sound restorations in complex occlusion scenarios.

Temporal Insensitivity

Global temporal metrics (TC, TF) are dominated by unchanged background regions, failing to detect localized artifacts within the removed regions where object removal most commonly fails.

Figure 3. “Blur is Clean” bias analysis. As the blur radius inside the masked region increases, ReMOVE scores spuriously improve and CFD spuriously decreases (i.e., appears to improve), while RC-S degrades monotonically, demonstrating its robustness to the blur-favoring bias.

Proposed RC Metrics

We introduce Removal Coherence (RC), a unified local distribution-matching framework in a deep semantic feature space, instantiated as two complementary metrics.

Figure 4. Overview of the proposed RC metrics. (a) RC-S measures intra-frame spatial coherence by comparing masked and background feature distributions within sliding windows. (b) RC-T measures inter-frame temporal consistency by comparing restored-region feature distributions across adjacent frames under union-based cropping and intersection-based evaluation.

RC-S — Spatial Coherence

RC-S evaluates spatial coherence by cropping each target region, extracting DINOv2 features, and applying a sliding-window Maximum Mean Discrepancy (MMD) comparison between the feature distributions inside and outside the removed region. This enables fine-grained detection of local spatial incoherence that global metrics miss.
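
As a concrete illustration, here is a minimal sketch of an RC-S-style computation, assuming patch features have already been extracted on a spatial grid (e.g., with DINOv2; see the snippet further below) and the removal mask has been downsampled to the same grid. The window size, stride, RBF-kernel bandwidth, and the final mapping to a higher-is-better score are illustrative choices, not the paper's exact configuration.

```python
import numpy as np

def mmd_rbf(x, y, sigma=1.0):
    """Biased RBF-kernel MMD^2 estimator between feature sets x: [N, C] and y: [M, C]."""
    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

def rc_s(feats, mask, win=8, stride=4):
    """Spatial coherence sketch: slide a window over the cropped target region and
    compare the feature distribution of masked patches vs. surrounding background
    patches inside each window. feats: [H, W, C], mask: boolean [H, W]."""
    H, W, C = feats.shape
    mmds = []
    for i in range(0, H - win + 1, stride):
        for j in range(0, W - win + 1, stride):
            f = feats[i:i + win, j:j + win].reshape(-1, C)
            m = mask[i:i + win, j:j + win].reshape(-1)
            if m.any() and (~m).any():           # window must straddle the mask boundary
                mmds.append(mmd_rbf(f[m], f[~m]))
    # Illustrative mapping to a higher-is-better score (lower average MMD = more coherent).
    return 1.0 - float(np.mean(mmds)) if mmds else 1.0
```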

RC-T — Temporal Consistency

RC-T extends the local distribution matching design to the temporal domain. It jointly crops adjacent frames under a shared union mask, then measures feature distribution drift exclusively within the intersected restored regions, yielding sensitive detection of local temporal instability.
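
A corresponding sketch for an RC-T-style computation under the same assumptions (per-frame patch features on a common grid and boolean per-frame masks of the restored region); it reuses mmd_rbf from the sketch above. The union mask defines a shared crop for each adjacent frame pair, drift is measured only on patches restored in both frames, and the aggregation over frame pairs is an illustrative choice.

```python
import numpy as np

def rc_t(frame_feats, frame_masks):
    """Temporal consistency sketch: crop each adjacent frame pair with the union of
    their masks, then measure feature-distribution drift only inside the intersection
    of the restored regions. Lower = more temporally consistent."""
    drifts = []
    for t in range(len(frame_feats) - 1):
        fa, fb = frame_feats[t], frame_feats[t + 1]
        ma, mb = frame_masks[t], frame_masks[t + 1]
        union = ma | mb
        if not union.any():
            continue
        ys, xs = np.where(union)
        y0, y1, x0, x1 = ys.min(), ys.max() + 1, xs.min(), xs.max() + 1   # union-based crop
        inter = (ma & mb)[y0:y1, x0:x1].reshape(-1)                       # intersection-based eval
        a = fa[y0:y1, x0:x1].reshape(-1, fa.shape[-1])[inter]
        b = fb[y0:y1, x0:x1].reshape(-1, fb.shape[-1])[inter]
        if len(a) > 1:
            drifts.append(mmd_rbf(a, b))
    return float(np.mean(drifts)) if drifts else 0.0
```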

Key Design Choices

DINOv2 Features

DINOv2 provides a perceptually sensitive feature space for assessing fine-grained local coherence, showing stronger alignment with low-level human visual characteristics.
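
For reference, patch features of the kind assumed in the sketches above can be obtained from the official DINOv2 torch.hub entry point roughly as follows; the backbone variant, preprocessing, and reshape to a spatial grid are illustrative rather than the paper's exact setup.

```python
import torch
import torch.nn.functional as F

# Load a DINOv2 backbone from the official repo (ViT-B/14 chosen for illustration).
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14").eval()

@torch.no_grad()
def patch_features(img):                     # img: [3, H, W] float tensor in [0, 1]
    # Resize so both sides are multiples of the 14-pixel patch size.
    H, W = (img.shape[1] // 14) * 14, (img.shape[2] // 14) * 14
    x = F.interpolate(img[None], size=(H, W), mode="bilinear", align_corners=False)
    # ImageNet normalization, as commonly used with DINOv2.
    mean = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
    std = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)
    out = model.forward_features((x - mean) / std)
    tokens = out["x_norm_patchtokens"]        # [1, (H/14)*(W/14), C]
    return tokens.reshape(H // 14, W // 14, -1)   # spatial grid of patch features
```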

Sliding Window

Window-based comparison exposes regional inconsistency more explicitly than global aggregation, better matching the way humans visually inspect removal results.

MMD Distance

Maximum Mean Discrepancy accurately measures local distribution shifts between restored regions and surrounding context, outperforming first-order cosine similarity.
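
A toy example of this point, using the mmd_rbf helper from the RC-S sketch and synthetic Gaussian features: the two patch sets share the same mean, so the cosine similarity of their mean features is close to 1 and signals no difference, while the kernel MMD still detects the difference in spread.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = rng.normal(size=64)
background = mu + 0.1 * rng.normal(size=(200, 64))   # tight, coherent texture
restored   = mu + 1.0 * rng.normal(size=(200, 64))   # same mean, much noisier

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cos(background.mean(0), restored.mean(0)))     # ~1.0: first-order view sees no shift
print(mmd_rbf(background, restored, sigma=5.0))      # clearly > 0: distribution shift detected
```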

Why Local Cropping Matters

Cropping strategies of ReMOVE vs. RC-S. Green boxes denote crop regions, red areas denote masks. ReMOVE uses a single enlarged crop even when targets are spatially far apart, introducing excessive irrelevant background that dilutes the feature difference. RC-S crops each target independently with local context, enabling fine-grained detection of incoherence.

Counter-intuitive rankings under local perturbations. We apply Gaussian blur or a region swap to the masked areas. Human judgment clearly prefers original > blurred > swapped, yet ReMOVE and CFD produce the reverse ordering (red = ranked best, green = ranked second), while RC-S remains consistent with human perception.

PROVE-Bench

A two-tier real-world benchmark for evaluating object removal in video, combining paired evaluability with unconstrained stress testing.

Figure 5. Construction pipeline of PROVE-M: real-world paired capture with controlled conditions, three-stage pairwise quality control, and Ken Burns-style motion augmentation applied synchronously to input–mask–GT triplets.

PROVE-M — Motion-Augmented Paired Benchmark

80 videos with aligned input–mask–ground-truth triplets captured in real-world scenes.

  • Real-world paired capture with tripod-mounted camera
  • Motion augmentation via Ken Burns-style transformations (see the sketch after this list)
  • 81 frames at 1080p resolution per video
  • Three-stage quality control filtering pipeline
  • Covers shadows, reflections, multiple effects, and fast motion
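
A minimal sketch of the Ken Burns-style synchronized augmentation mentioned above, assuming tripod-static source footage and masks stored as single-channel uint8 arrays; the same per-frame crop-and-zoom window is applied to the input, mask, and ground truth so the triplet stays aligned. The zoom range, pan path, and interpolation modes are illustrative.

```python
import numpy as np
import cv2  # OpenCV, assumed available for cropping and resizing

def ken_burns(frames, masks, gts, n_out=81, out_size=(1920, 1080), zoom=(1.0, 1.2)):
    """Apply an identical slow crop-and-zoom path to input, mask, and GT frames."""
    H, W = frames[0].shape[:2]
    outs = ([], [], [])
    for t in range(n_out):
        s = zoom[0] + (zoom[1] - zoom[0]) * t / (n_out - 1)      # zoom factor over time
        ch, cw = int(H / s), int(W / s)
        y0 = (H - ch) * t // max(n_out - 1, 1)                   # linear pan path
        x0 = (W - cw) * t // max(n_out - 1, 1)
        src = (frames[min(t, len(frames) - 1)],
               masks[min(t, len(masks) - 1)],
               gts[min(t, len(gts) - 1)])
        interps = (cv2.INTER_LINEAR, cv2.INTER_NEAREST, cv2.INTER_LINEAR)  # keep masks binary
        for out, img, interp in zip(outs, src, interps):
            crop = img[y0:y0 + ch, x0:x0 + cw]
            out.append(cv2.resize(crop, out_size, interpolation=interp))
    return outs  # (aug_frames, aug_masks, aug_gts)
```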

PROVE-H — Hard Real-World Benchmark

100 videos targeting challenging scenarios without ground truth.

  • Crowd scenes and dense occlusions
  • Dynamic backgrounds (water, flames, rain, snow)
  • Highly textured backgrounds (grasslands, deserts)
  • Complex reflections and intertwined side effects
  • Fast-motion scenes with SAM3-generated masks

(a) PROVE-M — Sample frames from the motion-augmented paired benchmark.

(b) PROVE-H — Sample frames from the hard real-world benchmark.

Comparison with Existing Benchmarks

Each benchmark is compared on whether it provides real footage and ground truth, and whether it covers shadows, reflections, multi-effect cases, disconnected regions, crowds, textured backgrounds, and fast motion; the per-attribute indicators are given in the paper's table. Video counts:

Dataset | #Videos
DAVIS | 90
Movies | 5
Kubric | 5
GenProp | 15
ROSE-Bench | 60
PROVE-M (Ours) | 80
PROVE-H (Ours) | 100

Results

Benchmark Results on PROVE-M

Quantitative evaluation of mainstream video object removal methods on the PROVE-M benchmark. ↓ means lower is better, ↑ means higher is better.

Method | PSNR↑ | SSIM↑ | LPIPS↓ | ReMOVE | CFD | RC-S↑ | RC-T↓
FGT | 21.6511 | 0.8619 | 0.2013 | 0.8622 | 0.3229 | 0.3797 | 0.8031
ProPainter | 22.1846 | 0.8768 | 0.1559 | 0.8676 | 0.2774 | 0.4427 | 0.5951
DiffuEraser | 22.0758 | 0.8706 | 0.1518 | 0.8681 | 0.3308 | 0.4787 | 0.4851
VACE (1.3B) | 20.0826 | 0.8654 | 0.1545 | 0.8117 | 0.3283 | 0.4036 | 0.5217
Minimax-Remover (1.3B) | 21.7476 | 0.8707 | 0.1542 | 0.8710 | 0.3202 | 0.4793 | 0.4485
GenOmni (CogV5B) | 25.0165 | 0.9030 | 0.1223 | 0.8755 | 0.3842 | 0.5029 | 0.3145
GenOmni (Wan1.3B) | 25.1480 | 0.9017 | 0.1109 | 0.8815 | 0.3457 | 0.5188 | 0.3238
ROSE (1.3B) | 26.1333 | 0.9003 | 0.1212 | 0.8803 | 0.3364 | 0.4924 | 0.6538
EffectErase (1.3B) | 27.0049 | 0.9098 | 0.1142 | 0.8841 | 0.3412 | 0.5270 | 0.2728
UnderEraser (14B) | 28.3325 | 0.9156 | 0.0981 | 0.8824 | 0.2986 | 0.5188 | 0.3276
SVOR (1.3B) | 27.4289 | 0.9239 | 0.0839 | 0.8836 | 0.2794 | 0.5236 | 0.2987

Benchmark Results on PROVE-H

Quantitative evaluation of mainstream video object removal methods on the PROVE-H benchmark. ↓ means lower is better, ↑ means higher is better.

Method | PSNR↑ | SSIM↑ | LPIPS↓ | ReMOVE | CFD | RC-S↑ | RC-T↓
FGT | 29.4448 | 0.8615 | 0.1927 | 0.8474 | 0.3065 | 0.3716 | 0.5866
ProPainter | 33.3531 | 0.9274 | 0.1063 | 0.8383 | 0.2830 | 0.3932 | 0.4453
DiffuEraser | 31.4112 | 0.9178 | 0.1098 | 0.8440 | 0.3165 | 0.4387 | 0.3911
VACE (1.3B) | 26.7266 | 0.8898 | 0.1071 | 0.8047 | 0.3288 | 0.4192 | 0.3438
Minimax-Remover (1.3B) | 29.6021 | 0.8660 | 0.1315 | 0.8545 | 0.3320 | 0.4617 | 0.3277
GenOmni (CogV5B) | 28.7643 | 0.8873 | 0.1183 | 0.8536 | 0.3516 | 0.5006 | 0.2141
GenOmni (Wan1.3B) | 29.3140 | 0.8940 | 0.1027 | 0.8596 | 0.3422 | 0.5127 | 0.2368
ROSE (1.3B) | 27.6261 | 0.8508 | 0.1402 | 0.8538 | 0.3361 | 0.4687 | 0.4373
EffectErase (1.3B) | 24.3793 | 0.8156 | 0.1742 | 0.8532 | 0.3590 | 0.5081 | 0.2363
UnderEraser (14B) | 27.4989 | 0.8485 | 0.1434 | 0.8560 | 0.3165 | 0.5075 | 0.2688
SVOR (1.3B) | 27.5335 | 0.8907 | 0.1046 | 0.8574 | 0.3107 | 0.5166 | 0.2419

Note: Due to compliance requirements, the open-source data differs slightly from the data used in the paper. The results above are based on the open-source version and may exhibit minor numerical differences from the paper, but the overall trends remain consistent.

Human Correlation Analysis

RC-S achieves the best average correlation with human rankings across six benchmarks, ranking first on five of six benchmarks under both Kendall's τ and Spearman's ρ.

Kendall's τ

Metric | RORD | OBER | DAVIS | ROSE | PROVE-M | PROVE-H | Avg
ReMOVE | 0.06 | 0.54 | 0.15 | 0.21 | 0.33 | 0.23 | 0.26
CFD | -0.04 | 0.40 | 0.21 | 0.03 | 0.24 | 0.12 | 0.16
RC-S (Ours) | 0.31 | 0.57 | 0.60 | 0.61 | 0.70 | 0.76 | 0.59

Spearman's ρ

Metric | RORD | OBER | DAVIS | ROSE | PROVE-M | PROVE-H | Avg
ReMOVE | 0.08 | 0.61 | 0.16 | 0.24 | 0.36 | 0.27 | 0.29
CFD | -0.05 | 0.47 | 0.25 | 0.04 | 0.26 | 0.14 | 0.18
RC-S (Ours) | 0.39 | 0.66 | 0.68 | 0.69 | 0.75 | 0.82 | 0.66

The reference-based metrics are reported on only three of the six benchmarks; their values, in the paper's column order and followed by the average, are: PSNR τ = 0.01 / 0.36 / 0.38 (avg 0.25), ρ = 0.02 / 0.44 / 0.45 (avg 0.30); SSIM τ = -0.22 / 0.11 / 0.43 (avg 0.11), ρ = -0.31 / 0.11 / 0.46 (avg 0.09); LPIPS τ = -0.23 / 0.24 / 0.33 (avg 0.12), ρ = -0.28 / 0.28 / 0.37 (avg 0.13); m-LPIPS τ = 0.19 / 0.68 / 0.68 (avg 0.52), ρ = 0.24 / 0.75 / 0.75 (avg 0.58).
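
The rank correlations above can be computed with standard estimators; a small example with SciPy, using hypothetical per-method scores and human ratings on one benchmark (the actual study uses the human annotations described in the paper).

```python
from scipy.stats import kendalltau, spearmanr

# Hypothetical per-method scores on one benchmark (higher = better in both lists).
metric_scores = [0.38, 0.44, 0.48, 0.50, 0.53]   # e.g., RC-S per method
human_scores  = [2.1, 2.8, 3.4, 3.3, 4.0]        # e.g., mean human opinion scores

tau, _ = kendalltau(metric_scores, human_scores)
rho, _ = spearmanr(metric_scores, human_scores)
print(f"Kendall's tau = {tau:.2f}, Spearman's rho = {rho:.2f}")
```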

Validation of RC-T

RC-T responds sensitively and monotonically to controlled temporal corruptions, whereas existing temporal metrics (TC, TF) remain largely insensitive.
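
As a sketch of this kind of sensitivity check, one can inject a corruption confined to the masked region while sweeping its severity, for example frame-wise noise as below; a well-behaved temporal metric should then degrade monotonically with severity. The corruption type and the evaluate_rc_t wrapper are hypothetical, not the paper's exact protocol.

```python
import numpy as np

def corrupt_masked_region(frames, masks, severity):
    """Add frame-wise Gaussian noise only inside the masked region, producing
    localized temporal flicker whose strength grows with `severity`."""
    out = []
    for f, m in zip(frames, masks):
        noise = np.random.normal(0.0, severity, size=f.shape)
        noisy = np.clip(f.astype(np.float32) + noise, 0, 255).astype(f.dtype)
        out.append(np.where(m[..., None], noisy, f))
    return out

# Sweep severity and check that a temporal metric responds monotonically, e.g.:
# for s in (0, 5, 10, 20, 40):
#     score = evaluate_rc_t(corrupt_masked_region(frames, masks, s), masks)  # hypothetical wrapper
```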

Figure 6. Sensitivity-based validation of temporal metrics under increasing corruption severity. RC-T exhibits monotonically degrading scores, whereas TC and TF remain insensitive or even improve.

Citation

If you find our work useful for your research, please consider citing our paper:

@article{li2026prove,
  title={PROVE: A Perceptual RemOVal cohErence Benchmark for Visual Media},
  author={Li, Fuhao and You, Shaofeng and Hu, Jiagao and Liu, Yu and Chen, Yuxuan and Wang, Zepeng and Wang, Fei and Zhou, Daiguo and Luan, Jian},
  journal={arXiv preprint arXiv:2605.14534},
  year={2026}
}