Figure 1. This work focuses on the underexplored task of Multi-shot Video Object Segmentation (MVOS).
As shown in (a), the significant variations in object appearance, spatial location, and background across shots pose major challenges in MVOS.
We introduce Cut-VOS, a challenging MVOS benchmark with high transition diversity to support this task.
As shown in (b), on Cut-VOS, SAM2-B+ exhibits a 21.4% \( \mathcal{J}\)&\( \mathcal{F}\) drop compared to the challenging single-shot MOSE dataset and a 16.4% \( \mathcal{J}\)&\( \mathcal{F}\) drop compared to YouMVOS†.
Motivated by this observation, we propose a new transition-aware segmentation model, Segment Anything Across Shots (SAAS), which effectively segments objects in multi-shot videos.
Abstract
This work focuses on multi-shot semi-supervised video object segmentation (MVOS), which aims at segmenting the target object indicated by an initial mask throughout a video with multiple shots. Existing VOS methods mainly focus on single-shot videos and struggle with shot discontinuities, limiting their real-world applicability. We propose a transition mimicking data augmentation strategy (TMA), which enables cross-shot generalization using only single-shot data to alleviate the severe sparsity of annotated multi-shot data, and the Segment Anything Across Shots (SAAS) model, which detects and comprehends shot transitions effectively. To support evaluation and future study of MVOS, we introduce Cut-VOS, a new MVOS benchmark with dense mask annotations, diverse object categories, and high-frequency transitions. Extensive experiments on YouMVOS and Cut-VOS demonstrate that the proposed SAAS achieves state-of-the-art performance by effectively mimicking, understanding, and segmenting across complex transitions.
1. Cut-VOS Benchmark
Benchmark Statistics
Dataset | #Videos | #Objects | #Masks | #Shots | Trans. Frequency | Obj. Categories | Available
--- | --- | --- | --- | --- | --- | --- | ---
YouMVOS-test | 30 | 78 | 64.6K | 2.4K | 0.222/s | 4 | ✖
Cut-VOS (Ours) | 100 | 174 | 10.2K | 648 | 0.346/s | 11 | ✔
Table 1: The basic statistics for the Cut-VOS benchmark.
As shown in Table 1, the Cut-VOS benchmark contains 100 videos, 174 annotated objects, and 10.2K high-quality masks overall. Compared to the previous MVOS dataset YouMVOS, Cut-VOS contains more videos (100 vs. 30), objects representing more diverse scenarios, and carefully screened transitions of multiple types at a 1.6× higher frequency, reaching 0.346/s. Besides, unlike YouMVOS, which focuses solely on actors, especially human subjects, Cut-VOS covers 11 categories and 40+ subcategories spanning both actors and static objects, making it more in-the-wild.
Diverse Object Categories
Figure 2: Comparison of object categories. Cut-VOS contains the 4 categories in YouMVOS and 7 new categories.
As shown in Figure 2, the proposed Cut-VOS contains objects across 11 categories: Adult, Child, Virtual, Animal, Vehicle, Tool, Food, Architecture, Furniture, Plants, and Instrument.
The first five categories correspond to actors, while the remaining six belong to static objects, accounting for 62% and 38% of the benchmark, respectively.
Figure 3: Examples of diverse object categories in the Cut-VOS benchmark (Adult, Child, Animal, Vehicle, Tool, Food, Architecture, Furniture).
Transition Types Analysis
Figure 4: Visualization of 8 significant transition types: cut away, cut in, delayed cut in, scene change, pitch, horizon, close-up view, and distant view.
We classify all shot transitions into 9 different types: cut away, cut in, delayed cut in, scene change, pitch transformation, horizon transformation, close-up view, distant view, and insignificancy, as shown in Figure 4. We analyze these transition types to pinpoint the existing bottlenecks. We find that insignificancy and cut away are the easiest for existing VOS methods, while scene change, close-up view, and distant view are the most challenging. The Cut-VOS benchmark therefore contains more of the challenging types and few insignificant or long-duration cut-away transitions to maintain complexity.
The relevant analysis and statistics are provided in our paper and technical appendix. Click here to access our paper on arXiv.
2. TMA Strategy and SAAS Method
Figure 5: We first propose the TMA strategy to enable training on single-shot videos, alleviating the severe sparsity of annotated multi-shot data. TMA automatically generates samples with different patterns to mimic different transition types. Examples: (a) Random strong transforms. (b) A single transition across different segments of the same video. (c) Multiple transitions, constructing a case with cut away and cut in. (d) A single transition to another video, with random replication and gradual translations.
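The core idea behind TMA can be sketched in a few lines: split a single-shot clip at a random point and apply a strong joint transform to all frames and masks after the cut, so the sample contains a shot-like discontinuity. The sketch below is purely illustrative; the transform functions and signatures are our assumptions, not the paper's actual augmentation pipeline.

```python
import random
import numpy as np

# Toy "strong transforms": each takes a (frame, mask) pair of 2D arrays
# and returns a jointly transformed pair. The real TMA transforms are
# more elaborate; these stand-ins only illustrate the pattern.
def hflip(frame, mask):
    # Horizontal flip of both frame and mask.
    return frame[:, ::-1].copy(), mask[:, ::-1].copy()

def shift(frame, mask, dx=8):
    # Circular horizontal shift, mimicking an abrupt position change.
    return np.roll(frame, dx, axis=1), np.roll(mask, dx, axis=1)

def mimic_transition(frames, masks, cut_point=None, transform=None):
    """Split a single-shot clip at a cut point and apply one strong
    spatial transform to every frame after it, so the training sample
    contains an abrupt, shot-like discontinuity."""
    assert len(frames) == len(masks) and len(frames) >= 2
    if cut_point is None:
        cut_point = random.randint(1, len(frames) - 1)
    if transform is None:
        transform = random.choice([hflip, shift])
    out_frames = list(frames[:cut_point])
    out_masks = list(masks[:cut_point])
    for f, m in zip(frames[cut_point:], masks[cut_point:]):
        f2, m2 = transform(f, m)
        out_frames.append(f2)
        out_masks.append(m2)
    return out_frames, out_masks, cut_point
```

Pattern (b) from Figure 5 could be mimicked the same way by concatenating two disjoint segments of one video instead of transforming the tail; pattern (d) by splicing in a segment from a different video.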
Figure 6: The architecture of our proposed SAAS (Segment Anything Across Shots) model. It consists of three new components: the Transition Detection Module (TDM), the Transition Comprehension Module (TCH), and a local memory bank. These modules detect and understand the occurring transition and guide cross-shot segmentation. With the training support of TMA, SAAS achieves strong multi-shot segmentation capacity.
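At inference time, the three components interact roughly as follows: detect whether a cut occurred between consecutive frames, re-establish the target after a detected cut, and otherwise propagate with the local memory. The control-flow sketch below is purely illustrative; every function name and the memory-trimming policy are our assumptions, not the actual SAAS implementation.

```python
def segment_multishot(frames, init_mask, detect_transition, comprehend, segment):
    """Illustrative control flow only: detect a shot cut, re-localize the
    target after it, otherwise propagate masks with a local memory bank."""
    # Local memory bank of recent (frame, mask) pairs.
    memory = [(frames[0], init_mask)]
    masks = [init_mask]
    for prev, cur in zip(frames, frames[1:]):
        if detect_transition(prev, cur):
            # Across a cut, keep only the most reliable memory entry
            # and ask the comprehension step for a re-localization hint.
            memory = memory[-1:]
            hint = comprehend(cur, memory)
        else:
            hint = None
        mask = segment(cur, memory, hint)
        memory.append((cur, mask))
        masks.append(mask)
    return masks
```

The point of the sketch is the branch structure: segmentation is conditioned on the transition decision, so within-shot frames reuse memory as usual while cross-shot frames get special handling.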
3. Experiments
Benchmark Results
Table 2: Main results of existing VOS methods and our proposed SAAS on the YouMVOS and Cut-VOS benchmarks.
We evaluate representative VOS methods, including XMem, DEVA, Cutie, and SAM2, along with our proposed SAAS on both the YouMVOS and Cut-VOS benchmarks, as shown in Table 2. * denotes a model directly trained on the YTVOS dataset without extra data augmentation. Bold and underlined indicate the best and second-best performance among the tested methods. The results show that SAAS achieves state-of-the-art performance on both benchmarks with virtually no degradation in inference speed.
Qualitative Results
Figure 7. Qualitative comparison of representative cases from Cut-VOS between SAAS and SAM2. (a) shows a case with a delayed cut in transition and an abrupt position shift of the target objects. (b) demonstrates SAAS's stronger capacity in a crowded scene with complex relations: SAAS coherently segments the target object among ten similar objects.
BibTeX
Please consider citing SAAS if it helps your research.
@inproceedings{SAAS2025,
  title={Segment Anything Across Shots: A Method and Benchmark},
  author={Hu, Hengrui and Ying, Kaining and Ding, Henghui},
  booktitle={AAAI},
  year={2026}
}