Tracking-Any-Point (TAP) models aim to track any point through a video,
a task crucial to AR/XR and robotics applications. The recently introduced
TAPNext approach proposes an end-to-end, recurrent transformer architecture
to track points frame-by-frame in a purely online fashion, demonstrating
competitive performance at minimal latency.
However, we show that TAPNext struggles on longer video sequences and
frequently fails to re-detect query points that reappear after being occluded
or leaving the frame. In this work, we present TAPNext++,
a model that tracks points in sequences that are orders of magnitude longer
while preserving the low memory and compute footprint of the architecture.
We train the recurrent video transformer with several data-driven solutions,
including training on long 1024-frame sequences, made tractable by sequence-parallelism
techniques. We highlight that re-detection performance is a blind spot in the
current literature and introduce a new metric, Re-Detection Average
Jaccard (AJRD), to explicitly evaluate tracking on
re-appearing points. To improve re-detection, we introduce tailored
geometric augmentations, such as a periodic roll that simulates point re-entries,
and we additionally supervise occluded points. We demonstrate that recurrent transformers can
be substantially improved for point tracking and set a new state of the art on
multiple benchmarks.
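The periodic-roll augmentation mentioned above could be realized along the following lines. This is a minimal, hypothetical sketch (the abstract gives no implementation details): frames are cyclically shifted along the width with a time-varying offset, so that tracked points drift out of one image border and re-enter from the opposite side, and the ground-truth coordinates are wrapped accordingly. The function name, signature, and the sinusoidal shift schedule are all assumptions for illustration.

```python
import numpy as np

def periodic_roll(video, tracks, period=50, max_shift=32):
    """Hypothetical periodic-roll augmentation (not the paper's exact recipe):
    cyclically shift each frame horizontally by a time-varying offset so that
    tracked points leave one border and re-enter from the other, simulating
    out-of-frame re-entries.

    video:  (T, H, W, C) uint8 frames
    tracks: (T, N, 2) float (x, y) point coordinates
    """
    T, H, W, _ = video.shape
    t = np.arange(T)
    # Sinusoidal shift schedule: points drift out and back over `period` frames.
    shifts = np.round(max_shift * np.sin(2 * np.pi * t / period)).astype(int)
    # Roll pixels along the width axis of each (H, W, C) frame.
    out_video = np.stack([np.roll(f, s, axis=1) for f, s in zip(video, shifts)])
    # Wrap the x-coordinates of the ground-truth tracks to match.
    out_tracks = tracks.copy()
    out_tracks[..., 0] = (tracks[..., 0] + shifts[:, None]) % W
    return out_video, out_tracks
```

Because both pixels and labels wrap modulo the frame width, a point that "exits" on the right re-enters on the left with a consistent ground-truth position, giving the tracker supervised examples of re-detection.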