TAPNext++ sets a new state of the art for online point tracking — robust over long sequences, after re-detection, and through occlusion.

Abstract

Tracking-Any-Point (TAP) models aim to track any point through a video — a crucial task in AR/XR and robotics applications. The recently introduced TAPNext approach proposes an end-to-end, recurrent transformer architecture to track points frame-by-frame in a purely online fashion, demonstrating competitive performance at minimal latency.

However, we show that TAPNext struggles with longer video sequences and also frequently fails to re-detect query points that reappear after being occluded or leaving the frame. In this work, we present TAPNext++, a model that tracks points in sequences that are orders of magnitude longer while preserving the low memory and compute footprint of the architecture.

We train the recurrent video transformer using several data-driven solutions, including training on long 1024-frame sequences enabled by sequence-parallelism techniques. We highlight that re-detection performance is a blind spot in the current literature and introduce a new metric, Re-Detection Average Jaccard (AJRD), to explicitly evaluate tracking of re-appearing points. To improve re-detection, we introduce tailored geometric augmentations, such as a periodic roll that simulates point re-entries, and add supervision of occluded points. We demonstrate that recurrent transformers can be substantially improved for point tracking and set a new state of the art on multiple benchmarks.

Contributions

1   Long-Sequence Training at Scale
We enable end-to-end training of TAPNext on 1024-frame sequences via distributed sequence parallelism — splitting the temporal dimension across multiple GPUs using an efficient associative scan. This reduces GPU communication from 7 sequential steps to just 3, unlocking scalable long-horizon training.
2   Kubric-1024: A Long-Sequence Synthetic Dataset
We introduce Kubric-1024, a new large-scale synthetic dataset of 10,000 videos with 1024 frames each, featuring dynamic objects, realistic lighting via Polyhaven HDRIs, and physics-based motion via PyBullet. Crucially, scenes include velocity bumps to prevent static freeze-outs over long sequences.
3   Re-Detection Average Jaccard (AJRD)
We identify re-detection as a blind spot in existing TAP evaluation. We propose AJRD, a metric that measures tracking quality after a point reappears, conditioned on how long it was undetectable. The metric is computed over eligible reappearance events — those that set a new record invisibility duration for a given track.
4   Augmentations for Re-Detection Robustness
We introduce Roll augmentation — periodically shifting video frames off one boundary and wrapping them to the opposite side — which forces the model to rely on appearance matching rather than spatial proximity for re-detection. Combined with weighted supervision of occluded points and multi-resolution finetuning, this yields state-of-the-art AJRD on RoboTAP and PointOdyssey.

Method

Distributed Sequence Parallelism

TAPNext processes videos frame-by-frame using alternating ViT and State-Space-Model (SSM) blocks. SSM layers propagate temporal context causally and can be parallelized via an associative scan. We shard the 1024-frame input across GPUs: each GPU processes its chunk locally, followed by a logarithmic-time merge across GPU boundaries. This completes the forward and backward passes in 3 communication steps on 8 GPUs instead of 7 sequential hand-offs, making 1024-frame end-to-end training practical.
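The chunked scan above can be illustrated in a single-process sketch. The SSM recurrence h_t = a_t·h_{t-1} + b_t is an affine map, and composing affine maps is associative, so each chunk can be scanned locally and the per-chunk summaries merged in log₂(P) rounds. The function names and the scalar state are our illustrative simplifications; in the real system, multi-GPU collectives replace the in-memory lists.

```python
def combine(e1, e2):
    """Associative operator: compose two affine maps h -> a*h + b
    (e1 applied first, then e2)."""
    a1, b1 = e1
    a2, b2 = e2
    return (a1 * a2, a2 * b1 + b2)

def sequential_scan(elems):
    """Plain inclusive scan, used per-chunk and as a reference."""
    acc = elems[0]
    out = [acc]
    for e in elems[1:]:
        acc = combine(acc, e)
        out.append(acc)
    return out

def chunked_scan(elems, num_chunks=8):
    """Sequence-parallel inclusive scan over `num_chunks` shards."""
    n = len(elems) // num_chunks
    chunks = [elems[i * n:(i + 1) * n] for i in range(num_chunks)]

    # 1) Each "GPU" scans its own chunk locally, with no communication.
    local = [sequential_scan(ch) for ch in chunks]

    # 2) Hillis-Steele scan over the per-chunk summaries: the only
    #    cross-GPU part, finishing in log2(8) = 3 rounds for 8 chunks
    #    instead of 7 sequential hand-offs.
    vals = [l[-1] for l in local]
    rounds, d = 0, 1
    while d < num_chunks:
        vals = [vals[i] if i < d else combine(vals[i - d], vals[i])
                for i in range(num_chunks)]
        d *= 2
        rounds += 1

    # 3) Fix up each chunk's local results with its incoming prefix.
    out = list(local[0])
    for i in range(1, num_chunks):
        prefix = vals[i - 1]  # state entering chunk i
        out.extend(combine(prefix, e) for e in local[i])
    return out, rounds
```

With 8 chunks the cross-chunk merge takes exactly 3 rounds, matching the communication count quoted above, and the result is identical to a fully sequential scan.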

Multi-GPU distributed sequence parallelism setup showing forward and backward passes across 8 GPUs over 3 steps.

Roll Augmentation

To robustly re-detect points that leave and re-enter the frame, we apply roll augmentation: frames are periodically translated and wrapped around image boundaries with a gap margin. This forces query points to exit one side and re-enter from the opposite side, training the model to rely on appearance-based re-detection rather than spatial proximity heuristics. A margin of √(H²/2 + W²/2) prevents the same tracking point from appearing at two locations simultaneously.
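The point-side bookkeeping of this augmentation can be sketched as follows. We roll a query point on a virtual canvas of width W + margin and mark it invisible while it sits inside the margin strip; `roll_trajectory`, the fixed horizontal speed, and the 1D roll are our illustrative simplifications of the paper's periodic 2D roll.

```python
import math

def roll_trajectory(x0, width, height, num_frames, speed):
    """Trajectory and visibility of a query point under a horizontal roll.

    The content translates by `speed` px/frame and wraps on a virtual
    canvas of width `width + margin`; while the point is inside the margin
    strip it is off-screen, so the tracker must re-detect it on re-entry.
    """
    # Margin from the text: sqrt(H^2/2 + W^2/2), rounded up, so the same
    # point never appears at two locations at once.
    margin = math.ceil(math.sqrt(height ** 2 / 2 + width ** 2 / 2))
    canvas = width + margin
    xs, vis = [], []
    for t in range(num_frames):
        x = (x0 + speed * t) % canvas
        xs.append(x)
        vis.append(x < width)  # inside the real image -> visible
    return xs, vis
```

For a 256×256 frame the margin is 256 pixels, so a point rolling at 64 px/frame is visible for 4 frames, off-screen for 4 frames, and then re-enters from the opposite side.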

Roll augmentation — example 1
Roll augmentation — example 2

Occluded Point Supervision

The standard TAP loss only supervises visible points, giving no incentive to predict plausible locations for occluded points — hurting re-detection. We down-weight the occluded-point position loss by 0.2× to add a soft supervision signal.
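A minimal sketch of such a weighted position loss, using squared error over scalar coordinates; the function name and normalization are ours, and the model's actual loss head may differ.

```python
def position_loss(pred, target, visible, occluded_weight=0.2):
    """Weighted-average squared-error position loss: weight 1.0 for
    visible points and `occluded_weight` (0.2) for occluded ones, so
    occluded frames still receive a soft supervision signal."""
    total, denom = 0.0, 0.0
    for (px, py), (tx, ty), v in zip(pred, target, visible):
        w = 1.0 if v else occluded_weight
        total += w * ((px - tx) ** 2 + (py - ty) ** 2)
        denom += w
    return total / max(denom, 1e-8)
```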

Qualitative Comparison

We highlight the three key strengths of TAPNext++ through direct comparison with CoTracker3, Track-On, and BootsTAPNext on curated sequences.

1   Long-Horizon Tracking

By training on 1024-frame sequences, TAPNext++ sustains reliable tracks over long durations where competing online trackers drift or lose points entirely.

TAPNext++ (Ours)
CoTracker3
Track-On
BootsTAPNext

Both clock hands tracked over 1000+ frames. TAPNext++ maintains reliable tracks throughout the full sequence, while competing methods lose the hands well before the end.

2   Re-Detection & Re-Entry

Roll augmentation forces the model to re-detect points that leave and re-enter the frame. Other online trackers consistently fail at this — they rely on spatial proximity rather than appearance matching.

TAPNext++ (Ours)
CoTracker3
Track-On
BootsTAPNext

Juggling balls repeatedly leave and re-enter the frame. TAPNext++, trained with roll augmentation, reliably re-detects all query points on every re-entry. Competing methods fail to recover once points leave the field of view too many times.

3   Occlusion Robustness

Weighted supervision of occluded points provides a soft learning signal that keeps predicted locations plausible during occlusion, enabling accurate re-localization when the point becomes visible again.

TAPNext++ (Ours)
CoTracker3
Track-On
BootsTAPNext

A hand being hidden behind a monitor. TAPNext++ maintains plausible location estimates during occlusion — enabled by weighted occluded-point supervision — and re-localizes precisely when the hand reappears.

All panels show the same input sequence tracked independently by each method.

Qualitative Results on TAP-Vid-DAVIS

We show qualitative tracking results on a diverse selection of TAP-Vid-DAVIS sequences. TAPNext++ runs frame-by-frame at up to 562 FPS (256 query points, window mode on H100), achieving the best speed-accuracy trade-off among all online methods.

india

dance-jump

bmx-bumps

bus

car-shadow

drift-turn

motocross-jump

parkour

scooter-gray

train

Kubric-1024 Dataset

We introduce Kubric-1024, a new large-scale synthetic dataset of 10,000 videos at 1024 frames each. Each scene uses a random HDRI from Polyhaven for lighting and populates 10–20 static and 1–10 dynamic objects from the Google Scanned Objects dataset. A physics-based velocity-bump mechanism keeps objects in motion throughout long sequences to prevent scenes from becoming static. Training on Kubric-1024 in combination with PointOdyssey yields the best overall performance across all benchmarks.
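The velocity-bump mechanism can be sketched as a periodic check on each object's speed: objects that have (nearly) come to rest get a random impulse so long scenes never freeze. The threshold, impulse magnitude, and sampling distribution below are assumptions for illustration, not the dataset's exact parameters.

```python
import math
import random

def apply_velocity_bumps(velocities, min_speed=0.5, bump=2.0, rng=None):
    """Kick any near-stationary object with a random impulse.

    `velocities` is a list of (vx, vy, vz) tuples; objects already moving
    faster than `min_speed` are left untouched.
    """
    rng = rng or random.Random(0)
    out = []
    for vx, vy, vz in velocities:
        speed = math.sqrt(vx * vx + vy * vy + vz * vz)
        if speed < min_speed:
            theta = rng.uniform(0.0, 2.0 * math.pi)  # random direction
            vx, vy = bump * math.cos(theta), bump * math.sin(theta)
            vz = bump * rng.uniform(0.5, 1.0)        # slight upward kick
        out.append((vx, vy, vz))
    return out
```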

High-Resolution Tracking

TAPNext uses learned 2D positional embeddings initialized for 256×256 inputs. To support 512×512 inference, we bicubically interpolate these embeddings to the larger spatial grid and finetune the model at 512×512. The resulting variant improves performance on RoboTAP (AJ 66.0), approaching CoTracker3 (66.4), while retaining the low memory and compute footprint of the base model.
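A sketch of the embedding resampling, assuming an (H × W × D) grid stored as nested lists: we interpolate bilinearly to keep the example short where the paper interpolates bicubically, and the function name is ours.

```python
import math

def resize_pos_embed(embed, new_h, new_w):
    """Bilinearly resample an (old_h x old_w x dim) grid of positional
    embeddings to (new_h x new_w x dim), e.g. a 16x16 grid for 256px
    inputs (patch size 16) up to 32x32 for 512px inputs."""
    old_h, old_w, dim = len(embed), len(embed[0]), len(embed[0][0])
    out = []
    for i in range(new_h):
        # Map the new cell centre back into old-grid coordinates.
        y = (i + 0.5) * old_h / new_h - 0.5
        y0 = max(0, min(old_h - 1, int(math.floor(y))))
        y1 = min(old_h - 1, y0 + 1)
        fy = min(max(y - y0, 0.0), 1.0)
        row = []
        for j in range(new_w):
            x = (j + 0.5) * old_w / new_w - 0.5
            x0 = max(0, min(old_w - 1, int(math.floor(x))))
            x1 = min(old_w - 1, x0 + 1)
            fx = min(max(x - x0, 0.0), 1.0)
            top = [embed[y0][x0][d] * (1 - fx) + embed[y0][x1][d] * fx
                   for d in range(dim)]
            bot = [embed[y1][x0][d] * (1 - fx) + embed[y1][x1][d] * fx
                   for d in range(dim)]
            row.append([top[d] * (1 - fy) + bot[d] * fy for d in range(dim)])
        out.append(row)
    return out
```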

Tape-dispenser sequence tracked at 512×512 resolution. The denser patch grid captures fine surface texture, enabling precise localization of small features throughout the sequence.

Re-Detection Average Jaccard (AJRD)

Existing TAP metrics such as Average Jaccard (AJ) and Occlusion Accuracy (OA) average performance over all frames, including many frames where a point is stably visible. This obscures failures specifically on re-detection — tracking a point correctly after it re-appears.

We define AJRD as the mean AJ over eligible reappearance events, where an event is eligible if its invisibility duration dᵢ exceeds all previous disappearances for that track (i.e., it sets a new record). This focuses evaluation on progressively harder re-detections and is insensitive to trivially short occlusions.
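The event-selection rule can be sketched directly from a track's per-frame visibility sequence; the sketch returns each eligible reappearance with its invisibility duration, and omits the AJ computation over the selected events.

```python
def eligible_reappearances(visible):
    """Return (frame, duration) for each eligible reappearance event.

    An event is the frame where a point becomes visible again after `gap`
    invisible frames; it is eligible only if `gap` exceeds every earlier
    invisibility gap of this track, i.e. it sets a new record.
    """
    events, record, gap = [], 0, 0
    for t, v in enumerate(visible):
        if not v:
            gap += 1
        elif gap > record:
            events.append((t, gap))  # new record invisibility duration
            record, gap = gap, 0
        else:
            gap = 0  # short re-occlusion, not a record -> not eligible
    return events
```

For example, a track that vanishes for 1 frame and later for 2 frames yields two eligible events, while a second 1-frame gap after the 2-frame one would be ignored.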

We report AJRD as the mean over dₘᵢₙ ∈ {1, 4, 16, 64, 256}. On RoboTAP, TAPNext++ achieves state-of-the-art AJRD of 54.6 (512×512 model) and 52.4 (256×256 model), compared to 51.8 for CoTracker3 and 47.1 for BootsTAPNext — a consequence of our roll augmentation directly targeting this failure mode.

54.6     AJRD on RoboTAP (SOTA)
66.6     AJ on DAVIS (SOTA online)
562 FPS  Throughput (256 pts, H100)
72.2     Mean δavg (best across all benchmarks)

Quantitative Results

We compare against state-of-the-art online tracking methods across five standard TAP benchmarks. TAPNext++ achieves the highest mean δavg (72.2) across all benchmarks, state-of-the-art AJ on DAVIS and RGB-Stacking, and the best AJRD on both PointOdyssey and RoboTAP — all while running at the lowest latency (5.18 ms/frame for 256 query points).

Method               Res.     | PointOdyssey       | DAVIS            | RGB-Stacking     | RoboTAP                | Kinetics         | Mean
                              | δavg  Surv.↑ AJRD  | AJ↑   δavg  OA↑  | AJ↑   δavg  OA↑  | AJ↑   δavg  OA↑  AJRD  | AJ↑   δavg  OA↑  | δavg
PIPsv2               256×256  | 21.5  38.1   –     | –     63.4  –    | –     58.5  –    | –     63.0  –    –     | –     –     –    | 51.6
CoTracker2           384×512  | 30.2  55.2   3.9   | 62.2  75.7  89.3 | 67.4  78.9  85.2 | 58.6  70.6  87.0 33.3  | 48.8  64.5  85.8 | 64.0
CoTracker3 (Online)  384×512  | 44.5  56.3   21.8  | 63.8  76.3  90.2 | 71.7  83.6  90.2 | 66.4  78.8  90.8 51.8  | 55.8  68.5  88.3 | 70.3
Track-On             384×512  | 35.4  47.5   9.9   | 65.0  78.0  90.8 | 71.4  85.2  91.7 | 63.5  76.4  89.4 44.4  | 53.9  67.3  87.8 | 68.5
BootsTAPNext-B       256×256  | 9.9   13.0   0.6   | 65.2  78.5  91.2 | 66.2  78.3  86.8 | 62.6  74.3  88.4 47.1  | 57.3  70.6  87.4 | 62.3
TAPNext++ (256)      256×256  | 52.6  67.9   23.3  | 66.6  79.9  92.1 | 73.4  84.8  95.1 | 61.1  75.2  89.6 52.4  | 54.4  68.7  89.0 | 72.2

Table 1: Online tracking comparison across five benchmarks. Bold = best, underlined = second best.

BibTeX

@inproceedings{tapnextpp2026,
  title     = {{TAPNext++}: What's Next for Tracking Any Point ({TAP})?},
  author    = {Sebastian Jung and Artem Zholus and Martin Sundermeyer and Carl Doersch and Ross Goroshin and David Joseph Tan and Sarath Chandar and Rudolph Triebel and Federico Tombari},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision
               and Pattern Recognition (CVPR) -- Findings},
  year      = {2026},
}