Tracking-Any-Point (TAP) models aim to track any point through a video,
a task crucial to AR/XR and robotics applications. The recently introduced
TAPNext approach proposes an end-to-end, recurrent transformer architecture
to track points frame-by-frame in a purely online fashion, demonstrating
competitive performance at minimal latency.
However, we show that TAPNext struggles on longer video sequences and
frequently fails to re-detect query points that reappear after being occluded
or leaving the frame. In this work, we present TAPNext++,
a model that tracks points in sequences that are orders of magnitude longer
while preserving the low memory and compute footprint of the architecture.
We train the recurrent video transformer with several data-driven solutions,
including training on long 1024-frame sequences, made tractable by sequence-parallelism
techniques. We highlight that re-detection performance is a blind spot in the
current literature and introduce a new metric, Re-Detection Average
Jaccard (AJRD), to explicitly evaluate tracking on
re-appearing points. To improve re-detection, we introduce tailored
geometric augmentations, such as a periodic roll that simulates point re-entries,
and we additionally supervise occluded points. We demonstrate that recurrent transformers can
be substantially improved for point tracking and set a new state of the art on
multiple benchmarks.
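The periodic-roll augmentation mentioned above could be realized along the following lines. This is a minimal, hypothetical sketch (the abstract gives no implementation details): frames are cyclically shifted along the width with a time-varying offset, so that tracked points drift out of one image border and re-enter from the opposite side, and the ground-truth coordinates are wrapped accordingly. The function name, signature, and the sinusoidal shift schedule are all assumptions for illustration.

```python
import numpy as np

def periodic_roll(video, tracks, period=50, max_shift=32):
    """Hypothetical periodic-roll augmentation (not the paper's exact recipe):
    cyclically shift each frame horizontally by a time-varying offset so that
    tracked points leave one border and re-enter from the other, simulating
    out-of-frame re-entries.

    video:  (T, H, W, C) uint8 frames
    tracks: (T, N, 2) float (x, y) point coordinates
    """
    T, H, W, _ = video.shape
    t = np.arange(T)
    # Sinusoidal shift schedule: points drift out and back over `period` frames.
    shifts = np.round(max_shift * np.sin(2 * np.pi * t / period)).astype(int)
    # Roll pixels along the width axis of each (H, W, C) frame.
    out_video = np.stack([np.roll(f, s, axis=1) for f, s in zip(video, shifts)])
    # Wrap the x-coordinates of the ground-truth tracks to match.
    out_tracks = tracks.copy()
    out_tracks[..., 0] = (tracks[..., 0] + shifts[:, None]) % W
    return out_video, out_tracks
```

Because both pixels and labels wrap modulo the frame width, a point that "exits" on the right re-enters on the left with a consistent ground-truth position, giving the tracker supervised examples of re-detection.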