TRecViT: A Recurrent Video Transformer

Pătrăucean, Viorica; He, Xu Owen; Heyward, Joseph; Zhang, Chuhan; Sajjadi, Mehdi S. M.; Muraru, George-Cristian; Zholus, Artem; Karami, Mahdi; Goroshin, Ross; Chen, Yutian; Osindero, Simon; Carreira, João; Pascanu, Razvan

arXiv:2412.14294 (cs)

[Submitted on 18 Dec 2024 (v1), last revised 15 Feb 2026 (this version, v2)]

Abstract: We propose a novel block for causal video modelling. It relies on a time-space-channel factorisation with dedicated blocks for each dimension: gated linear recurrent units (LRUs) perform information mixing over time, self-attention layers perform mixing over space, and MLPs perform mixing over channels. The resulting architecture, TRecViT, is causal and shows strong performance on sparse and dense tasks, trained in supervised or self-supervised regimes, making it the first causal video model in the state-space model family. Notably, our model outperforms or is on par with the popular (non-causal) ViViT-L model on large-scale video datasets (SSv2, Kinetics400), while having $3\times$ fewer parameters, a $12\times$ smaller memory footprint, and a $5\times$ lower FLOP count than the full self-attention ViViT. Its inference throughput is about 300 frames per second, running comfortably in real time. When compared with causal transformer-based models (TSM, RViT) and other recurrent models like LSTM, TRecViT obtains state-of-the-art results on the challenging SSv2 dataset. Code and checkpoints are available online at this https URL.
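
The time-space-channel factorisation described in the abstract can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the gate formula, the identity attention projections, the absence of normalisation and residual connections, and every name below (`temporal_recurrence`, `spatial_attention`, `channel_mlp`, `block`, `decay_logit`) are assumptions chosen only to keep the example short. It shows the three mixing steps in order and checks the causality property, namely that changing a later frame leaves the outputs for earlier frames untouched.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def temporal_recurrence(video, decay_logit=1.0):
    """Mix over time with a gated linear recurrence, one state per spatial token.

    video: list of T frames; each frame is a list of N tokens of C floats.
    The gate below is a hypothetical simplification, not the paper's LRU.
    """
    num_tokens, channels = len(video[0]), len(video[0][0])
    state = [[0.0] * channels for _ in range(num_tokens)]
    out = []
    for frame in video:  # frames are visited strictly forward in time
        mixed_frame = []
        for n, token in enumerate(frame):
            for c, x in enumerate(token):
                gate = sigmoid(decay_logit + x)  # input-dependent gate in (0, 1)
                state[n][c] = gate * state[n][c] + (1.0 - gate) * x
            mixed_frame.append(list(state[n]))
        out.append(mixed_frame)
    return out


def spatial_attention(frame):
    """Mix over space: single-head self-attention among tokens of one frame."""
    scale = 1.0 / math.sqrt(len(frame[0]))
    mixed = []
    for query in frame:
        scores = [scale * sum(a * b for a, b in zip(query, key)) for key in frame]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        mixed.append([
            sum(w * value[c] for w, value in zip(weights, frame)) / total
            for c in range(len(query))
        ])
    return mixed


def channel_mlp(token, w1, w2):
    """Mix over channels: two-layer MLP with a ReLU, applied to one token."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, token))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]


def block(video, w1, w2):
    """Apply time mixing, then space mixing, then channel mixing."""
    timed = temporal_recurrence(video)
    spaced = [spatial_attention(frame) for frame in timed]
    return [[channel_mlp(token, w1, w2) for token in frame] for frame in spaced]


rng = random.Random(0)
T, N, C, H = 4, 3, 2, 4  # frames, tokens per frame, channels, MLP hidden width
video = [[[rng.uniform(-1, 1) for _ in range(C)] for _ in range(N)] for _ in range(T)]
w1 = [[rng.uniform(-1, 1) for _ in range(C)] for _ in range(H)]
w2 = [[rng.uniform(-1, 1) for _ in range(H)] for _ in range(C)]

out = block(video, w1, w2)

# Replace only the last frame and rerun: earlier outputs must be identical.
altered = video[:-1] + [[[0.0] * C for _ in range(N)]]
out_altered = block(altered, w1, w2)
print(out[:-1] == out_altered[:-1])
```

Because the only step that crosses frame boundaries is the forward recurrence, the output at frame t depends on frames up to t and nothing later, which is what allows frame-by-frame inference with a fixed-size state per token.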

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Cite as:	arXiv:2412.14294 [cs.CV]
	(or arXiv:2412.14294v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2412.14294

Submission history

From: Viorica Patraucean Dr
[v1] Wed, 18 Dec 2024 19:44:30 UTC (15,102 KB)
[v2] Sun, 15 Feb 2026 19:36:47 UTC (11,953 KB)
