SalienTR: A closer look at multi-modal transformer for RGB-T salient object detection
Introduction
Salient object detection (SOD) is a fundamental task in computer vision that aims to locate the most visually attractive objects in input images. SOD supports and benefits various computer vision tasks, such as image segmentation (Liu et al., 2023), object detection (Jia, Song, Cao, & Lu, 2023), and visual tracking (He & Chen, 2023). RGB-based SOD research has made substantial progress in recent years (Kousik et al., 2021; Peng et al., 2022; Qi et al., 2024; Xia et al., 2024; Yao & Wang, 2023). However, it can struggle when the background is cluttered or the illumination is dim, since texture and color cues in RGB images become unreliable. Fortunately, other modalities can complement RGB images for the SOD task, and the research focus has accordingly shifted from single images to multi-modal pairs (Kanwal & Taj, 2023; Wang, Wang, & Sun, 2023; Wu et al., 2022). For instance, a depth map encodes spatial and geometric information as well as the three-dimensional layout of a scene, which can highlight object outlines and facilitate detection; a thermal map captures the thermal infrared radiation emitted by any object above absolute zero, making it more robust to poor illumination. In addition, severe weather has little impact on thermal images, so thermal maps can support stable all-day SOD. Since fusing RGB images with depth/thermal maps provides a more comprehensive understanding of challenging scenes, how to design effective cross-modal fusion has attracted continuous attention.
In terms of fusion strategy, existing methods fall into three categories: early fusion, late fusion, and multi-scale fusion. The first two combine multi-modal information either at the very beginning or at the end of the network, and thus neglect multi-level feature interaction. Multi-scale fusion, by contrast, exploits low-level, intermediate, and high-level cross-modal features jointly; this is the strategy we adopt to construct our network (a minimal sketch of the three options follows). Convolutional neural networks (CNNs) have advanced saliency detection considerably (Ji et al., 2021; Qu et al., 2017), but they are limited in capturing long-range dependencies. The transformer-based architectures we build on (Liu et al., 2021a; Liu et al., 2021c) naturally model global semantics and show the capability of pushing salient object detection further forward.
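To make this taxonomy concrete, here is a minimal PyTorch-style sketch of where each strategy merges the two modalities. This is our illustration, not the paper's code; all callables (`backbone*`, `fuse_blocks`, `head`) are hypothetical placeholders.

```python
import torch

# Hypothetical callables: `backbone*` map images to features, `head` maps
# features to a saliency map, and each element of `fuse_blocks` fuses one
# feature level across modalities.

def early_fusion(rgb, thermal, backbone, head):
    # Merge at the input: stack modalities channel-wise, run a single network.
    return head(backbone(torch.cat([rgb, thermal], dim=1)))  # (B, 6, H, W) input

def late_fusion(rgb, thermal, backbone_r, backbone_t, head):
    # Merge at the output: two independent streams, combined only at the end.
    return head(torch.cat([backbone_r(rgb), backbone_t(thermal)], dim=1))

def multi_scale_fusion(rgb, thermal, backbone_r, backbone_t, fuse_blocks, head):
    # Merge at every level: low-, mid-, and high-level cross-modal features
    # interact, which is the strategy adopted in this paper.
    feats_r = backbone_r(rgb)      # list of hierarchical feature maps
    feats_t = backbone_t(thermal)
    fused = [fuse(fr, ft) for fuse, fr, ft in zip(fuse_blocks, feats_r, feats_t)]
    return head(fused)
```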
Similarly, transformer-based cross-modal interaction designs fall into four classes, as shown in Fig. 1. Standard self-attention connects modalities only implicitly; because this relationship is vague, the cross-modal complementarity may be suboptimal. Cross-attention compensates by relating each element in one modality to all elements in the other, but this design overlooks intra-modal relations. Fully joint attention remedies the situation with dense connections, yet its redundant links increase resource consumption and can introduce noisy interactions. To overcome these drawbacks, we devise a novel sparse attention mechanism for more effective cross-modal interaction: it first computes self-attention within local blocks along the spatial and modal dimensions, and then applies sparse axial global attention to build relationships among these local blocks. This strategy helps the model learn complementary cues from well-aligned RGB-T pairs while improving computational efficiency.
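The following is a minimal PyTorch sketch of this local-then-axial sparse attention, written by analogy with the description above; the window size, head count, and tensor layout are our assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class LocalThenAxialAttention(nn.Module):
    """Sketch of local-then-axial sparse attention. RGB and thermal tokens are
    stacked along a modality axis, so every attention window, row, and column
    spans both modalities at once."""

    def __init__(self, dim, heads=4, window=8):
        super().__init__()
        self.window = window  # H and W are assumed divisible by this
        self.local_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.row_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.col_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def _mha(self, attn, seq):
        return attn(seq, seq, seq, need_weights=False)[0]

    def forward(self, x):  # x: (B, 2, H, W, C), modality axis M = 2
        B, M, H, W, C = x.shape
        w = self.window
        # 1) Local attention inside (modality, window, window) blocks.
        blk = x.reshape(B, M, H // w, w, W // w, w, C)
        blk = blk.permute(0, 2, 4, 1, 3, 5, 6).reshape(-1, M * w * w, C)
        blk = self._mha(self.local_attn, blk)
        x = (blk.reshape(B, H // w, W // w, M, w, w, C)
                .permute(0, 3, 1, 4, 2, 5, 6).reshape(B, M, H, W, C))
        # 2) Sparse axial attention: each row, then each column, across modalities.
        row = x.permute(0, 2, 1, 3, 4).reshape(B * H, M * W, C)
        row = self._mha(self.row_attn, row)
        x = row.reshape(B, H, M, W, C).permute(0, 2, 1, 3, 4)
        col = x.permute(0, 3, 1, 2, 4).reshape(B * W, M * H, C)
        col = self._mha(self.col_attn, col)
        return col.reshape(B, W, M, H, C).permute(0, 2, 3, 1, 4)  # (B, 2, H, W, C)
```

Compared with fully joint attention, where each of the 2·H·W tokens attends to all others, each token here attends to only M·w² + M·W + M·H others, which is where the efficiency gain comes from.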
In this paper, we propose SalienTR, a new encoder-decoder framework for RGB-T salient object detection. Concretely, each input image is first split into non-overlapping patches before being fed into the encoder, which comprises two parallel Swin Transformers as the backbone to obtain hierarchical features. A cross-modal fusion transformer, dubbed ComFormer, fuses the intermediate feature maps of RGB images and the corresponding thermal maps by extending the self-attention mechanism from the image space to a space-modal 3D volume. More specifically, a local cross-modal multi-head self-attention (LoC-MSA) module first performs sparse self-attention within local blocks over the space-modal volume; a global cross-modal multi-head self-attention (GLoC-MSA) module then applies similar self-attention to feature patches along the vertical and horizontal dimensions. Equipped with these two modules, ComFormer captures local cross-modal features and learns global long-range representations. Next, a uni-modal convolution (Uni-Conv) module applies convolution to the RGB and thermal streams separately, both to digest the fused information and to inject a convolutional inductive bias. To generate high-quality segmentation masks with sharper contours, we present a dual-stream decoder that concatenates the multi-modal features from ComFormer with the multi-level encoded features from the backbone, and then predicts a saliency map together with the corresponding edge map (a structural sketch of this pipeline is given after the contribution list below). Comprehensive experiments demonstrate the effectiveness and generalization ability of SalienTR, which outperforms other state-of-the-art approaches. In summary, our main contributions are threefold:
- A novel transformer-based framework, termed SalienTR, is proposed for the RGB-T salient object detection task; it effectively extracts and fuses multi-level features from different modalities, and generates high-quality saliency maps with sharper object boundaries.
- A scalable cross-modal fusion transformer, namely ComFormer, is designed to extend the self-attention mechanism from the image space to a space-modal 3D volume. It computes local attention over neighboring patches across the modalities, and then computes global attention over sparse RGB-T patches along the vertical and horizontal dimensions. This allows SalienTR to better exploit inter- and intra-modal complementarity and to capture both local correlations and long-range dependencies between modalities.
- We conduct comprehensive experiments on three public RGB-T datasets and achieve state-of-the-art results. In particular, SalienTR is more robust than other approaches on degraded images in challenging scenarios. Moreover, SalienTR can also be applied to RGB-D SOD tasks, which further verifies its generalization ability.
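To make the overall pipeline concrete, here is the structural sketch referenced above, assuming standard PyTorch components; the module interfaces are placeholders and do not reproduce the authors' implementation.

```python
import torch.nn as nn

class SalienTRSketch(nn.Module):
    """Structural sketch only: two parallel Swin encoders, per-level ComFormer
    fusion, and a dual-stream decoder predicting saliency plus edges."""

    def __init__(self, backbone_r, backbone_t, comformers, decoder):
        super().__init__()
        self.backbone_r = backbone_r                  # Swin Transformer for RGB
        self.backbone_t = backbone_t                  # Swin Transformer for thermal
        self.comformers = nn.ModuleList(comformers)   # one ComFormer per level
        self.decoder = decoder                        # dual-stream decoder

    def forward(self, rgb, thermal):
        feats_r = self.backbone_r(rgb)        # 4 hierarchical RGB feature maps
        feats_t = self.backbone_t(thermal)    # 4 hierarchical thermal feature maps
        fused = [cf(fr, ft) for cf, fr, ft in zip(self.comformers, feats_r, feats_t)]
        # The decoder concatenates fused cross-modal features with the encoded
        # features and emits both a saliency map and its edge map.
        saliency, edge = self.decoder(fused, feats_r, feats_t)
        return saliency, edge
```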
Uni-modal salient object detection
Over the past decades, SOD has received extensive attention in computer vision. Early research on RGB images progressed by means of traditional image processing: such approaches rely mainly on low-level hand-crafted features, especially heuristic prior knowledge such as color (Cheng, Mitra, Huang, Torr, & Hu, 2014a), boundary (Zhu, Liang, Wei, & Sun, 2014), and background (Han et al., 2014). Nevertheless, due to the lack of high-level semantics and contextual information, these methods struggle in complex scenarios.
Overview
The overall architecture of the proposed SalienTR is shown in Fig. 2. Specifically, it contains two parallel Swin Transformer (Liu et al., 2021b) backbone networks, which extract hierarchical structural details from RGB images and spatial features from thermal infrared maps, respectively. The encoded RGB and thermal features from low to high levels, downsampled by 4, 8, 16, and 32 times, are denoted as F_i^r and F_i^t (i = 1, ..., 4), respectively.
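As a quick illustration of these strides, the snippet below prints the feature-map sizes for each encoder level, assuming a 384x384 input and Swin-B channel widths; both the resolution and the channel counts are illustrative assumptions, not values stated in this excerpt.

```python
# Feature-map sizes for the four encoder levels (strides 4, 8, 16, 32).
# Input size and channel widths are assumptions for illustration only.
H = W = 384
for i, (stride, ch) in enumerate(zip((4, 8, 16, 32), (128, 256, 512, 1024)), start=1):
    print(f"F{i}^r and F{i}^t: ({ch}, {H // stride}, {W // stride})")
```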
Datasets and evaluation metrics
Extensive experiments are conducted on three public RGB-T SOD benchmarks: VT821 (Wang et al., 2018), VT1000 (Tu et al., 2019b), and VT5000 (Tu et al., 2022b). Our training set consists of 2500 samples from VT5000, following the same setting as Liu, Tan, He, and Xiao (2022), while the remaining images form the testing set. Quantitative results are reported on four widely used evaluation metrics: S-measure (S), max F-measure (F), max E-measure (E), and mean absolute error (MAE).
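For reference, here is a minimal NumPy sketch of two of these metrics (our illustration; S-measure and E-measure involve structural and enhanced alignment terms and follow their original definitions, so they are omitted). The value beta^2 = 0.3 is the convention commonly used in SOD evaluation.

```python
import numpy as np

def mae(pred, gt):
    """Mean absolute error between a saliency map and ground truth, both in [0, 1]."""
    return np.abs(pred - gt).mean()

def max_f_measure(pred, gt, beta2=0.3):
    """Max F-measure: sweep binarization thresholds over the prediction and
    keep the best F score; beta2 = 0.3 emphasizes precision, as is customary."""
    gt = gt > 0.5
    best = 0.0
    for t in np.linspace(0.0, 1.0, 256):
        binary = pred >= t
        tp = np.logical_and(binary, gt).sum()
        precision = tp / max(binary.sum(), 1)
        recall = tp / max(gt.sum(), 1)
        if precision + recall > 0:
            f = (1 + beta2) * precision * recall / (beta2 * precision + recall)
            best = max(best, f)
    return best
```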
Implementation details
We adjust the resolution
Conclusion
In this paper, we present a new transformer-based model for RGB-T salient object detection, namely SalienTR. Our method not only effectively mines hierarchical representations of different modalities, but also fuses cross-modal features to generate high-quality saliency results. Specifically, it builds on two parallel backbone networks appended with the ComFormer and a dual-stream decoder. (1) ComFormer is composed of LoC-MSA, GLoC-MSA, and Uni-Conv in sequence, which learns both local and global cross-modal representations.
CRediT authorship contribution statement
Ruohao Guo: Methodology, Validation, Investigation, Resources, Data curation, Writing – original draft. Wenzhen Yue: Methodology, Validation, Investigation, Visualization, Writing – original draft. Liao Qu: Methodology, Investigation, Visualization, Data curation, Writing – original draft. Yanyu Qi: Investigation, Validation, Writing – original draft. Dantong Niu: Investigation, Data curation, Writing – original draft. Xianghua Ying: Conceptualization, Formal analysis, Writing – original draft,
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgment
This work was supported by the National Natural Science Foundation of China (NSFC) under Grant No. 62371009, and Beijing Natural Science Foundation under Grant No. L247029.
References

- Multi-modal fusion network with multi-scale multi-path and cross-modal interactions for RGB-D salient object detection. Pattern Recognition (2019).
- Perceptual localization and focus refinement network for RGB-D salient object detection. Expert Systems with Applications (2025).
- Enhancing discriminative appearance model for visual tracking. Expert Systems with Applications (2023).
- CNN-based encoder-decoder networks for salient object detection: A comprehensive review and recent advances. Information Sciences (2021).
- IMDet: Injecting more supervision to CenterNet-like object detection. Expert Systems with Applications (2023).
- CAFCNet: Cross-modality asymmetric feature complement network for RGB-T salient object detection. Expert Systems with Applications (2024).
- CViT-Net: A conformer driven RGB-D salient object detector with operation-wise attention learning. Expert Systems with Applications (2023).
- Improved salient object detection using hybrid convolution recurrent neural network. Expert Systems with Applications (2021).
- A coarse-to-fine segmentation frame for polyp segmentation via deep and classification features. Expert Systems with Applications (2023).
- Global-prior-guided fusion network for salient object detection. Expert Systems with Applications (2022).
- DCMNet: Discriminant and cross-modality network for RGB-D salient object detection. Expert Systems with Applications (2023).
- Aggregate interactive learning for RGB-D salient object detection. Expert Systems with Applications (2022).
- RCNet: Related context-driven network with hierarchical attention for salient object detection. Expert Systems with Applications (2024).
- Object localization and edge refinement network for salient object detection. Expert Systems with Applications (2023).
- A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence (1986).
- DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence (2017).
- RGB-D salient object detection via 3D convolutional neural networks. Proceedings of the AAAI Conference on Artificial Intelligence (2020).
- VPDETR: End-to-end vanishing point detection transformers. Proceedings of the AAAI Conference on Artificial Intelligence (2024).
- CM-PIE: Cross-modal perception for interactive-enhanced audio-visual video parsing. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2024).
- Global context-aware progressive aggregation network for salient object detection. Proceedings of the AAAI Conference on Artificial Intelligence (2020).
- Global contrast based salient region detection. IEEE Transactions on Pattern Analysis and Machine Intelligence (2014).
- Depth enhanced saliency detection method. Proceedings of the International Conference on Internet Multimedia Computing and Service (2014).
- JL-DCF: Joint learning and densely-cooperative fusion framework for RGB-D salient object detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2020).
- Unified information fusion network for multi-modal RGB-D and RGB-T salient object detection. IEEE Transactions on Circuits and Systems for Video Technology (2022).
- SOTR: Segmenting objects with transformers. Proceedings of the IEEE International Conference on Computer Vision (2021).
- Instance-level panoramic audio-visual saliency detection and ranking. Proceedings of the 32nd ACM International Conference on Multimedia (2024).
- Open-vocabulary audio-visual semantic segmentation. Proceedings of the 32nd ACM International Conference on Multimedia (2024).
- UniTR: A unified transformer-based framework for co-object and multi-modal saliency detection. IEEE Transactions on Multimedia (2024).
- Background prior-based salient object detection via deep reconstruction residual. IEEE Transactions on Circuits and Systems for Video Technology (2014).
- CCNet: Criss-cross attention for semantic segmentation. Proceedings of the IEEE International Conference on Computer Vision (2019).
- Efficient context-guided stacked refinement network for RGB-T salient object detection. IEEE Transactions on Circuits and Systems for Video Technology (2022).
- Accurate RGB-D salient object detection via collaborative learning. Proceedings of the European Conference on Computer Vision (2020).
- Depth saliency based on anisotropic center-surround difference. Proceedings of the IEEE International Conference on Image Processing (2014).
- Hierarchical alternate interaction network for RGB-D salient object detection. IEEE Transactions on Image Processing (2021).
- ICNet: Information conversion network for RGB-D based salient object detection. IEEE Transactions on Image Processing (2020).
- Saliency detection on light field. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2014).
- Learning selective self-mutual attention for RGB-D saliency detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2020).
- Visual saliency transformer. Proceedings of the IEEE International Conference on Computer Vision (2021).
