SalienTR: A closer look at multi-modal transformer for RGB-T salient object detection
Introduction
Salient object detection (SOD) is a fundamental task in computer vision that aims to locate the most visually attractive objects in input images. SOD supports and benefits various computer vision tasks, such as image segmentation (Liu et al., 2023), object detection (Jia, Song, Cao, & Lu, 2023), and visual tracking (He & Chen, 2023). RGB-based SOD research has made substantial progress in recent years (Kousik et al., 2021; Peng et al., 2022; Qi et al., 2024; Xia et al., 2024; Yao & Wang, 2023). However, it can struggle when the background is cluttered or the illumination is dim, since texture and color cues in RGB images become unreliable. Fortunately, other modalities can complement RGB images for the SOD task, and the research focus has accordingly shifted from single images to multi-modal pairs (Kanwal & Taj, 2023; Wang, Wang, & Sun, 2023; Wu et al., 2022). For instance, a depth map encodes spatial and geometric information as well as the three-dimensional layout of a scene, which can highlight object outlines and facilitate detection; a thermal map captures the thermal infrared radiation emitted by any object above absolute zero, making it more robust to poor illumination. In addition, severe weather has little impact on thermal images, so thermal maps can support stable all-day SOD. Since fusing RGB images with depth/thermal maps provides a more comprehensive understanding of challenging scenes, how to design effective cross-modal fusion has attracted continuous attention.
In terms of fusion strategy, existing methods fall into three categories: early fusion, late fusion, and multi-scale fusion. The first two combine multi-modal information either at the very beginning or at the end of the network, and thus neglect multi-level feature interaction. Multi-scale fusion, by contrast, exploits low-level, intermediate, and high-level cross-modal features jointly; this is the strategy we adopt to construct our network (a minimal sketch of the three options follows). Convolutional neural networks (CNNs) have advanced saliency detection considerably (Ji et al., 2021; Qu et al., 2017), but they are limited in capturing long-range dependencies. The transformer-based architectures we build on (Liu et al., 2021a; Liu et al., 2021c) naturally model global semantics and show the capability of pushing salient object detection further forward.
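To make this taxonomy concrete, here is a minimal PyTorch-style sketch of where each strategy merges the two modalities. This is our illustration, not the paper's code; all callables (`backbone*`, `fuse_blocks`, `head`) are hypothetical placeholders.

```python
import torch

# Hypothetical callables: `backbone*` map images to features, `head` maps
# features to a saliency map, and each element of `fuse_blocks` fuses one
# feature level across modalities.

def early_fusion(rgb, thermal, backbone, head):
    # Merge at the input: stack modalities channel-wise, run a single network.
    return head(backbone(torch.cat([rgb, thermal], dim=1)))  # (B, 6, H, W) input

def late_fusion(rgb, thermal, backbone_r, backbone_t, head):
    # Merge at the output: two independent streams, combined only at the end.
    return head(torch.cat([backbone_r(rgb), backbone_t(thermal)], dim=1))

def multi_scale_fusion(rgb, thermal, backbone_r, backbone_t, fuse_blocks, head):
    # Merge at every level: low-, mid-, and high-level cross-modal features
    # interact, which is the strategy adopted in this paper.
    feats_r = backbone_r(rgb)      # list of hierarchical feature maps
    feats_t = backbone_t(thermal)
    fused = [fuse(fr, ft) for fuse, fr, ft in zip(fuse_blocks, feats_r, feats_t)]
    return head(fused)
```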
Similarly, transformer-based cross-modal interaction designs fall into four classes, as shown in Fig. 1. Standard self-attention connects modalities only implicitly; because this relationship is vague, the cross-modal complementarity may be suboptimal. Cross-attention compensates by relating each element in one modality to all elements in the other, but this design overlooks intra-modal relations. Fully joint attention remedies the situation with dense connections, yet its redundant links increase resource consumption and can introduce noisy interactions. To overcome these drawbacks, we devise a novel sparse attention mechanism for more effective cross-modal interaction: it first computes self-attention within local blocks along the spatial and modal dimensions, and then applies sparse axial global attention to build relationships among these local blocks. This strategy helps the model learn complementary cues from well-aligned RGB-T pairs while improving computational efficiency.
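The following is a minimal PyTorch sketch of this local-then-axial sparse attention, written by analogy with the description above; the window size, head count, and tensor layout are our assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class LocalThenAxialAttention(nn.Module):
    """Sketch of local-then-axial sparse attention. RGB and thermal tokens are
    stacked along a modality axis, so every attention window, row, and column
    spans both modalities at once."""

    def __init__(self, dim, heads=4, window=8):
        super().__init__()
        self.window = window  # H and W are assumed divisible by this
        self.local_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.row_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.col_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def _mha(self, attn, seq):
        return attn(seq, seq, seq, need_weights=False)[0]

    def forward(self, x):  # x: (B, 2, H, W, C), modality axis M = 2
        B, M, H, W, C = x.shape
        w = self.window
        # 1) Local attention inside (modality, window, window) blocks.
        blk = x.reshape(B, M, H // w, w, W // w, w, C)
        blk = blk.permute(0, 2, 4, 1, 3, 5, 6).reshape(-1, M * w * w, C)
        blk = self._mha(self.local_attn, blk)
        x = (blk.reshape(B, H // w, W // w, M, w, w, C)
                .permute(0, 3, 1, 4, 2, 5, 6).reshape(B, M, H, W, C))
        # 2) Sparse axial attention: each row, then each column, across modalities.
        row = x.permute(0, 2, 1, 3, 4).reshape(B * H, M * W, C)
        row = self._mha(self.row_attn, row)
        x = row.reshape(B, H, M, W, C).permute(0, 2, 1, 3, 4)
        col = x.permute(0, 3, 1, 2, 4).reshape(B * W, M * H, C)
        col = self._mha(self.col_attn, col)
        return col.reshape(B, W, M, H, C).permute(0, 2, 3, 1, 4)  # (B, 2, H, W, C)
```

Compared with fully joint attention, where each of the 2·H·W tokens attends to all others, each token here attends to only M·w² + M·W + M·H others, which is where the efficiency gain comes from.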
In this paper, we propose SalienTR, a new encoder-decoder framework for RGB-T salient object detection. Concretely, each input image is first split into non-overlapping patches before being fed into the encoder, which comprises two parallel Swin Transformers as the backbone to obtain hierarchical features. A cross-modal fusion transformer, dubbed ComFormer, fuses the intermediate feature maps of RGB images and the corresponding thermal maps by extending the self-attention mechanism from the image space to a space-modal 3D volume. More specifically, a local cross-modal multi-head self-attention (LoC-MSA) module first performs sparse self-attention within local blocks over the space-modal volume; a global cross-modal multi-head self-attention (GLoC-MSA) module then applies similar self-attention to feature patches along the vertical and horizontal dimensions. Equipped with these two modules, ComFormer captures local cross-modal features and learns global long-range representations. Next, a uni-modal convolution (Uni-Conv) module applies convolution to the RGB and thermal streams separately, both to digest the fused information and to inject a convolutional inductive bias. To generate high-quality segmentation masks with sharper contours, we present a dual-stream decoder that concatenates the multi-modal features from ComFormer with the multi-level encoded features from the backbone, and then predicts a saliency map together with the corresponding edge map (a structural sketch of this pipeline is given after the contribution list below). Comprehensive experiments demonstrate the effectiveness and generalization ability of SalienTR, which outperforms other state-of-the-art approaches. In summary, our main contributions are threefold:
- A novel transformer-based framework, termed SalienTR, is proposed for the RGB-T salient object detection task; it effectively extracts and fuses multi-level features from different modalities, and generates high-quality saliency maps with sharper object boundaries.
- A scalable cross-modal fusion transformer, namely ComFormer, is designed to extend the self-attention mechanism from the image space to a space-modal 3D volume. It computes local attention over neighboring patches across the modalities, and then computes global attention over sparse RGB-T patches along the vertical and horizontal dimensions. This allows SalienTR to better exploit inter- and intra-modal complementarity and to capture both local correlations and long-range dependencies between modalities.
- We conduct comprehensive experiments on three public RGB-T datasets and achieve state-of-the-art results. In particular, SalienTR is more robust than other approaches on degraded images in challenging scenarios. Moreover, SalienTR can also be applied to RGB-D SOD tasks, which further verifies its generalization ability.
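To make the overall pipeline concrete, here is the structural sketch referenced above, assuming standard PyTorch components; the module interfaces are placeholders and do not reproduce the authors' implementation.

```python
import torch.nn as nn

class SalienTRSketch(nn.Module):
    """Structural sketch only: two parallel Swin encoders, per-level ComFormer
    fusion, and a dual-stream decoder predicting saliency plus edges."""

    def __init__(self, backbone_r, backbone_t, comformers, decoder):
        super().__init__()
        self.backbone_r = backbone_r                  # Swin Transformer for RGB
        self.backbone_t = backbone_t                  # Swin Transformer for thermal
        self.comformers = nn.ModuleList(comformers)   # one ComFormer per level
        self.decoder = decoder                        # dual-stream decoder

    def forward(self, rgb, thermal):
        feats_r = self.backbone_r(rgb)        # 4 hierarchical RGB feature maps
        feats_t = self.backbone_t(thermal)    # 4 hierarchical thermal feature maps
        fused = [cf(fr, ft) for cf, fr, ft in zip(self.comformers, feats_r, feats_t)]
        # The decoder concatenates fused cross-modal features with the encoded
        # features and emits both a saliency map and its edge map.
        saliency, edge = self.decoder(fused, feats_r, feats_t)
        return saliency, edge
```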
Uni-modal salient object detection
Over the past decades, SOD has received extensive attention in computer vision. Early research on RGB images progressed by means of traditional image processing: such approaches rely mainly on low-level hand-crafted features, especially heuristic prior knowledge such as color (Cheng, Mitra, Huang, Torr, & Hu, 2014a), boundary (Zhu, Liang, Wei, & Sun, 2014), and background (Han et al., 2014). Nevertheless, due to the lack of high-level semantics and contextual information, these methods struggle in complex scenarios.
Overview
The overall architecture of the proposed SalienTR is shown in Fig. 2. Specifically, it contains two parallel Swin Transformer (Liu et al., 2021b) backbone networks, which extract hierarchical structural details from RGB images and spatial features from thermal infrared maps, respectively. The encoded RGB and thermal features from low to high levels, downsampled by 4, 8, 16, and 32 times, are denoted as F_i^r and F_i^t (i = 1, ..., 4), respectively.
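As a quick illustration of these strides, the snippet below prints the feature-map sizes for each encoder level, assuming a 384x384 input and Swin-B channel widths; both the resolution and the channel counts are illustrative assumptions, not values stated in this excerpt.

```python
# Feature-map sizes for the four encoder levels (strides 4, 8, 16, 32).
# Input size and channel widths are assumptions for illustration only.
H = W = 384
for i, (stride, ch) in enumerate(zip((4, 8, 16, 32), (128, 256, 512, 1024)), start=1):
    print(f"F{i}^r and F{i}^t: ({ch}, {H // stride}, {W // stride})")
```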
Datasets and evaluation metrics
Extensive experiments are conducted on three public RGB-T SOD benchmarks: VT821 (Wang et al., 2018), VT1000 (Tu et al., 2019b), and VT5000 (Tu et al., 2022b). Our training set consists of 2500 samples from VT5000, following the same setting as Liu, Tan, He, and Xiao (2022), while the remaining images form the testing set. Quantitative results are reported on four widely used evaluation metrics: S-measure (S), max F-measure (F), max E-measure (E), and mean absolute error (MAE).
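For reference, here is a minimal NumPy sketch of two of these metrics (our illustration; S-measure and E-measure involve structural and enhanced alignment terms and follow their original definitions, so they are omitted). The value beta^2 = 0.3 is the convention commonly used in SOD evaluation.

```python
import numpy as np

def mae(pred, gt):
    """Mean absolute error between a saliency map and ground truth, both in [0, 1]."""
    return np.abs(pred - gt).mean()

def max_f_measure(pred, gt, beta2=0.3):
    """Max F-measure: sweep binarization thresholds over the prediction and
    keep the best F score; beta2 = 0.3 emphasizes precision, as is customary."""
    gt = gt > 0.5
    best = 0.0
    for t in np.linspace(0.0, 1.0, 256):
        binary = pred >= t
        tp = np.logical_and(binary, gt).sum()
        precision = tp / max(binary.sum(), 1)
        recall = tp / max(gt.sum(), 1)
        if precision + recall > 0:
            f = (1 + beta2) * precision * recall / (beta2 * precision + recall)
            best = max(best, f)
    return best
```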
Implementation details
We adjust the resolution
Conclusion
In this paper, we present a new transformer-based model for RGB-T salient object detection, namely SalienTR. Our method not only effectively mines hierarchical representations of different modalities, but also fuses cross-modal features to generate high-quality saliency results. Specifically, it builds on two parallel backbone networks appended with the ComFormer and a dual-stream decoder. (1) ComFormer is composed of LoC-MSA, GLoC-MSA, and Uni-Conv in sequence, which learns both local and global cross-modal representations.
CRediT authorship contribution statement
Ruohao Guo: Methodology, Validation, Investigation, Resources, Data curation, Writing – original draft. Wenzhen Yue: Methodology, Validation, Investigation, Visualization, Writing – original draft. Liao Qu: Methodology, Investigation, Visualization, Data curation, Writing – original draft. Yanyu Qi: Investigation, Validation, Writing – original draft. Dantong Niu: Investigation, Data curation, Writing – original draft. Xianghua Ying: Conceptualization, Formal analysis, Writing – original draft,
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgment
This work was supported by the National Natural Science Foundation of China (NSFC) under Grant No. 62371009, and Beijing Natural Science Foundation under Grant No. L247029.
References

- Multi-modal fusion network with multi-scale multi-path and cross-modal interactions for RGB-D salient object detection. Pattern Recognition (2019).
- Perceptual localization and focus refinement network for RGB-D salient object detection. Expert Systems with Applications (2025).
- Enhancing discriminative appearance model for visual tracking. Expert Systems with Applications (2023).
- CNN-based encoder-decoder networks for salient object detection: A comprehensive review and recent advances. Information Sciences (2021).
- IMDet: Injecting more supervision to CenterNet-like object detection. Expert Systems with Applications (2023).
- CAFCNet: Cross-modality asymmetric feature complement network for RGB-T salient object detection. Expert Systems with Applications (2024).
- CViT-Net: A conformer driven RGB-D salient object detector with operation-wise attention learning. Expert Systems with Applications (2023).
- Improved salient object detection using hybrid convolution recurrent neural network. Expert Systems with Applications (2021).
- A coarse-to-fine segmentation frame for polyp segmentation via deep and classification features. Expert Systems with Applications (2023).
- Global-prior-guided fusion network for salient object detection. Expert Systems with Applications (2022).
- DCMNet: Discriminant and cross-modality network for RGB-D salient object detection. Expert Systems with Applications (2023).
- Aggregate interactive learning for RGB-D salient object detection. Expert Systems with Applications (2022).
- RCNet: Related context-driven network with hierarchical attention for salient object detection. Expert Systems with Applications (2024).
- Object localization and edge refinement network for salient object detection. Expert Systems with Applications (2023).
- A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence (1986).
- DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence (2017).
- RGB-D salient object detection via 3D convolutional neural networks. Proceedings of the AAAI Conference on Artificial Intelligence (2020).
- VPDETR: End-to-end vanishing point detection transformers. Proceedings of the AAAI Conference on Artificial Intelligence (2024).
- CM-PIE: Cross-modal perception for interactive-enhanced audio-visual video parsing. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2024).
- Global context-aware progressive aggregation network for salient object detection. Proceedings of the AAAI Conference on Artificial Intelligence (2020).
- Global contrast based salient region detection. IEEE Transactions on Pattern Analysis and Machine Intelligence (2014).
- Depth enhanced saliency detection method. Proceedings of the International Conference on Internet Multimedia Computing and Service (2014).
- JL-DCF: Joint learning and densely-cooperative fusion framework for RGB-D salient object detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2020).
- Unified information fusion network for multi-modal RGB-D and RGB-T salient object detection. IEEE Transactions on Circuits and Systems for Video Technology (2022).
- SOTR: Segmenting objects with transformers. Proceedings of the IEEE International Conference on Computer Vision (2021).
- Instance-level panoramic audio-visual saliency detection and ranking. Proceedings of the 32nd ACM International Conference on Multimedia (2024).
- Open-vocabulary audio-visual semantic segmentation. Proceedings of the 32nd ACM International Conference on Multimedia (2024).
- UniTR: A unified transformer-based framework for co-object and multi-modal saliency detection. IEEE Transactions on Multimedia (2024).
- Background prior-based salient object detection via deep reconstruction residual. IEEE Transactions on Circuits and Systems for Video Technology (2014).
- CCNet: Criss-cross attention for semantic segmentation. Proceedings of the IEEE International Conference on Computer Vision (2019).
- Efficient context-guided stacked refinement network for RGB-T salient object detection. IEEE Transactions on Circuits and Systems for Video Technology (2022).
- Accurate RGB-D salient object detection via collaborative learning. Proceedings of the European Conference on Computer Vision (2020).
- Depth saliency based on anisotropic center-surround difference. Proceedings of the IEEE International Conference on Image Processing (2014).
- Hierarchical alternate interaction network for RGB-D salient object detection. IEEE Transactions on Image Processing (2021).
- ICNet: Information conversion network for RGB-D based salient object detection. IEEE Transactions on Image Processing (2020).
- Saliency detection on light field. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2014).
- Learning selective self-mutual attention for RGB-D saliency detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2020).
- Visual saliency transformer. Proceedings of the IEEE International Conference on Computer Vision (2021).
