Preprint

UNITE

Unified Semantic Transformer for 3D Scene Understanding

A single feed-forward model that unifies multiple 3D semantic tasks from RGB input.

Ulm University · Google · TU Vienna · TU Munich
UNITE teaser figure
UNITE predicts multiple 3D semantic outputs in one forward pass.

Unified outputs

Semantic + instance signals, open-vocabulary features, and articulation cues in a single model.

Feed-forward and fast

Runs end-to-end on unseen scenes without iterative optimization.

RGB-only prediction

Trained with 2D distillation and multi-view consistency; no ground-truth 3D is required at test time.

3D-consistent features

Multi-view losses encourage point-wise consistency across viewpoints.

Abstract

Holistic 3D scene understanding requires capturing and reasoning about complex, unstructured environments. Most existing approaches are tailored to a single downstream task, which makes it difficult to reuse representations across objectives. We introduce UNITE, a unified semantic transformer for 3D scene understanding: a feed-forward model that consolidates a diverse set of 3D semantic predictions within a single network. Given only RGB images, UNITE infers a 3D point-based reconstruction with dense features and jointly predicts semantic segmentation, instance-aware embeddings, open-vocabulary descriptors, and signals related to affordances and articulation. Training combines 2D distillation from foundation models with self-supervised multi-view constraints that promote 3D-consistent features across views. Across multiple benchmarks, UNITE matches or exceeds task-specific baselines and remains competitive even against methods that rely on ground-truth 3D geometry.

Method

UNITE takes a set of RGB images and predicts a point cloud representation with dense per-point features. A transformer-based backbone produces a shared 3D representation that is decoded into multiple output heads (e.g., semantic labels, instance embeddings, open-vocabulary features, articulation-related signals). Supervision is obtained via 2D distillation on the input views, while multi-view consistency losses align features for pixels that correspond to the same 3D point.
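
Below is a minimal PyTorch sketch of the two supervision signals described above, under the assumption that per-point features have already been projected back into the input views and that pixel correspondences between views are available. All function and tensor names are hypothetical, and the loss weights are illustrative placeholders, not values from the paper.

import torch.nn.functional as F

def distillation_loss(student_feats, teacher_feats):
    # 2D distillation: projected per-point features should match features
    # produced by a frozen 2D foundation model on the same input views.
    # student_feats, teacher_feats: (num_pixels, C)
    return 1.0 - F.cosine_similarity(student_feats, teacher_feats, dim=-1).mean()

def multiview_consistency_loss(feats_view_a, feats_view_b):
    # Self-supervised consistency: pixels in two views that observe the same
    # 3D point should carry the same feature.
    # feats_view_a, feats_view_b: (num_matches, C) features of matched pixels
    return 1.0 - F.cosine_similarity(feats_view_a, feats_view_b, dim=-1).mean()

def training_loss(student_feats, teacher_feats, feats_a, feats_b,
                  w_distill=1.0, w_consist=0.5):
    # Combined objective; the weights here are placeholders.
    return (w_distill * distillation_loss(student_feats, teacher_feats)
            + w_consist * multiview_consistency_loss(feats_a, feats_b))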

UNITE method overview
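
As a complement to the overview figure, the sketch below illustrates the shared-representation, multi-head decoding pattern described above: every output modality reads the same per-point feature. Module names and dimensions are assumptions for illustration, not the released architecture.

import torch.nn as nn

class MultiHeadDecoder(nn.Module):
    # Decodes one shared per-point feature into the output modalities listed above.
    def __init__(self, feat_dim=256, num_classes=20, inst_dim=64,
                 clip_dim=512, artic_dim=8):
        super().__init__()
        self.semantic_head = nn.Linear(feat_dim, num_classes)    # semantic logits
        self.instance_head = nn.Linear(feat_dim, inst_dim)       # instance embeddings
        self.open_vocab_head = nn.Linear(feat_dim, clip_dim)     # open-vocabulary descriptors
        self.articulation_head = nn.Linear(feat_dim, artic_dim)  # articulation-related signals

    def forward(self, point_feats):
        # point_feats: (num_points, feat_dim) from the shared transformer backbone
        return {
            "semantics": self.semantic_head(point_feats),
            "instances": self.instance_head(point_feats),
            "open_vocab": self.open_vocab_head(point_feats),
            "articulation": self.articulation_head(point_feats),
        }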

Qualitative results

Examples on ScanNet, ScanNet++, and MultiScan. Rows show different output modalities: (a) semantic and instance segmentation, (b) open-vocabulary retrieval, and (c) articulation prediction.

Qualitative results overview grid

Interactive viewer of UNITE features, switching between RGB input, instance segmentation, open-vocabulary (CLIP) features, and articulation prediction.

BibTeX

@article{koch2025unite,
  title   = {Unified Semantic Transformer for 3D Scene Understanding},
  author  = {Koch, Sebastian and Wald, Johanna and Matsuki, Hidenobu and Hermosilla, Pedro and Ropinski, Timo and Tombari, Federico},
  journal = {arXiv preprint arXiv:2512.14364},
  year    = {2025}
}