Preprint

UNITE

Unified Semantic Transformer for 3D Scene Understanding

A single feed-forward model that unifies multiple 3D semantic tasks from RGB input.

Ulm University · Google · TU Vienna · TU Munich
UNITE teaser figure
UNITE predicts multiple 3D semantic outputs in one forward pass.

Unified outputs

Semantic + instance signals, open-vocabulary features, and articulation cues in a single model.

Feed-forward and fast

Runs end-to-end on unseen scenes without iterative optimization.

RGB-only prediction

Trained with 2D distillation and multi-view consistency; no ground-truth 3D is required at test time.

3D-consistent features

Multi-view losses encourage point-wise consistency across viewpoints.

Abstract

Holistic 3D scene understanding requires capturing and reasoning about complex, unstructured environments. Most existing approaches are tailored to a single downstream task, which makes it difficult to reuse representations across objectives. We introduce UNITE, a unified semantic transformer for 3D scene understanding: a feed-forward model that consolidates a diverse set of 3D semantic predictions within a single network. Given only RGB images, UNITE infers a 3D point-based reconstruction with dense features and jointly predicts semantic segmentation, instance-aware embeddings, open-vocabulary descriptors, and signals related to affordances and articulation. Training combines 2D distillation from foundation models with self-supervised multi-view constraints that promote 3D-consistent features across views. Across multiple benchmarks, UNITE matches or exceeds task-specific baselines and remains competitive even against methods that rely on ground-truth 3D geometry.

Method

UNITE takes a set of RGB images and predicts a point cloud representation with dense per-point features. A transformer-based backbone produces a shared 3D representation that is decoded into multiple output heads (e.g., semantic labels, instance embeddings, open-vocabulary features, articulation-related signals). Supervision is obtained via 2D distillation on the input views, while multi-view consistency losses align features for pixels that correspond to the same 3D point.
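
Below is a minimal PyTorch sketch of the two supervision signals described above, under the assumption that per-point features have already been projected back into the input views and that pixel correspondences between views are available. All function and tensor names are hypothetical, and the loss weights are illustrative placeholders, not values from the paper.

import torch.nn.functional as F

def distillation_loss(student_feats, teacher_feats):
    # 2D distillation: projected per-point features should match features
    # produced by a frozen 2D foundation model on the same input views.
    # student_feats, teacher_feats: (num_pixels, C)
    return 1.0 - F.cosine_similarity(student_feats, teacher_feats, dim=-1).mean()

def multiview_consistency_loss(feats_view_a, feats_view_b):
    # Self-supervised consistency: pixels in two views that observe the same
    # 3D point should carry the same feature.
    # feats_view_a, feats_view_b: (num_matches, C) features of matched pixels
    return 1.0 - F.cosine_similarity(feats_view_a, feats_view_b, dim=-1).mean()

def training_loss(student_feats, teacher_feats, feats_a, feats_b,
                  w_distill=1.0, w_consist=0.5):
    # Combined objective; the weights here are placeholders.
    return (w_distill * distillation_loss(student_feats, teacher_feats)
            + w_consist * multiview_consistency_loss(feats_a, feats_b))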

UNITE method overview
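
As a complement to the overview figure, the sketch below illustrates the shared-representation, multi-head decoding pattern described above: every output modality reads the same per-point feature. Module names and dimensions are assumptions for illustration, not the released architecture.

import torch.nn as nn

class MultiHeadDecoder(nn.Module):
    # Decodes one shared per-point feature into the output modalities listed above.
    def __init__(self, feat_dim=256, num_classes=20, inst_dim=64,
                 clip_dim=512, artic_dim=8):
        super().__init__()
        self.semantic_head = nn.Linear(feat_dim, num_classes)    # semantic logits
        self.instance_head = nn.Linear(feat_dim, inst_dim)       # instance embeddings
        self.open_vocab_head = nn.Linear(feat_dim, clip_dim)     # open-vocabulary descriptors
        self.articulation_head = nn.Linear(feat_dim, artic_dim)  # articulation-related signals

    def forward(self, point_feats):
        # point_feats: (num_points, feat_dim) from the shared transformer backbone
        return {
            "semantics": self.semantic_head(point_feats),
            "instances": self.instance_head(point_feats),
            "open_vocab": self.open_vocab_head(point_feats),
            "articulation": self.articulation_head(point_feats),
        }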

Qualitative results

Examples on ScanNet, ScanNet++, and MultiScan. Rows show different output modalities: (a) semantic and instance segmentation, (b) open-vocabulary retrieval, and (c) articulation prediction.

Qualitative results overview grid

Interactive viewer of UNITE features, switching between RGB input, instance segmentation, open-vocabulary (CLIP) features, and articulation prediction.

BibTeX

@article{koch2025unite,
  title   = {Unified Semantic Transformer for 3D Scene Understanding},
  author  = {Koch, Sebastian and Wald, Johanna and Matsuki, Hidenobu and Hermosilla, Pedro and Ropinski, Timo and Tombari, Federico},
  journal = {arXiv preprint arXiv:2512.14364},
  year    = {2025}
}