ReSpace: Text-Driven Autoregressive 3D Indoor Scene Synthesis and Editing

Bucher, Martin JJ.; Armeni, Iro

arXiv:2506.02459 (cs)

[Submitted on 3 Jun 2025 (v1), last revised 22 Mar 2026 (this version, v5)]

Abstract: Scene synthesis and editing have emerged as a promising direction in computer graphics. Current trained approaches for 3D indoor scene generation either oversimplify object semantics through one-hot class encodings (e.g., 'chair' or 'table'), require masked diffusion for editing, ignore room boundaries, or rely on floor plan renderings that fail to capture complex layouts. LLM-based methods enable richer semantics via natural language, but lack editing functionality, are limited to rectangular layouts, or rely on weak spatial reasoning from implicit world models. We introduce ReSpace, a generative framework for autoregressive text-driven 3D indoor scene synthesis and editing. Our approach features a compact structured scene representation with explicit room boundaries that enables asset-agnostic deployment and frames scene manipulation as a next-token prediction task, supporting object addition, removal, and swapping via natural language. We employ supervised fine-tuning with a preference alignment stage to train a specialized language model for object addition that accounts for user instructions, spatial geometry, object semantics, and scene-level composition. We further introduce a voxelization-based evaluation metric that captures fine-grained geometric violations beyond 3D bounding boxes. In experiments, ReSpace surpasses the state of the art on object addition and achieves superior human-perceived quality on full scene synthesis, despite not being trained on that task.
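
To illustrate why a voxel-level check can catch (or rule out) violations that bounding boxes cannot, here is a minimal sketch. It is not the paper's metric: the function names, the 0.1 m voxel size, the grid-sampled boxes standing in for mesh samples, and the example furniture are all assumptions made for illustration.

```python
import math

def sample_box(lo, hi, step=0.1):
    """Fill an axis-aligned box with a regular grid of points.
    Hypothetical stand-in for sampling points from a real mesh."""
    def axis(a, b):
        n = int(round((b - a) / step))
        return [a + (i + 0.5) * step for i in range(n)]
    return [(x, y, z)
            for x in axis(lo[0], hi[0])
            for y in axis(lo[1], hi[1])
            for z in axis(lo[2], hi[2])]

def voxelize(points, voxel_size=0.1):
    """Map 3D points to the set of integer voxel indices they occupy."""
    return {tuple(math.floor(c / voxel_size) for c in p) for p in points}

def bounding_box(points):
    """Axis-aligned bounding box as (min_corner, max_corner)."""
    lo = tuple(min(p[i] for p in points) for i in range(3))
    hi = tuple(max(p[i] for p in points) for i in range(3))
    return lo, hi

def boxes_overlap(a, b):
    """True if two axis-aligned boxes intersect on every axis."""
    return all(a[0][i] < b[1][i] and b[0][i] < a[1][i] for i in range(3))

def voxel_overlap(points_a, points_b, voxel_size=0.1):
    """Number of voxels occupied by both objects."""
    return len(voxelize(points_a, voxel_size) & voxelize(points_b, voxel_size))

# An L-shaped sofa built from two boxes (assumed example geometry, in meters).
sofa = sample_box((0, 0, 0), (2, 1, 1)) + sample_box((0, 1, 0), (1, 2, 1))
# A side table placed in the empty corner of the "L".
table_in_corner = sample_box((1.2, 1.2, 0), (1.8, 1.8, 0.5))
# A side table pushed into the sofa itself.
table_in_sofa = sample_box((0.5, 0.5, 0), (1.1, 1.1, 0.5))

bbox_flag = boxes_overlap(bounding_box(sofa), bounding_box(table_in_corner))
corner_voxels = voxel_overlap(sofa, table_in_corner)
sofa_voxels = voxel_overlap(sofa, table_in_sofa)
print(bbox_flag, corner_voxels, sofa_voxels)
```

In this toy setup the bounding boxes of the sofa and the corner table intersect, yet the two objects share no voxels, while the second table does share voxels with the sofa. A bounding-box test alone would flag both placements identically.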

Comments:	36 pages, 19 figures, 11 tables (incl. appendix)
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
ACM classes:	I.2.10; I.2.7
Cite as:	arXiv:2506.02459 [cs.CV]
	(or arXiv:2506.02459v5 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2506.02459

Submission history

From: Martin JJ. Bucher
[v1] Tue, 3 Jun 2025 05:22:04 UTC (11,877 KB)
[v2] Tue, 10 Jun 2025 20:08:12 UTC (11,878 KB)
[v3] Thu, 25 Sep 2025 20:16:36 UTC (11,903 KB)
[v4] Mon, 1 Dec 2025 19:49:04 UTC (11,924 KB)
[v5] Sun, 22 Mar 2026 23:03:52 UTC (11,242 KB)
