TOSS: High-quality Text-guided Novel View Synthesis from a Single Image

ICLR 2024


Yukai Shi1,3, Jianan Wang3, He Cao2,3,
Boshi Tang1,3, Xianbiao Qi3, Tianyu Yang3, Yukun Huang3, Shilong Liu1,3,
Lei Zhang3, Heung-Yeung Shum1,3

1Tsinghua University    2Hong Kong University of Science and Technology    3International Digital Economy Academy

Abstract


TOSS utilizes text as semantic guidance to further constrain the solution space of NVS, and generates more plausible, controllable, multiview-consistent novel view images from a single image.

TOSS introduces text to the task of novel view synthesis (NVS) from a single RGB image. While Zero-1-to-3 has demonstrated impressive zero-shot open-set NVS capability, it treats NVS as a pure image-to-image translation problem. This approach suffers from the under-constrained nature of single-view NVS: the process lacks a means of explicit user control and often produces implausible novel views. To address this limitation, TOSS uses text as high-level semantic information to constrain the NVS solution space. TOSS fine-tunes the text-to-image Stable Diffusion model, pre-trained on large-scale text-image pairs, and introduces modules specifically tailored to image and camera pose conditioning, as well as dedicated training for pose correctness and preservation of fine details. Comprehensive experiments show that TOSS outperforms Zero-1-to-3, producing more plausible, controllable, and multiview-consistent NVS results. Extensive ablations further support these results and underscore the effectiveness and potential of the introduced semantic guidance and architecture design.
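To make "camera pose conditioning" concrete, here is a minimal sketch of one way a relative camera pose between the input view and the target view could be turned into a small conditioning vector. This page does not specify how TOSS encodes the pose; the spherical parameterization (polar angle, azimuth, radius), the four-element output layout, and the function name `relative_pose_embedding` are all assumptions made for illustration, in the style commonly used by Zero-1-to-3-like models.

```python
import math

def relative_pose_embedding(src, tgt):
    """Encode the pose change from a source camera to a target camera.

    src, tgt: (polar_deg, azimuth_deg, radius) tuples for cameras that
    look at the object center. This layout is an assumption for the
    sketch, not the encoding confirmed by the TOSS authors.
    """
    d_polar = math.radians(tgt[0] - src[0])
    # Wrap the azimuth difference into [0, 360) before converting, then
    # use sin/cos so that 0 and 360 degrees map to the same embedding.
    d_azimuth = math.radians((tgt[1] - src[1]) % 360.0)
    d_radius = tgt[2] - src[2]
    return [d_polar, math.sin(d_azimuth), math.cos(d_azimuth), d_radius]

# Hypothetical example: rotate 30 degrees in azimuth (crossing the
# 360-degree wrap), raise the camera by 15 degrees, keep the distance.
emb = relative_pose_embedding((60.0, 350.0, 1.5), (45.0, 20.0, 1.5))
```

In a diffusion model, a vector like `emb` would typically be projected and injected alongside the text and image conditions; how TOSS does this is shown only in the pipeline figure, so the sketch stops at the encoding step.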


The pipeline of TOSS (Left) and our conditioning mechanisms (Right).



Comparing previous image conditioning mechanisms (a-b) and TOSS (c).



Novel View Synthesis Results on the GSO and RTMV Datasets



Quantitative comparison of single-view novel view synthesis on GSO and RTMV.



3D consistency scores on GSO and RTMV.



Qualitative comparison of single-view NVS, on GSO (Left) and RTMV (Right).



NVS examples using TOSS on the Synthetic NeRF dataset.



Randomly sampled novel views using TOSS.



3D Generation Results on the GSO and RTMV Datasets



Quantitative comparison of single-view 3D reconstruction on GSO and RTMV.



Qualitative comparison of 3D reconstruction on GSO and RTMV.


3D generation results based on in-the-wild images.



Citation



@article{TOSS,
author    = {Yukai Shi and Jianan Wang and He Cao and Boshi Tang and Xianbiao Qi and Tianyu Yang and Yukun Huang and Shilong Liu and Lei Zhang and Heung-Yeung Shum},
title     = {TOSS: High-quality Text-guided Novel View Synthesis from a Single Image},
journal   = {arXiv:2310.10644},
year      = {2023},
}