Enhancing Multimodal-Input Object Goal Navigation by Leveraging Large Language Models for Inferring Room-Object Relationship Knowledge

New demos in a larger multi-room house (updated July 2024)



Leyuan Sun
Asako Kanezaki
Guillaume Caron
Yusuke Yoshiyasu



Proposed concept


Real-world experiments with a Kobuki mobile robot (RGB-D camera and LiDAR)



System overview


Network architecture


Abstract

Object-goal navigation is a crucial engineering task for the embodied-navigation community: the agent must navigate to an instance of a specified object category within unseen environments. Although both end-to-end and modular map-based, data-driven approaches have been investigated extensively, fully enabling an agent to comprehend its environment through perceptual knowledge and to perform object-goal navigation as efficiently as humans remains a significant challenge. Recently, large language models have shown potential for this task, thanks to their powerful capabilities for knowledge extraction and integration. In this study, we propose a data-driven, modular approach trained on a dataset that incorporates common-sense knowledge of object-to-room relationships extracted from a large language model. We use a multi-channel Swin-Unet architecture to perform multi-task learning with multimodal inputs. Results in the Habitat simulator demonstrate that our framework outperforms the baseline by an average of 10.6% on the efficiency metric, Success weighted by Path Length (SPL). A real-world demonstration shows that the proposed approach can carry out this task efficiently while traversing several rooms.
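For reference, SPL (Anderson et al., 2018) averages binary episode success weighted by path efficiency over N episodes:

    \mathrm{SPL} = \frac{1}{N}\sum_{i=1}^{N} S_i \,\frac{\ell_i}{\max(p_i,\ \ell_i)}

where S_i is the binary success indicator of episode i, ℓ_i the shortest-path distance from start to goal, and p_i the length of the path actually taken by the agent.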


Chain-of-Thought positive and negative prompts


Chain-of-Thought positive prompts


Chain-of-Thought negative prompts
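As a rough sketch of how such paired prompts can be issued programmatically, the snippet below queries a chat-style LLM once with a positive and once with a negative chain-of-thought prompt. The model name, the ask helper, and the prompt wording are illustrative assumptions, not the exact prompts used in the paper (those are shown in the figures above).

    # Illustrative sketch only: prompt wording and model choice are
    # assumptions, not the exact prompts used in this work.
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def ask(prompt: str) -> str:
        """Send a single chain-of-thought prompt and return the reply text."""
        resp = client.chat.completions.create(
            model="gpt-4",  # hypothetical model choice
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

    obj, room = "bed", "bathroom"
    positive = ask(f"Let's think step by step: how likely is a {obj} to be "
                   f"found in a {room}? Give a probability between 0 and 1.")
    negative = ask(f"Let's think step by step: how likely is a {room} to NOT "
                   f"contain a {obj}? Give a probability between 0 and 1.")
    print(positive, negative)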



Object-to-room relationship knowledge extracted from the LLM


Conventions for each room category from the MP3D dataset
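Once extracted, this knowledge can be stored as a score matrix over target objects and room categories and looked up at planning time. The sketch below is a hypothetical illustration: the scores are made up, and only a handful of MP3D room categories are shown.

    # Hypothetical object-to-room scores; values are illustrative only.
    import numpy as np

    objects = ["bed", "toilet", "plant", "tv_monitor", "chair", "sofa"]
    rooms = ["bedroom", "bathroom", "living room", "kitchen", "hallway"]

    scores = np.array([
        [0.95, 0.01, 0.05, 0.01, 0.01],  # bed
        [0.02, 0.97, 0.01, 0.01, 0.01],  # toilet
        [0.30, 0.10, 0.60, 0.30, 0.20],  # plant
        [0.40, 0.01, 0.80, 0.10, 0.01],  # tv_monitor
        [0.50, 0.05, 0.85, 0.70, 0.10],  # chair
        [0.10, 0.01, 0.90, 0.05, 0.02],  # sofa
    ])

    def best_room(obj: str) -> str:
        """Room category most strongly associated with a target object."""
        return rooms[int(np.argmax(scores[objects.index(obj)]))]

    print(best_room("toilet"))  # -> bathroom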


Demos in the Habitat simulator

PONI, CVPR 2022

Multi-channel Swin-Unet, positive prompts only

Multi-channel Swin-Unet, positive w/ negative prompts (ours)


More demos: comparisons between PONI and our method.


Real-world experiment scenario

Mesh reconstruction for illustration purposes


The floor map for the experiments shows the layout of the house (50.36 ㎡). Five starting locations and target objects are marked on the map.


Demos in a real-world house with the Kobuki mobile robot

Find bed

Find toilet

Find plant

Find tv_monitor

Find chair




Quantitative evaluation


Improvements in Success Rate (SR) and SPL compared with other state-of-the-art methods (SOTAs).



Real-world experiment scenario (updated July 2024)

Mesh reconstruction for illustration purposes


The floor map for the experiments shows the layout of the larger house (approx. 90 ㎡). Different starting locations and target objects are marked on the map.


New demos in a larger multi-room house (updated July 2024)

Find toilet

Find sofa

Find plant

Find plant


Using a voice command to publish the target-object topic.
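A minimal sketch of how a recognized voice command might be turned into a published goal, assuming ROS 1 (rospy) and a std_msgs/String message; the topic name /target_object is an assumption, not necessarily the exact interface used in the demos.

    #!/usr/bin/env python
    # Minimal sketch: publish a voice-recognized target object as a ROS topic.
    # The topic name and message type are assumptions for illustration.
    import rospy
    from std_msgs.msg import String

    def main():
        rospy.init_node("voice_goal_publisher")
        pub = rospy.Publisher("/target_object", String, queue_size=1)
        rospy.sleep(1.0)  # let the publisher register with the ROS master
        target = "toilet"  # in practice, the output of a speech recognizer
        pub.publish(String(data=target))
        rospy.loginfo("Published target object: %s", target)

    if __name__ == "__main__":
        main()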

Footage


Paper and Supplementary Material

Leveraging Large Language Model-based Room-Object Relationships Knowledge for Enhancing Multimodal-Input Object Goal Navigation
Submitted to Advanced Engineering Informatics
2024 Impact Factor: 8
(hosted on arXiv)


Citation

If you find this study useful, please cite us:

@misc{sun2024leveraging,
    title={Leveraging Large Language Model-based Room-Object Relationships Knowledge for
           Enhancing Multimodal-Input Object Goal Navigation},
    author={Leyuan Sun and Asako Kanezaki and Guillaume Caron and Yusuke Yoshiyasu},
    year={2024},
    eprint={2403.14163},
    archivePrefix={arXiv},
    primaryClass={cs.RO}
}


Acknowledgements

This template was originally made by Phillip Isola and Richard Zhang for a colorful ECCV project; the code can be found here.