Learning to Act from Actionless Videos
through Dense Correspondences

Po-Chen Ko
Jiayuan Mao
Yilun Du
Shao-Hua Sun
Joshua B. Tenenbaum
Paper arXiv Code


Framework Overview


[Framework overview figure]



Real-World Franka Emika Panda Arm with Bridge Dataset


We train our video generation model on the Bridge dataset (Ebert et al., 2022) and evaluate it on a real-world Franka Emika Panda tabletop manipulation environment.

Synthesized Videos



Robot Executions

Task: put apple in plate
Task: put banana in plate
Task: put peach in blue bowl


Meta-World


We train our video generation model on 165 videos spanning 11 tasks and evaluate on robot manipulation tasks in the Meta-World (Yu et al., 2019) simulated benchmark.

Synthesized Videos



Robot Executions

Task: Assembly
Task: Door Open
Task: Hammer
Task: Shelf Place


iTHOR


We train our video generation model on 240 videos covering 12 target objects and evaluate on object navigation tasks in the iTHOR (Kolve et al., 2017) simulated benchmark.

Synthesized Videos



Robot Navigation

Task: Pillow
Task: Soap Bar
Task: Television
Task: Toaster


Cross-Embodiment Learning (Visual Pusher)


We train our video generation model on ~200 actionless human pushing videos and evaluate in the Visual Pusher (Schmeckpeper et al., 2021; Zakka et al., 2022) robot environment.

Failed Executions


[Two failed executions: input image, video plan, execution]


Successful Executions


[Two successful executions: input image, video plan, execution]



Zero-Shot Generalization to Real-World Scenes with the Bridge Model


We show that our video diffusion model trained on the Bridge dataset (mostly toy-kitchen scenes) can already generalize to complex real-world kitchen scenarios. Note that the videos are blurry because the original video resolution is low (48×64).

Task: pick up banana (generated video)
Task: put lid on pot (generated video)
Task: put pot in sink (generated video)


Extended Analysis and Ablation Studies


Comparison of First-Frame Conditioning Strategies

We compare our proposed first-frame conditioning strategy (cat_c) with the naive frame-wise concatenation strategy (cat_t). Our method (cat_c) consistently outperforms the frame-wise concatenation baseline (cat_t) when training on the Bridge dataset. A sketch of the two strategies is given below, followed by qualitative examples of videos synthesized after 40k training steps.
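To make the distinction concrete, here is a minimal PyTorch-style sketch of the two conditioning strategies; the tensor shapes and the names noisy_video and first_frame are illustrative assumptions, not the exact implementation in our codebase.

import torch

# Illustrative shapes: video tensors are (batch, channels, time, height, width).
B, C, T, H, W = 2, 3, 8, 48, 64
noisy_video = torch.randn(B, C, T, H, W)   # noisy frames being denoised
first_frame = torch.randn(B, C, H, W)      # clean first-frame observation

# cat_c (ours): tile the first frame across time and concatenate along the
# channel axis, so every frame is directly conditioned on the observation.
cond = first_frame.unsqueeze(2).expand(-1, -1, T, -1, -1)
x_cat_c = torch.cat([noisy_video, cond], dim=1)              # (B, 2C, T, H, W)

# cat_t (baseline): prepend the first frame as an extra frame along the
# temporal axis; the conditioning signal must then propagate through
# temporal attention/convolutions to reach later frames.
x_cat_t = torch.cat([first_frame.unsqueeze(2), noisy_video], dim=2)  # (B, C, T+1, H, W)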

[Qualitative comparison: videos synthesized with cat_c vs. cat_t after 40k training steps]


Improving Inference Efficiency with Denoising Diffusion Implicit Models

This section investigates accelerating the sampling process with Denoising Diffusion Implicit Models (DDIM; Song et al., 2021). Instead of iteratively denoising for 100 steps, as reported in the main paper, we experiment with smaller numbers of denoising steps (25, 10, 5, and 3) using DDIM. We find that we can generate high-fidelity videos with only 1/10 of the sampling steps (10 steps), making the approach suitable for time-critical tasks. We present the videos synthesized with 25, 10, 5, and 3 denoising steps below.
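As a rough illustration of how the step count enters the sampler, here is a minimal sketch of deterministic DDIM sampling (eta = 0); model, x_T, and alphas_cumprod are placeholders for a trained noise-prediction network, the initial Gaussian noise, and the training-time noise schedule, and do not reflect the exact implementation in our codebase.

import torch

@torch.no_grad()
def ddim_sample(model, x_T, alphas_cumprod, num_steps=10):
    # Deterministic DDIM sampling (eta = 0) over a reduced set of timesteps.
    # model(x_t, t) is assumed to predict the noise added at timestep t.
    T = alphas_cumprod.shape[0]
    timesteps = torch.linspace(T - 1, 0, num_steps).long()  # evenly spaced subset
    x = x_T
    for i, t in enumerate(timesteps):
        a_t = alphas_cumprod[t]
        # Cumulative alpha of the next (less noisy) timestep; 1.0 at the final step.
        a_prev = alphas_cumprod[timesteps[i + 1]] if i + 1 < num_steps else torch.tensor(1.0)
        eps = model(x, t)                                   # predicted noise
        x0 = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()      # predicted clean sample
        x = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps  # DDIM update, no noise term
    return x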

DDIM 25 steps: The quality of the synthesized videos is satisfactory despite minor temporal inconsistencies (the gripper or objects occasionally disappear or are duplicated) compared to the DDPM (100 steps) videos reported in the previous sections.


DDIM 10 steps: The quality of the synthesized videos is similar to that of the videos generated with 25 steps.


DDIM 5 steps: The temporal inconsistency issue is more severe with only 5 denoising steps.


DDIM 3 steps: The temporal inconsistency issue becomes even more severe, and some objects are blurry.


BibTeX

                
@article{Ko2023Learning,
title={{Learning to Act from Actionless Videos through Dense Correspondences}},
author={Ko, Po-Chen and Mao, Jiayuan and Du, Yilun and Sun, Shao-Hua and Tenenbaum, Joshua B},
journal={arXiv:2310.08576},
year={2023},
}