Learning to Act from Actionless Videos
through Dense Correspondences

Po-Chen Ko
Jiayuan Mao
Yilun Du
Shao-Hua Sun
Joshua B. Tenenbaum
Paper arXiv Code


Framework Overview


[Framework overview figure]



Real-World Franka Emika Panda Arm with Bridge Dataset


We train our video generation model on the Bridge dataset (Ebert et al., 2022) and evaluate it on a real-world Franka Emika Panda tabletop manipulation environment.

Synthesized Videos



Robot Executions

Task: put apple in plate
Task: put banana in plate
Task: put peach in blue bowl


Meta-World


We train our video generation model on 165 videos spanning 11 tasks and evaluate on robot manipulation tasks in the Meta-World (Yu et al., 2019) simulated benchmark.

Synthesized Videos



Robot Executions

Task: Assembly
Task: Door Open
Task: Hammer
Task: Shelf Place


iTHOR


We train our video generation model on 240 videos covering 12 target objects and evaluate on object navigation tasks in the iTHOR (Kolve et al., 2017) simulated benchmark.

Synthesized Videos



Robot Navigation

Task: Pillow
Task: Soap Bar
Task: Television
Task: Toaster


Cross-Embodiment Learning (Visual Pusher)


We train our video generation model on ~200 actionless human pushing videos and evaluate in the Visual Pusher (Schmeckpeper et al., 2021; Zakka et al., 2022) robot environment.

Failed Executions


[Two failed executions: input image, video plan, execution]


Successful Executions


[Two successful executions: input image, video plan, execution]



Zero-Shot Generalization to Real-World Scenes with the Bridge Model


We show that our video diffusion model trained on the Bridge dataset (mostly toy-kitchen scenes) can already generalize to complex real-world kitchen scenarios. Note that the videos are blurry because the original video resolution is low (48×64).

Task: pick up banana (generated video)
Task: put lid on pot (generated video)
Task: put pot in sink (generated video)


Extended Analysis and Ablation Studies


Comparison of First-Frame Conditioning Strategies

We compare our proposed first-frame conditioning strategy (cat_c) with the naive frame-wise concatenation strategy (cat_t). Our method (cat_c) consistently outperforms the frame-wise concatenation baseline (cat_t) when training on the Bridge dataset. A sketch of the two strategies is given below, followed by qualitative examples of videos synthesized after 40k training steps.
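To make the distinction concrete, here is a minimal PyTorch-style sketch of the two conditioning strategies; the tensor shapes and the names noisy_video and first_frame are illustrative assumptions, not the exact implementation in our codebase.

import torch

# Illustrative shapes: video tensors are (batch, channels, time, height, width).
B, C, T, H, W = 2, 3, 8, 48, 64
noisy_video = torch.randn(B, C, T, H, W)   # noisy frames being denoised
first_frame = torch.randn(B, C, H, W)      # clean first-frame observation

# cat_c (ours): tile the first frame across time and concatenate along the
# channel axis, so every frame is directly conditioned on the observation.
cond = first_frame.unsqueeze(2).expand(-1, -1, T, -1, -1)
x_cat_c = torch.cat([noisy_video, cond], dim=1)              # (B, 2C, T, H, W)

# cat_t (baseline): prepend the first frame as an extra frame along the
# temporal axis; the conditioning signal must then propagate through
# temporal attention/convolutions to reach later frames.
x_cat_t = torch.cat([first_frame.unsqueeze(2), noisy_video], dim=2)  # (B, C, T+1, H, W)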

[Qualitative comparison: videos synthesized with cat_c vs. cat_t after 40k training steps]


Improving Inference Efficiency with Denoising Diffusion Implicit Models

This section investigates accelerating the sampling process with Denoising Diffusion Implicit Models (DDIM; Song et al., 2021). Instead of iteratively denoising for 100 steps, as reported in the main paper, we experiment with smaller numbers of denoising steps (25, 10, 5, and 3) using DDIM. We find that we can generate high-fidelity videos with only 1/10 of the sampling steps (10 steps), making the approach suitable for time-critical tasks. We present the videos synthesized with 25, 10, 5, and 3 denoising steps below.
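As a rough illustration of how the step count enters the sampler, here is a minimal sketch of deterministic DDIM sampling (eta = 0); model, x_T, and alphas_cumprod are placeholders for a trained noise-prediction network, the initial Gaussian noise, and the training-time noise schedule, and do not reflect the exact implementation in our codebase.

import torch

@torch.no_grad()
def ddim_sample(model, x_T, alphas_cumprod, num_steps=10):
    # Deterministic DDIM sampling (eta = 0) over a reduced set of timesteps.
    # model(x_t, t) is assumed to predict the noise added at timestep t.
    T = alphas_cumprod.shape[0]
    timesteps = torch.linspace(T - 1, 0, num_steps).long()  # evenly spaced subset
    x = x_T
    for i, t in enumerate(timesteps):
        a_t = alphas_cumprod[t]
        # Cumulative alpha of the next (less noisy) timestep; 1.0 at the final step.
        a_prev = alphas_cumprod[timesteps[i + 1]] if i + 1 < num_steps else torch.tensor(1.0)
        eps = model(x, t)                                   # predicted noise
        x0 = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()      # predicted clean sample
        x = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps  # DDIM update, no noise term
    return x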

DDIM 25 steps: The quality of the synthesized videos is satisfactory despite minor temporal inconsistencies (the gripper or objects occasionally disappear or are duplicated) compared to the DDPM (100 steps) videos reported in the previous sections.


DDIM 10 steps: The quality of the synthesized videos is similar to that of the videos generated with 25 steps.


DDIM 5 steps: The temporal inconsistency issue is more severe with only 5 denoising steps.


DDIM 3 steps: The temporal inconsistency issue becomes even more severe, and some objects are blurry.


BibTeX

                
@article{Ko2023Learning,
title={{Learning to Act from Actionless Videos through Dense Correspondences}},
author={Ko, Po-Chen and Mao, Jiayuan and Du, Yilun and Sun, Shao-Hua and Tenenbaum, Joshua B},
journal={arXiv:2310.08576},
year={2023},
}