Motion Guidance: Diffusion-Based Image Editing with Differentiable Motion Estimators

University of Michigan
ICLR 2024
Correspondence to: dgeng@umich.edu

Motion Guidance is a method for motion-based image editing. Given an image to edit and a flow field indicating where each pixel should go, we produce a new image with the desired motion.


Our method is zero-shot and supports motions such as translations, rotations, stretches, scaling, shrinking, homographies, and general deformations. It works on both generated and real images.


(If you are interested in using the interactive flow visualizations from this page for your own project, a minimum working example is provided here.)

Examples

"a photo of a cat"

Image

Target Flow

(Hover over me)

Image

Source Image

(Hover over me)

Image

Motion Edited

"an apple on a wooden table"

Image

Target Flow

(Hover over me)

Image

Source Image

(Hover over me)

Image

Motion Edited

[real image]

Image

Target Flow

(Hover over me)

Image

Source Image

(Hover over me)

Image

Motion Edited

"a painting of a sunflower"

Image

Target Flow

(Hover over me)

Image

Source Image

(Hover over me)

Image

Motion Edited

"a teapot floating in water"

Image

Target Flow

(Hover over me)

Image

Source Image

(Hover over me)

Image

Motion Edited

"a photo of a laptop"

Image

Target Flow

(Hover over me)

Image

Source Image

(Hover over me)

Image

Motion Edited

"a photo of a topiary"

Image

Target Flow

(Hover over me)

Image

Source Image

(Hover over me)

Image

Motion Edited

[real image]

Image

Target Flow

(Hover over me)

Image

Source Image

(Hover over me)

Image

Motion Edited

[real image]

Image

Target Flow

(Hover over me)

Image

Source Image

(Hover over me)

Image

Motion Edited

See below for more examples.

Method

To achieve motion-based editing, we propose using guidance [1] during sampling in a diffusion model. At each time step in the reverse process, we perturb the noisy estimate in the direction that minimizes some loss function. As a loss function, we propose using the difference between the desired motion and the current motion of the noisy sample, with respect to the original image, as estimated by an off-the-shelf (differentiable) optical flow network [2]. Effectively, we find a sample that is likely under the diffusion model, while attaining a low loss.
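As a concrete sketch, one guided denoising step might look like the following. This is a schematic, not the paper's exact implementation: `denoiser` stands in for the diffusion model's one-step estimate of the clean image, and `flow_net` stands in for a differentiable optical flow network such as RAFT; both names and the guidance scale are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def guided_step(x_t, t, denoiser, flow_net, src_img, target_flow,
                guidance_scale=100.0):
    """One reverse-diffusion step with motion guidance (schematic).

    denoiser(x_t, t)        -> hypothetical one-step clean-image estimate
    flow_net(src, tgt)      -> hypothetical differentiable flow estimator
    target_flow             -> the desired motion, shape (B, 2, H, W)
    """
    # Track gradients through the noisy sample.
    x_t = x_t.detach().requires_grad_(True)

    # One-step estimate of the clean image from the noisy sample.
    x0_hat = denoiser(x_t, t)

    # Estimated motion of the current sample relative to the source image.
    flow_hat = flow_net(src_img, x0_hat)

    # Guidance loss: difference between estimated and desired motion.
    loss = F.l1_loss(flow_hat, target_flow)

    # Perturb the sample in the direction that decreases the loss.
    grad = torch.autograd.grad(loss, x_t)[0]
    return x_t.detach() - guidance_scale * grad
```

In the full method this perturbation would be applied at every step of the reverse process, alongside the usual denoising update.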


In order to achieve good results, we find we also need to use a few tricks, including color regularization, reconstruction guidance [3] [4], occlusion masking, and edit masking. Please see the paper for additional details.



A GUI for Flow Construction

All optical flows used in this paper (with the exception of the Motion Transfer results) are generated by composing elementary flows and masking them with segmentation masks from SAM. These elementary flows consist of translations, rotations, scalings, and more complex deformations. Below, we show examples of how these flows can be created using a simple UI:

Segmenting out the topiary tree, constructing a translation optical flow field, and then applying our motion editing method. (Not real-time)
Segmenting out the apple, constructing a shrinking optical flow field, and then applying our motion editing method. (Not real-time)

Construction of elementary flows with a GUI is fairly straightforward, using a click and drag interface to specify translations, rotations, scaling, and stretching. More complex deformations can be constructed by composing or interpolating these flows:

Flow colorwheel for reference. Color represents flow direction, and brightness represents magnitude.
Translations: These can be defined by clicking and dragging a vector
Rotations: The initial click defines the center of rotation, and dragging further away increases the angle of rotation
Scaling: The initial click defines the center of scaling. Dragging outside the circle indicates magnifying, inside the circle indicates shrinking
Stretching: Stretches with respect to a line defined by the first click. The notch denotes the boundary between squeezing and stretching
Interpolated Stretching: We can interpolate between stretches and squeezes, yielding a continuous and complex deformation, as seen in the topiary example.
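The elementary flows above are simple closed-form vector fields. A minimal sketch of how they might be constructed and masked with a segmentation mask (function names and conventions are our own, not the paper's code):

```python
import numpy as np

def _grid(h, w):
    """Pixel coordinate grids: xs varies along columns, ys along rows."""
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float32)
    return xs, ys

def translation_flow(h, w, dx, dy):
    """Every pixel moves by the same vector (dx, dy)."""
    flow = np.empty((h, w, 2), np.float32)
    flow[..., 0] = dx
    flow[..., 1] = dy
    return flow

def rotation_flow(h, w, cx, cy, theta):
    """Displacement that rotates each pixel by theta about (cx, cy)."""
    xs, ys = _grid(h, w)
    x, y = xs - cx, ys - cy
    c, s = np.cos(theta), np.sin(theta)
    return np.stack([x * c - y * s - x, x * s + y * c - y], axis=-1)

def scaling_flow(h, w, cx, cy, scale):
    """Displacement that scales pixels about (cx, cy); scale > 1 magnifies."""
    xs, ys = _grid(h, w)
    return np.stack([(xs - cx) * (scale - 1),
                     (ys - cy) * (scale - 1)], axis=-1)

def mask_flow(flow, mask):
    """Zero the flow outside a binary segmentation mask."""
    return flow * mask[..., None].astype(flow.dtype)
```

More complex deformations can then be built by summing or interpolating these fields before masking.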

Various Motions – Same Source Image

Below, we show various translations, scalings, and stretches applied to the same source image.

"a teapot floating in water"

Image

Target Flow

(Hover over me)

Image

Source Image

(Hover over me)

Image

Motion Edited

"a teapot floating in water"

Image

Target Flow

(Hover over me)

Image

Source Image

(Hover over me)

Image

Motion Edited

"a teapot floating in water"

Image

Target Flow

(Hover over me)

Image

Source Image

(Hover over me)

Image

Motion Edited

"a teapot floating in water"

Image

Target Flow

(Hover over me)

Image

Source Image

(Hover over me)

Image

Motion Edited

"a teapot floating in water"

Image

Target Flow

(Hover over me)

Image

Source Image

(Hover over me)

Image

Motion Edited

"a teapot floating in water"

Image

Target Flow

(Hover over me)

Image

Source Image

(Hover over me)

Image

Motion Edited

"a teapot floating in water"

Image

Target Flow

(Hover over me)

Image

Source Image

(Hover over me)

Image

Motion Edited

Motion Transfer

In some cases, we can extract motion from a video and apply that motion to images. For example, below we extract the motion from a video of the Earth spinning and apply it to various real animal images. The extracted flow is not perfectly accurate and does not perfectly align with the source images, but because we optimize a soft objective, our method is still able to produce reasonable results.

Frame 1, Frame 2, and the flow estimated between them.

Each result below shows the target flow, the source image, and the motion-edited result:

[real image]

[real image]

[real image]

Additional Examples

"a photo of a cute humanoid robot on a solid background"

Image

Target Flow

(Hover over me)

Image

Source Image

(Hover over me)

Image

Motion Edited

"an aerial photo of a river"

Image

Target Flow

(Hover over me)

Image

Source Image

(Hover over me)

Image

Motion Edited

"a photo of a modern house"

Image

Target Flow

(Hover over me)

Image

Source Image

(Hover over me)

Image

Motion Edited

"a photo of a lion"

Image

Target Flow

(Hover over me)

Image

Source Image

(Hover over me)

Image

Motion Edited

"a photo of a lion"

Image

Target Flow

(Hover over me)

Image

Source Image

(Hover over me)

Image

Motion Edited

[real image]

Image

Target Flow

(Hover over me)

Image

Source Image

(Hover over me)

Image

Motion Edited

[real image]

Image

Target Flow

(Hover over me)

Image

Source Image

(Hover over me)

Image

Motion Edited

[real image]

Image

Target Flow

(Hover over me)

Image

Source Image

(Hover over me)

Image

Motion Edited

"a painting of a lone tree"

Image

Target Flow

(Hover over me)

Image

Source Image

(Hover over me)

Image

Motion Edited

"a painting of a lone tree"

Image

Target Flow

(Hover over me)

Image

Source Image

(Hover over me)

Image

Motion Edited

Limitations

Our method suffers from several limitations that would benefit from further research. Below, we show examples of failure cases. (a) Because we use an off-the-shelf optical flow network, some flow prompts, such as a vertical flip, are severely out-of-distribution and fail. (b) Because we optimize a soft objective, seeking a sample that is both likely under the diffusion model and minimizes the guidance energy, we sometimes see loss of identity in our generations. (c) The one-step approximation is sometimes unstable and can diverge catastrophically. Additionally, we inherit the limitations of diffusion models and Universal Guided Diffusion [4], such as slow sampling speeds.

Related Works

DragGAN enables drag-based editing of images using pretrained GANs. Users select a point on an image, and indicate where it should move to.


Inspired by DragGAN, Drag Diffusion and Dragon Diffusion port the drag-based editing capabilities of DragGAN to more versatile diffusion models.


Related works have proposed guidance with various objectives, including an LPIPS loss, "readout heads", the internal features of the diffusion network itself, and segmentation, detection, and facial recognition networks.

References

[1] Dhariwal, Nichol, “Diffusion Models Beat GANs on Image Synthesis”, NeurIPS, 2021.

[2] Teed, Deng, “RAFT: Recurrent All-Pairs Field Transforms for Optical Flow”, ECCV, 2020.

[3] Ho et al., “Video Diffusion Models”, arXiv, June 2022.

[4] Bansal et al., “Universal Guidance for Diffusion Models”, ICLR, 2024.


BibTeX

@article{geng2024motion,
  author    = {Geng, Daniel and Owens, Andrew},
  title     = {Motion Guidance: Diffusion-Based Image Editing with Differentiable Motion Estimators},
  journal   = {International Conference on Learning Representations},
  year      = {2024},
}