Talk-to-Edit: Fine-Grained Facial Editing via Dialog

ICCV 2021

Paper

Abstract

Facial editing is an important task in vision and graphics with numerous applications. However, existing works are incapable of delivering a continuous and fine-grained editing mode (e.g., editing a slightly smiling face into a big laughing one) through natural interactions with users. In this work, we propose Talk-to-Edit, an interactive facial editing framework that performs fine-grained attribute manipulation through dialog between the user and the system. Our key insight is to model a continual "semantic field" in the GAN latent space. 1) Unlike previous works that regard editing as traversing straight lines in the latent space, here fine-grained editing is formulated as finding a curving trajectory that respects the fine-grained attribute landscape on the semantic field. 2) The curvature at each step is location-specific, determined by the input image as well as the user's language request. 3) To engage the user in a meaningful dialog, our system generates language feedback by considering both the user request and the current state of the semantic field.

We also contribute CelebA-Dialog, a visual-language facial editing dataset to facilitate large-scale study. Specifically, each image has manually annotated fine-grained attribute labels as well as template-based textual descriptions in natural language. Extensive quantitative and qualitative experiments demonstrate the superiority of our framework in terms of 1) the smoothness of fine-grained editing, 2) identity/attribute preservation, and 3) visual photorealism and dialog fluency. Notably, a user study validates that our overall system is consistently favored by around 80% of the participants.


We propose Talk-to-Edit, an interactive facial editing framework that performs fine-grained facial editing through dialog between the user and the system.

The Pipeline

Talk-to-Edit

The pipeline consists of three components:

  1. Language Encoder: understands user request.
  2. Semantic Field: performs fine-grained editing.
  3. Talk Module: provides meaningful natural language feedback.
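The interplay of the three components can be sketched as a single dialog turn. This is a toy illustration only: the class names, the keyword-matching encoder, and the stubbed editing step are assumptions made for this sketch, not the authors' actual interfaces.

```python
# Hypothetical sketch of one Talk-to-Edit dialog turn. All class names and
# interfaces here are illustrative assumptions, not the real system's API.

from dataclasses import dataclass

@dataclass
class Request:
    attribute: str   # e.g. "smiling"
    direction: int   # +1 to strengthen the attribute, -1 to weaken it

class LanguageEncoder:
    """Maps a user utterance to an editing request (toy keyword matcher)."""
    def encode(self, text: str) -> Request:
        direction = -1 if ("less" in text or "remove" in text) else +1
        attribute = "smiling" if "smile" in text else "bangs"
        return Request(attribute, direction)

class SemanticField:
    """Performs one fine-grained editing step on the latent code (stub)."""
    def edit(self, latent, request: Request):
        # In the real system this moves `latent` along learned field lines.
        return latent, f"{request.attribute} degree changed by {request.direction}"

class TalkModule:
    """Generates natural-language feedback about the edit (stub)."""
    def feedback(self, state: str) -> str:
        return f"Done: {state}. Anything else you'd like to adjust?"

def dialog_turn(latent, utterance: str):
    request = LanguageEncoder().encode(utterance)
    latent, state = SemanticField().edit(latent, request)
    return latent, TalkModule().feedback(state)

_, reply = dialog_turn([0.0] * 512, "make the smile a bit bigger")
print(reply)
```

The point of the structure is that language understanding, latent-space editing, and feedback generation are decoupled, so each component can be improved independently.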


The Semantic Field

In the StyleGAN latent space, the attribute score is a scalar field. The gradient of the attribute score field with respect to the latent code is a vector field, which we term the "semantic field". We learn this semantic field and move the latent code along the learned field lines to achieve facial editing.
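The idea of following field lines can be illustrated numerically. The sketch below is a minimal stand-in, assuming a toy sigmoid attribute scorer in place of the real predictor; the step size and latent dimension are arbitrary choices for illustration.

```python
# Minimal numerical sketch of editing along a "semantic field": the scalar
# attribute field f(z) is a toy sigmoid scorer (an assumption for this demo),
# and its gradient w.r.t. the latent code z defines the vector field we follow.

import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=8)            # weights of the toy attribute scorer

def attribute_score(z):
    """Toy scalar attribute field f(z) = sigmoid(w . z)."""
    return 1.0 / (1.0 + np.exp(-w @ z))

def semantic_field(z):
    """Gradient of the attribute score w.r.t. the latent code z."""
    s = attribute_score(z)
    return s * (1.0 - s) * w      # analytic sigmoid gradient

def edit(z, steps=20, step_size=0.5):
    """Move the latent code along field lines to raise the attribute score."""
    for _ in range(steps):
        g = semantic_field(z)
        z = z + step_size * g / (np.linalg.norm(g) + 1e-8)  # unit-norm step
    return z

z0 = rng.normal(size=8)
z1 = edit(z0)
print(attribute_score(z0), "->", attribute_score(z1))
```

Because the gradient is re-evaluated at each step, the trajectory is location-specific and generally curved, unlike a single straight-line traversal of the latent space.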


The Dataset

CelebA-Dialog

We contribute a large-scale visual-language face dataset named CelebA-Dialog:

  1. Facial images are annotated with rich fine-grained labels, which classify each attribute into multiple degrees according to its semantic meaning.
  2. Each image is accompanied by captions describing its attributes and a sample user request.
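Based on the description above, one annotation record might look like the following. The field names, the degree scale, and the caption/request strings are all hypothetical, shown only to make the structure concrete; they do not reflect the dataset's actual schema.

```python
# Hypothetical CelebA-Dialog annotation record. Field names, the degree
# values, and the text strings are illustrative assumptions only.

record = {
    "image": "000001.jpg",
    "attributes": {            # fine-grained degree labels per attribute
        "smiling": 3,          # degrees run from weakest to strongest
        "bangs": 0,
        "eyeglasses": 0,
    },
    "caption": "She wears a big smile on her face.",
    "user_request": "Can you make her smile a little less obvious?",
}

# A fine-grained editor consumes degree labels rather than a binary flag,
# which is what enables "slightly more / slightly less" style requests.
target = {**record["attributes"], "smiling": record["attributes"]["smiling"] - 1}
print(target["smiling"])
```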


Illustration of CelebA-Dialog dataset. We show example images and annotations for the smiling attribute. Below the images are the attribute degrees and the corresponding textual descriptions. We also show the fine-grained label distribution of the smiling attribute.

Qualitative Results


Qualitative results on the manipulation of five attributes respectively: Bangs, Eyeglasses, Beard, Smiling, Young.

Citation

@InProceedings{jiang2021talkedit,
  author    = {Jiang, Yuming and Huang, Ziqi and Pan, Xingang and Loy, Chen Change and Liu, Ziwei},
  title     = {Talk-to-Edit: Fine-Grained Facial Editing via Dialog},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision},
  year      = {2021}
}


Contact


Yuming Jiang
Email: yuming002 at e.ntu.edu.sg



Ziqi Huang
Email: hu0007qi at e.ntu.edu.sg