Face Synthesis from Facial Identity Features

Paper source: https://arxiv.org/abs/1701.04851

Paper Authors: authors

1. Introduction

This paper presents a method for synthesizing a frontal, neutral-expression image of a person’s face given an input facial photograph. This is achieved by learning to generate facial landmarks and textures from features extracted from a facial-recognition network.

The presented method is largely invariant to lighting, pose and facial expression of the original input images, because the decoder network is trained only using frontal, neutral-expression photos.

The output images can be used for many applications, such as analyzing facial attributes, exposure and white balance adjustment, or creating a 3-D avatar.

2. Method

2.1 Preprocessing

Assumption: The training set consists of front-facing, neutral-expression images.

Each image is decomposed into a texture T and a set of landmarks L using off-the-shelf landmark detection tools and the image warping technique described in Section 3.

2.2 Encoder

A pre-trained FaceNet with fixed parameters is used to extract an f-dimensional feature vector F from the input image I. The features are taken from the lowest FaceNet layer that is not spatially varying, which has 1024 dimensions. A fully connected layer mapping 1024 to f dimensions is trained on top of this layer.

2.3 Decoder

The feature vector F is mapped separately to landmarks L and a texture T, and the final result is rendered by warping.
L is generated by a shallow multi-layer perceptron with ReLU activations applied to F. T is generated by a deep CNN. The decoder then combines the texture and landmarks using the differentiable warping technique.

2.4 Training Loss

The first term penalizes the mean squared error between the predicted landmarks and the ground-truth landmarks.
The second term penalizes the mean absolute error between the predicted texture and the ground-truth texture.
The third term penalizes the dissimilarity between the FaceNet embeddings of the input and output images, measured as negative cosine similarity.
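
The three terms can be sketched in plain Python as follows. This is only an illustration: the function names, the flat-list data layout, and the equal term weights are assumptions made here, since the summary does not state how the terms are weighted.

```python
import math

def mean_squared_error(pred, target):
    # Landmark term: average squared difference over all coordinates.
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def mean_absolute_error(pred, target):
    # Texture term: average absolute difference over all pixel values.
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

def negative_cosine_similarity(a, b):
    # Embedding term: -1 when the embeddings point the same way, +1 when opposite.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return -dot / (norm_a * norm_b)

def total_loss(pred_landmarks, true_landmarks,
               pred_texture, true_texture,
               input_embedding, output_embedding,
               weights=(1.0, 1.0, 1.0)):
    # The weights are hypothetical; equal weighting is an assumption of this sketch.
    w_landmarks, w_texture, w_embedding = weights
    return (w_landmarks * mean_squared_error(pred_landmarks, true_landmarks)
            + w_texture * mean_absolute_error(pred_texture, true_texture)
            + w_embedding * negative_cosine_similarity(input_embedding, output_embedding))
```

With a perfect prediction the first two terms vanish and the third reaches its minimum of -1, so the lowest possible value under equal weights is -1 rather than 0.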

3. Tricks

Differentiable Image Warping
First, a dense flow field is constructed from the sparse displacements defined at the control points using spline interpolation.

Then, the flow field is applied to I0 (the original image) in order to obtain I1 (warped image) using bilinear interpolation, which is differentiable.
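
A minimal sketch of the second step, applying a dense flow field with bilinear interpolation, is given below. It assumes a grayscale image stored as a list of rows, a backward-warping convention (each output pixel reads from its own position plus the flow), and border clamping. These conventions and the function names are choices made for this illustration and are not taken from the paper.

```python
import math

def bilinear_sample(image, x, y):
    """Sample a grayscale image (list of rows) at a real-valued position (x, y)."""
    height, width = len(image), len(image[0])
    # Clamp so that positions outside the image reuse the border pixels.
    x = min(max(x, 0.0), width - 1.0)
    y = min(max(y, 0.0), height - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    wx, wy = x - x0, y - y0
    # Blend the four neighbouring pixels; the result is piecewise linear in x and y.
    top = (1.0 - wx) * image[y0][x0] + wx * image[y0][x1]
    bottom = (1.0 - wx) * image[y1][x0] + wx * image[y1][x1]
    return (1.0 - wy) * top + wy * bottom

def warp(image, flow):
    """Build I1 from I0: flow[y][x] = (dx, dy), and I1[y][x] = I0 sampled at (x + dx, y + dy)."""
    height, width = len(image), len(image[0])
    return [[bilinear_sample(image, x + flow[y][x][0], y + flow[y][x][1])
             for x in range(width)]
            for y in range(height)]
```

Because each output value is a weighted sum of four input pixels with weights that depend linearly on the sampling position, gradients can flow both to the pixel values and, away from integer positions, to the flow field itself.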

Differentiable Spline Interpolation
This project uses polyharmonic interpolation. A linear interpolant is chosen because it is more robust to overshooting than the thin-plate spline, and the resulting linearization artifacts are difficult to detect in the final texture.
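
To make the idea concrete, here is a small sketch of polyharmonic interpolation with the linear kernel phi(r) = r in 2-D. It uses the standard textbook form, a weighted sum of kernels plus an affine term, fitted by solving one linear system. Whether this matches the paper's exact formulation is an assumption, and all function names are hypothetical. For a flow field, each displacement component would be interpolated separately in this way.

```python
import math

def solve_linear_system(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        # Pivoting is required here because the kernel block has a zero diagonal.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in reversed(range(n)):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x

def fit_linear_polyharmonic(points, values):
    """Fit s(x, y) = sum_i w_i * |p - p_i| + a + b*x + c*y through the control points.

    The control points must be distinct and not all collinear.
    """
    n = len(points)
    size = n + 3
    system = [[0.0] * size for _ in range(size)]
    for i, (xi, yi) in enumerate(points):
        for j, (xj, yj) in enumerate(points):
            system[i][j] = math.hypot(xi - xj, yi - yj)  # linear kernel phi(r) = r
        system[i][n:] = [1.0, xi, yi]
        # Side conditions: the weights are orthogonal to affine functions.
        system[n][i], system[n + 1][i], system[n + 2][i] = 1.0, xi, yi
    return solve_linear_system(system, list(values) + [0.0, 0.0, 0.0])

def evaluate_interpolant(points, coeffs, x, y):
    n = len(points)
    radial = sum(w * math.hypot(x - px, y - py)
                 for w, (px, py) in zip(coeffs[:n], points))
    a, b, c = coeffs[n:]
    return radial + a + b * x + c * y
```

Fitting reduces to a linear solve and evaluation to sums and products, so the whole interpolation is differentiable with respect to the control-point values.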

Data augmentation based on morphing
Producing random face morphs: Given a seed face A, the authors first pick a target face by selecting one of the k = 200 nearest neighbors of A at random. Given A and the random neighbor B, their landmarks and textures are independently linearly interpolated, where the interpolation weights are drawn uniformly from [0, 1].
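
The morphing step can be sketched as below. The sketch assumes that the neighbors of A are already sorted nearest-first (the distance measure is not described in this summary), that each face is a dictionary of flat lists, and that "independently" means the landmarks and the texture each receive their own random weight. All names are hypothetical.

```python
import random

def lerp(a, b, weight):
    # Linear interpolation between two equally long lists of numbers.
    return [(1.0 - weight) * x + weight * y for x, y in zip(a, b)]

def random_morph(seed_face, sorted_neighbors, rng, k=200):
    """Morph seed_face toward one of its k nearest neighbors, chosen at random.

    seed_face and every neighbor are dicts with 'landmarks' and 'texture' lists;
    sorted_neighbors is assumed to be ordered nearest-first.
    """
    target = rng.choice(sorted_neighbors[:k])
    # Separate weights for shape and appearance, each drawn uniformly from [0, 1].
    landmark_weight = rng.uniform(0.0, 1.0)
    texture_weight = rng.uniform(0.0, 1.0)
    return {
        "landmarks": lerp(seed_face["landmarks"], target["landmarks"], landmark_weight),
        "texture": lerp(seed_face["texture"], target["texture"], texture_weight),
    }
```

Passing a seeded generator such as `random.Random(0)` as `rng` makes the augmentation reproducible.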

Gradient-domain Compositing: To make the augmented images more realistic, they paste the morphed face onto an original background using a gradient-domain editing technique.

4. Experiments and results

4.1 Collecting photographs

About 12K images from the VGG Face dataset are chosen as the training data for this project. These images are aligned to undo any roll transformation, scaled to an interocular distance of 55 pixels, and cropped to 224 x 224. Most individuals appear in multiple images, and the final set contains about 1K unique identities with 3 or more images each. For each individual, all images are averaged by morphing. Some backgrounds with high noise are removed manually. The test set comes from the Labeled Faces in the Wild (LFW) dataset.
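
The alignment step can be illustrated by computing, from the two eye positions, the rotation that levels the eyes and the scale that brings them 55 pixels apart. This is a hedged sketch: the use of eye centers as the reference points and the function names are assumptions, and the placement of the 224 x 224 crop is not covered because the summary does not specify it.

```python
import math

def alignment_from_eyes(left_eye, right_eye, target_distance=55.0):
    """Return (angle, scale) that levels the eye line and sets the interocular distance."""
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    roll = math.atan2(dy, dx)                     # tilt of the eye line from horizontal
    scale = target_distance / math.hypot(dx, dy)  # resize factor for the eye distance
    return -roll, scale                           # rotating by -roll undoes the tilt

def apply_similarity(point, center, angle, scale):
    """Rotate by angle and scale a point about center."""
    x, y = point[0] - center[0], point[1] - center[1]
    c, s = math.cos(angle), math.sin(angle)
    return (center[0] + scale * (c * x - s * y),
            center[1] + scale * (s * x + c * y))
```

Applying the same transform to every pixel coordinate (or its inverse when resampling) yields an image in which the eyes lie on a horizontal line at the target spacing.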

4.2 Model Robustness

The shape and skin tone of the face are stable across different poses and illumination, but variations such as hair style and hair color are still captured in the output image.

4.3 Applications

3D Model Fitting: The fitting process produces a well-aligned 3D face mesh that could be used directly as a VR avatar, or could serve as an initialization for further processing, for example in methods that track facial geometry in video.

Automatic Photo Adjustment: The algorithm adjusts exposure and white balance for the face, regardless of the effect on other regions of the image, which produces more consistent results across different photos of the same person.

5. Conclusion

5.1 Advantages

This project uses a convolutional neural network, which is robust to variation in the inputs, such as lighting, pose, and expression.
The method supports a variety of downstream uses, such as photo balancing and customized 3D avatars.
Using spline interpolation as a differentiable module inside a neural network is a fresh idea, and the authors encourage further application of this technique.

5.2 Future works

The overall quality of the generated images could be improved.
There are still many noise artifacts, especially in the backgrounds. The model could be trained on a broader selection of images, or pixel-level losses could be avoided entirely.

6. Thoughts from the Reviewer

General comment:
The method presented in this paper appears logically sound. It uses a convolutional neural network to make the whole model invariant to lighting, pose, and facial detail, which is the key difference from prior frontalization methods, but this also results in more noise in the background of the output images. Landmarks and spline interpolation are used to make the results more realistic. However, the paper does not clearly show how much this method improves subsequent tasks compared with prior methods.

Possible Problems of this Paper:

The map from the FaceNet feature vector to face images is underconstrained. The authors argue that a map from the feature vector to the “normalized” face image is intuitively one-to-one (bijective), but the paper gives no mathematical proof of this.
The training loss has three terms. The authors only compare results “with FaceNet loss” and “without FaceNet loss”; the necessity of the landmark loss is not examined.
The paper lacks an objective criterion for evaluating the generated results. The method looks visually better than prior frontalization methods in some cases, but that judgment is subjective.
Frontalization is usually used to pre-process face images so that subsequent tasks (e.g. 3D model fitting) become more efficient and more accurate. The authors do not say enough about how much their method improves such subsequent tasks.

Analyst: Yiwen Liao | Localized by Synced Global Team : Xiang Chen
