This paper has been accepted by CVPR 2017.
Paper Source 1: https://arxiv.org/pdf/1704.05838.pdf
Paper Source 2: https://sites.google.com/site/yijunlimaverick/facecompletion
This paper proposes a deep generative model for face completion, which can directly generate facial components for the missing regions of a face image, as shown in the following figure.
Unlike many prior works, the authors use two discriminators at the same time to build the model for the face completion task, so that both the synthesized local patch and the entire image look more realistic.
2.1 Model Architecture
As the figure above shows, the entire model consists of a generator, two discriminators, and a semantic parsing network.
Generator in this project is an autoencoder whose encoder is based on VGG-19 (from “conv1” to “pool3”). On top of these layers, the authors stack two more convolution layers and a pooling layer, followed by a fully-connected layer. The decoder is symmetric to the encoder, with unpooling layers.
Local Discriminator is used to determine whether the synthesized image patch in the missing region is real or not.
Global Discriminator aims at determining the faithfulness of the entire image. The two discriminators have architectures similar to the one in the paper “Unsupervised representation learning with deep convolutional generative adversarial networks”.
Semantic Parsing Network is based on the paper “Object contour detection with a fully convolutional encoder-decoder network” and is used to refine the images generated by the GAN above. Because this kind of network can extract high-level features of an image, the generated image patches (facial components) have more natural shapes and sizes.
2.2 Loss Function
The reconstruction loss L_r in the generator is used to compute the L_2 distance between the generator output and the original image.
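As a minimal sketch of the reconstruction term, the snippet below computes a mean squared L_2 distance between two flattened images. The function name `l2_reconstruction_loss`, the flat-list image representation, and the choice of averaging over pixels are assumptions made for illustration; the paper may normalize the distance differently.

```python
def l2_reconstruction_loss(output, target):
    """Mean squared difference between two flattened images of equal length.

    Hypothetical helper: averaging over pixels is an assumption of this sketch.
    """
    if len(output) != len(target):
        raise ValueError("output and target must have the same number of pixels")
    squared_errors = [(o - t) ** 2 for o, t in zip(output, target)]
    return sum(squared_errors) / len(squared_errors)


# Toy usage: one of two pixels is off by 1.0.
loss = l2_reconstruction_loss([0.0, 1.0], [0.0, 0.0])
```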
The two discriminators share the same definition of the loss function L_ai which is commonly used in GANs as shown in equation 1.
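For reference, the adversarial loss commonly used in GANs is usually written as the minimax objective below. This is the standard formulation, given here under the assumption that the paper's equation 1 follows it; the symbols p_data (data distribution) and p_z (noise distribution) are the conventional ones and may differ from the paper's notation.

```latex
L_{a_i} = \min_{G} \max_{D} \;
  \mathbb{E}_{x \sim p_{\mathrm{data}}(x)} \left[ \log D(x) \right]
  + \mathbb{E}_{z \sim p_{z}(z)} \left[ \log \left( 1 - D(G(z)) \right) \right]
```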
The difference between the loss functions of the two discriminators is that the local discriminator (L_a1) only back-propagates loss gradients for the missing region, while the global discriminator (L_a2) back-propagates loss gradients over the entire image.
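One way to picture this difference is as a binary mask applied to the gradient map. The sketch below is illustrative only: `mask_gradient`, the 2x2 toy gradient, and the convention that 1 marks a missing pixel are assumptions, not the authors' implementation.

```python
def mask_gradient(grad, mask):
    """Keep gradient entries where mask is 1 (missing region) and zero the rest.

    Hypothetical helper; grad and mask are equally sized 2D lists.
    """
    return [
        [g * m for g, m in zip(grad_row, mask_row)]
        for grad_row, mask_row in zip(grad, mask)
    ]


# Toy 2x2 gradient; the right column is the "missing" region.
grad = [[0.5, -0.2], [0.1, 0.3]]
mask = [[0, 1], [0, 1]]

local_grad = mask_gradient(grad, mask)  # local discriminator: masked region only
global_grad = grad                      # global discriminator: entire image
```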
L_p in the parsing network is the pixel-wise softmax loss, which is also commonly used in many other classification neural networks.
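A pixel-wise softmax loss can be sketched as the cross-entropy of a softmax over class scores, averaged over pixels. The helper names, the list-of-logits layout, and the averaging are assumptions for illustration; in the paper the classes would be the semantic parsing labels.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one pixel's class scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pixelwise_softmax_loss(logits_per_pixel, labels):
    """Mean cross-entropy over pixels (hypothetical helper).

    logits_per_pixel: one list of class scores per pixel.
    labels: the ground-truth class index for each pixel.
    """
    total = 0.0
    for logits, label in zip(logits_per_pixel, labels):
        total += -math.log(softmax(logits)[label])
    return total / len(labels)


# Two pixels, two classes: the first pixel is confidently correct,
# the second is undecided between the classes.
loss = pixelwise_softmax_loss([[4.0, 0.0], [0.0, 0.0]], [0, 1])
```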
From above, the entire loss function is defined as follows:
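A weighted-sum form consistent with the four terms described above is given below. The balancing weights \lambda_1, \lambda_2, \lambda_3 are symbols assumed for this sketch, and no values are implied for them.

```latex
L = L_r + \lambda_1 L_{a_1} + \lambda_2 L_{a_2} + \lambda_3 L_p
```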
The authors divide the training procedure into three stages. First, they train the network with only L_r to reconstruct images. Then, they fine-tune the network with the local adversarial loss. In the last stage, they add the global adversarial loss and semantic regularization to obtain the final result.
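The three stages can be sketched as a schedule that switches loss terms on one stage at a time. Everything in the snippet is hypothetical: the function names, the placeholder weights `lam1`, `lam2`, `lam3` (set to 1.0 only so the example runs), and the assumption that each stage keeps the terms of the previous one.

```python
def stage_weights(stage, lam1=1.0, lam2=1.0, lam3=1.0):
    """Return the weight of each loss term for training stage 1, 2 or 3.

    Assumes stages are cumulative; the lam* defaults are placeholders,
    not values from the paper.
    """
    if stage == 1:  # reconstruction only
        return {"L_r": 1.0, "L_a1": 0.0, "L_a2": 0.0, "L_p": 0.0}
    if stage == 2:  # add the local adversarial loss
        return {"L_r": 1.0, "L_a1": lam1, "L_a2": 0.0, "L_p": 0.0}
    if stage == 3:  # add the global adversarial loss and semantic regularization
        return {"L_r": 1.0, "L_a1": lam1, "L_a2": lam2, "L_p": lam3}
    raise ValueError("stage must be 1, 2 or 3")


def total_loss(losses, weights):
    """Weighted sum of the individual loss values."""
    return sum(weights[name] * losses[name] for name in weights)


# Toy loss values, only to show which terms contribute at each stage.
losses = {"L_r": 2.0, "L_a1": 3.0, "L_a2": 5.0, "L_p": 7.0}
per_stage = [total_loss(losses, stage_weights(s)) for s in (1, 2, 3)]
```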
3. Experiment Result
This generative face completion algorithm produces visually convincing results, as shown in the first figure of this review. Figure 7 shows that the model is robust to different kinds of masks, which brings it close to real-world use: whatever the mask looks like, the network generates a satisfactory result.
The authors also compare the influence of mask size, as shown in Figure 9. They found a local minimum in performance around medium-sized masks, because a mask of this size is most likely to occlude only part of a facial component, which is more difficult for the model to synthesize.
Figure 12 shows the limitations of this generative model. First, although the model includes a semantic parsing network to capture some high-level features during training, it cannot recognize the position or orientation of a face. Hence, it cannot handle some unaligned faces. Second, as mentioned above, the model has more difficulty generating part of a facial component than generating an entire one, because it cannot always capture the spatial correlations between adjacent pixels.
This model, which is based on a GAN with two discriminators (two adversarial loss functions) and a semantic regularization network, can handle face completion tasks. It can successfully synthesize content for missing facial parts from random noise.
5. Thoughts from the Reviewer
This paper proposed a generative model with successful results on face completion tasks. The authors provided both quantitative and qualitative evaluations of their model, so their results are relatively reliable.
Contributions of this paper:
- They provide a good new way to design GAN models: use several discriminators for different purposes at the same time. For example, a conventional autoencoder uses the L_2 distance to reconstruct images and thus often outputs overly smooth results. Prior works often use embeddings from a deep classification network to reduce this effect. In this paper, the authors show that using different discriminators can also achieve better, less smooth results.
- The authors divided the training procedure into several stages, which is a good idea for training GANs. It resembles how humans learn: people first learn the outline of a subject (similar to image reconstruction in this project), then learn the details of each chapter step by step (similar to the fine-tuning in the second and third stages of this project).
- The authors also showed that “Peak Signal-to-Noise Ratio” (PSNR) and “Structural Similarity Index” (SSIM) are not sufficient to evaluate reconstruction or generation results, because these two metrics favor smooth and blurry results. As shown in Figure 3, Table 1 and Table 2, sub-figure M1 has higher SSIM and PSNR than M2 and M3, yet M2 and M3 clearly have more semantically valid generation results.
- This paper showed that a semantic parsing network can place additional (semantic) constraints on the random noise of a GAN to obtain a more realistic result. Figure 10 further shows that these constraints make the GAN recognize facial components, so the GAN can generate missing parts with similar shape and size from different random noise inputs, differing only in some details, such as the shade of an eyebrow.
Possible improvements:
- One of the limitations of this model is that it cannot handle some unaligned faces. A face-warping network could be added to normalize the input faces.
- The model could be trained on images of other categories (such as buildings or scenery) to judge whether it is robust for other completion tasks.
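For reference, the PSNR metric discussed in the contributions above is a simple function of the mean squared error, which helps explain why it rewards smooth outputs that stay close to the target on average. The sketch below uses the standard definition; the function name, the flat-list image representation, and the default peak value of 255 are assumptions for illustration.

```python
import math


def psnr(reference, test, max_value=255.0):
    """Peak Signal-to-Noise Ratio in dB between two flattened images.

    Uses the standard definition 10 * log10(max_value^2 / MSE).
    """
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# A small uniform error scores higher than one large localized error.
reference = [100.0, 100.0, 100.0, 100.0]
smooth_guess = [102.0, 102.0, 102.0, 102.0]
sharp_guess = [100.0, 100.0, 100.0, 120.0]
smooth_score = psnr(reference, smooth_guess)
sharp_score = psnr(reference, sharp_guess)
```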
Radford, Alec, Luke Metz, and Soumith Chintala. “Unsupervised representation learning with deep convolutional generative adversarial networks.” arXiv preprint arXiv:1511.06434 (2015).
Yang, Jimei, et al. “Object contour detection with a fully convolutional encoder-decoder network.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.
Author: Yiwen Liao | Editor: Junpei Zhong, Xiang Chen