Paper Source: https://arxiv.org/pdf/1701.02676.pdf
This paper proposes a general approach to “Image-to-Image Translation” using deep convolutional, conditional generative adversarial networks (GANs). The authors develop a two-step unsupervised learning method that translates images without specifying any correspondence between them.
What is “Image-to-Image Translation”?
“Image-to-Image Translation” means automatically transforming an image from its original form into some synthetic form (style, partial contents, etc.) while keeping the original structure or semantics. In this paper, the authors focus on translating images from one domain to others, e.g. face swapping and gender transformation.
The entire network architecture is shown in Figure 2:
The learning process is separated into two steps: learning shared features and learning an image encoder.
First step: Learning shared features
As shown on the left-hand side of Figure 2, the authors use an auxiliary classifier GAN (AC-GAN) to learn the globally shared features of images sampled from different domains. These shared features are represented as a latent vector z.
After this step, the generator G can generate corresponding images for different domains by keeping the latent vector fixed and changing the class label.
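The idea of a label-conditioned generator can be sketched with a toy stand-in: a single linear map over the concatenation of a latent vector z and a one-hot domain label. The dimensions and the linear form are illustrative assumptions; the paper's G is a deep convolutional network trained adversarially.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the conditional generator G(z, c).
# Dimensions are assumptions for illustration, not the paper's architecture.
Z_DIM, N_CLASSES, IMG_DIM = 100, 2, 64 * 64 * 3
W = rng.standard_normal((Z_DIM + N_CLASSES, IMG_DIM)) * 0.01

def generate(z, label):
    """Map a latent vector plus a one-hot domain label to a flattened 'image'."""
    c = np.eye(N_CLASSES)[label]                 # one-hot domain/class label
    return np.tanh(np.concatenate([z, c]) @ W)   # tanh keeps pixels in [-1, 1]

z = rng.standard_normal(Z_DIM)
img_domain0 = generate(z, 0)   # same z, rendered with domain label 0
img_domain1 = generate(z, 1)   # same z, domain label 1 -> the "corresponding" image
```

Keeping z fixed while flipping the label is exactly the mechanism the summary describes: the shared latent code carries the content, the label selects the domain.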
Second step: Learning image encoder
The authors introduce a new method to embed images into latent vectors. They apply an image encoder E after the generator G obtained in the first step, and train E by minimizing the mean squared error (MSE) between the input latent vector and the latent vector E recovers, as shown in the middle of Figure 2.
Compared with prior methods, which typically place the generator G after the image encoder E and train E by reconstructing images (minimizing the MSE between the input image and the generated image), this new method not only reconstructs detailed features well but also speeds up training.
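The second step can be illustrated with a minimal linear sketch: freeze a toy "generator", then train an encoder by gradient descent on the latent-space MSE, never comparing pixels. All sizes and the linear models are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
Z_DIM, IMG_DIM = 8, 32                # tiny toy sizes (assumed, not the paper's)

G = rng.standard_normal((Z_DIM, IMG_DIM)) * 0.5   # frozen "generator" from step 1
E = np.zeros((IMG_DIM, Z_DIM))                    # encoder weights to be learned

lr = 0.01
for step in range(2000):
    z = rng.standard_normal((16, Z_DIM))   # sample latent vectors
    x = z @ G                              # generate images (G stays frozen)
    z_hat = x @ E                          # encoder's estimate of the latent code
    grad = x.T @ (z_hat - z) / len(z)      # gradient of the latent-space MSE
    E -= lr * grad                         # update only the encoder

# After training, E approximately inverts G on the latent distribution.
z_test = rng.standard_normal(Z_DIM)
z_rec = (z_test @ G) @ E
```

Note the loss is computed between latent vectors (Z_DIM terms), not between images (IMG_DIM terms), which is the source of the speedup the summary mentions.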
After the two steps above, images can be translated using the trained E and G, as shown in the right column of Figure 2: given an input image X_real to be translated, the trained image encoder E embeds X_real, together with its domain/class label c=1, into a latent vector Z. Then Z, paired with another domain/class label c=2, is fed to the trained generator G to produce the translated image X_fake as the final result.
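The full translation pipeline is then just an encode–decode composition. Below is a hedged sketch with linear stand-ins for the trained networks; the paper additionally conditions E on the source label, which is simplified away here.

```python
import numpy as np

rng = np.random.default_rng(2)
Z_DIM, N_CLASSES, IMG_DIM = 16, 2, 48    # toy sizes (assumed)

# Stand-ins for the trained networks from steps 1 and 2 (random here,
# only to make the pipeline runnable; they are learned in the paper).
W_g = rng.standard_normal((Z_DIM + N_CLASSES, IMG_DIM)) * 0.1
W_e = rng.standard_normal((IMG_DIM, Z_DIM)) * 0.1

def G(z, label):
    c = np.eye(N_CLASSES)[label]
    return np.tanh(np.concatenate([z, c]) @ W_g)

def E(x):
    # Simplification: the paper's encoder also sees the source domain label.
    return x @ W_e

def translate(x_real, target_label):
    z = E(x_real)               # step 2: encoder recovers the latent code
    return G(z, target_label)   # step 1: generator renders the target domain

x_real = rng.standard_normal(IMG_DIM)   # input image from the source domain
x_fake = translate(x_real, 1)           # translated into domain c=1
```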
As shown in Figure 3, the trained network is able to transform the gender of the target person in images. Facial expressions and other facial details are well maintained after the transformation, and the synthetic images have image quality similar to that of the inputs.
As shown in Figure 4, the trained network is also able to swap faces in images extracted from videos. It maintains not only the facial expression to some degree but also the orientation of the face/head, which is useful for face swapping in videos.
The authors propose a two-step learning method for universal unsupervised image-to-image translation. This method can translate images while preserving facial expressions, and can swap faces across different head/face orientations. The method is general enough to support different learning scenarios.
5. Thoughts from the Reviewer
The authors use a two-step unsupervised learning method to translate images, with relatively satisfying results. Instead of reconstructing the entire image, they reconstruct latent vectors to train the image encoder. On the one hand, this speeds up training because the dimensionality of the latent vector is much lower than that of an original image. On the other hand, latent vectors capture the global features of the original image, which are more important than its pixel-wise details.
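To make the speed argument concrete, compare the number of terms in the two losses. The latent dimension of 100 below is an assumption (the paper's 64 × 64 image size is given).

```python
# Rough size comparison behind the speed argument.
img_dims = 64 * 64 * 3     # terms in a pixel-wise MSE over an RGB image
z_dims = 100               # typical GAN latent size (assumed, not from the paper)
ratio = img_dims / z_dims  # latent-space MSE has over 100x fewer terms
```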
- The images used to train this network all have a resolution of 64 × 64 pixels, which is not very high. Hence, it is difficult to tell whether the details of the output images are well generated; for example, in these figures, the eyes and clothes in the input images have clearer edges than those in the outputs.
- The authors argue that this network can learn to reconstruct backgrounds to some degree, but the generated backgrounds look more like noise than “well-reconstructed backgrounds”. Background treated as noise or interference is a common problem in current GANs, and its influence is very hard to reduce.
- The face-swapping results show that this network can maintain the orientation of heads/faces, but it cannot learn and reproduce the corresponding gaze direction. For example, in Figure 4, although all the images of Obama and Hillary have similar head orientations, the gaze directions all differ. For face-swapping tasks, gaze direction is important semantic information that should be preserved after generation, and this network does not perform well on this point.
This paper is mainly based on the following models and research.
The AC-GAN used in this paper is based on the paper CONDITIONAL IMAGE SYNTHESIS WITH AUXILIARY CLASSIFIER GANS (https://arxiv.org/abs/1610.09585). The following figure compares several different GAN variants:
The idea of the “Domain Transfer Network” in this paper comes from the paper UNSUPERVISED CROSS-DOMAIN IMAGE GENERATION (https://arxiv.org/abs/1611.02200), which was accepted to ICLR 2017 as a poster. The main idea is as follows:
Given two related domains S and T, the goal is to learn a generative function G that maps an input sample from domain S to domain T, such that the output of a given function f, which accepts inputs from either domain, remains unchanged.
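This f-constancy idea can be sketched as a loss term: f is held fixed and the generator is penalized when f's response changes under translation. The linear models and dimensions below are illustrative assumptions, not the DTN architecture.

```python
import numpy as np

rng = np.random.default_rng(3)
D_S = D_T = 32          # toy dimensionality shared by both domains (assumed)
F_DIM = 8               # dimensionality of f's feature space (assumed)

F_mat = rng.standard_normal((D_S, F_DIM)) * 0.1   # fixed perceptual function f
W_g = rng.standard_normal((D_S, D_T)) * 0.1       # generator G (to be trained)

def f(x):
    """A fixed function that accepts inputs from either domain S or T."""
    return x @ F_mat

def G(x):
    """Candidate mapping from domain S to domain T."""
    return x @ W_g

x_s = rng.standard_normal((4, D_S))               # a batch of samples from S

# f-constancy loss: f's output should be unchanged by the translation G.
f_constancy = np.mean((f(x_s) - f(G(x_s))) ** 2)
```

Minimizing this term (alongside an adversarial loss on T, which is omitted here) pushes G toward translations that preserve what f measures.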
The architecture of this network is shown as below:
Author: Yiwen Liao |Editor: Hao Wang | Localized by Synced Global Team: Xiang Chen