Have you ever thought of swapping faces with one of your favorite movie characters?
Researchers from Microsoft Research Asia and Shanghai Jiao Tong University have recently published a paper on transferring visual attributes between images using a new technique called Deep Image Analogy.
To demonstrate their technique, they swapped the faces of Mona Lisa and Neytiri using image analogy and deep CNN features. Figure 1 below shows results of visual attribute transfer using Deep Image Analogy:
In the first row of Figure 1, Mona Lisa (by Leonardo da Vinci) and Neytiri (from the sci-fi movie Avatar) managed to swap faces, and the second row shows two panda photos with a style transfer: the panda photo was transformed into a sketch, while the panda previously in the sketch is now seen in a photo.
The authors developed this technique to allow visual attribute transfer between a pair of images that might be “visually different but semantically similar”. According to the authors, visual attributes include color, texture, and style. Two images are considered “semantically similar” if they depict the same type of scene, including objects from the same class. In other words, the transfer of visual attributes is more likely to succeed if the two input images come from the same semantic category. For example, a panda in a photo can swap styles with another panda in a sketch, while the technique might fail when trying to transfer visual attributes between a human and a shark, because the “objects” in the input images do not belong to the same class. In essence, the technique must first find corresponding objects in both images, and then continue with the visual attribute transfer.
One of the major contributions by the authors is that Deep Image Analogy creates semantically meaningful dense correspondences between input images from different domains. Existing methods, by contrast, either rely on low-level features (for example, SIFT Flow or Optical Flow), are domain-specific, or cannot generalize to cross-domain images.
2. Visual Attribute Transfer
In order to transfer visual attributes between images, an essential step is creating dense correspondences between them. The authors were inspired by the idea of image analogy, which involves dense mappings between images from different domains.
Here, an image analogy is defined as A : A’ :: B : B’, where A and A’, as well as B and B’, are in pixel-wise correspondence. In addition, A’ relates to A in the same way as B’ relates to B.
As seen in the first row of Figure 1, A and B’ are input images that are semantically similar, since they are both portraits of a female. The goal is to output A’ and B after a visual attribute transfer.
2.1 Problem Statement
Given a pair of images A and B’ with a similar semantic structure, assuming they have different visual attributes (e.g., style, color or texture), the goal is to find a mapping from A to B’ (or B’ to A), and output two images A’ and B after the visual attribute transfer.
Mapping directly from A to B’ is far from easy, so the authors formulated the mapping problem as an image analogy:
A : A’ :: B : B’, where A’ and B are two latent variables with bi-directional constraints, implying that (1) A and A’ (and likewise B and B’) must have the same spatial layout; (2) A and B (and likewise A’ and B’) have similar visual attributes (texture, color, lighting, etc.).
In Figure 2, a mapping from A to B’ is required, and A and B’ both have the same semantic structure: a portrait of a female. In order to avoid the direct and difficult mapping from A to B’ (shown in red), the proposed method divides the mapping from A to B’ into two tractable mappings: (1) A to A’ as an in-place mapping (in yellow), making sure that the nose is “in the right place”; (2) A similar-appearance mapping from A’ to B’ (in blue), where the noses are similar in appearance.
3. Deep Image Analogy
The visual attribute transfer of images is achieved using image analogy and deep CNN features. The authors refer to the entire process as “deep image analogy”. Figure 4 illustrates the pipeline of the deep image analogy system.
Deep CNN features are first computed for the input images A/B’ with a pre-trained CNN (a VGG-19 network trained on the ImageNet database for object recognition), whose features are taken at five levels, and the feature maps of the latent images A’/B are initialized at the coarsest layer. Here, the features of A’/B are unknown, and will be estimated in a coarse-to-fine manner.
3.2 Nearest-neighbor Field (NNF) Search
Originally, PatchMatch is a fast randomized algorithm for computing approximate NNFs between a pair of images, in which good patch matches are found through random sampling. The authors, however, apply PatchMatch in a deep feature domain, so that it provides better correspondences between images and can be incorporated into their latent image reconstruction.
At each layer, both a forward and a reverse NNF are estimated. Basically, an NNF search maps each position in one feature map to its nearest neighbor in another feature map. As shown in Figure 4, such correspondences are created between the feature maps of A (input) and B (latent variable), as well as A’ (latent variable) and B’ (input).
To find the nearest neighbor of a point from one feature map in another feature map, the distance between candidate matches is computed with the energy function shown below.
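To make the search concrete, here is a minimal, dependency-free sketch of the PatchMatch idea in its generic form: random initialization, propagation from already-scanned neighbors, and random search with a shrinking radius. It is not the authors' deep-feature implementation. The function name `patchmatch`, the caller-supplied `dist` callback, and all parameters are assumptions made for illustration.

```python
import random

def patchmatch(h_a, w_a, h_b, w_b, dist, iters=4, seed=0):
    """Approximate NNF: for each (y, x) in A, a position in B with small dist.

    dist((ay, ax), (by, bx)) -> float is supplied by the caller (hypothetical API).
    """
    rng = random.Random(seed)
    # Random initialization of the field.
    nnf = [[(rng.randrange(h_b), rng.randrange(w_b)) for _ in range(w_a)]
           for _ in range(h_a)]
    cost = [[dist((y, x), nnf[y][x]) for x in range(w_a)] for y in range(h_a)]

    def try_match(y, x, cand):
        # Accept the candidate only if it is in bounds and strictly better.
        by, bx = cand
        if 0 <= by < h_b and 0 <= bx < w_b:
            d = dist((y, x), cand)
            if d < cost[y][x]:
                nnf[y][x], cost[y][x] = cand, d

    for it in range(iters):
        # Alternate the scan direction so good matches spread both ways.
        step = 1 if it % 2 == 0 else -1
        ys = range(h_a) if step == 1 else range(h_a - 1, -1, -1)
        xs = range(w_a) if step == 1 else range(w_a - 1, -1, -1)
        for y in ys:
            for x in xs:
                # Propagation: shift the match of the neighbor scanned just before.
                if 0 <= x - step < w_a:
                    ny, nx = nnf[y][x - step]
                    try_match(y, x, (ny, nx + step))
                if 0 <= y - step < h_a:
                    ny, nx = nnf[y - step][x]
                    try_match(y, x, (ny + step, nx))
                # Random search around the current best, halving the radius.
                radius = max(h_b, w_b)
                while radius >= 1:
                    by, bx = nnf[y][x]
                    try_match(y, x, (by + rng.randint(-radius, radius),
                                     bx + rng.randint(-radius, radius)))
                    radius //= 2
    return nnf
```

Because a candidate is accepted only when it lowers the cost, the total matching cost can never increase from one iteration to the next. Calling the function twice with swapped roles for the two maps would give a forward and a reverse field.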
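As a rough illustration of such an energy, the sketch below assumes that the distance between a position p (in A/A’) and a position q (in B/B’) sums squared feature differences over a small patch for both pairs, (A, B) and (A’, B’), which is one way to express the bi-directional constraint. The exact form, any feature normalization, and the names `patch_energy`, `sq_dist` and `radius` are assumptions, not taken from this article.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def patch_energy(FA, FAp, FB, FBp, p, q, radius=1):
    """Bidirectional patch distance between p in A/A' and q in B/B'.

    Each feature map is a list of rows; each cell is a list of channel values.
    Offsets that fall outside either map are skipped (a choice of this sketch).
    """
    h_a, w_a = len(FA), len(FA[0])
    h_b, w_b = len(FB), len(FB[0])
    total = 0.0
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            ay, ax = p[0] + dy, p[1] + dx
            by, bx = q[0] + dy, q[1] + dx
            if 0 <= ay < h_a and 0 <= ax < w_a and 0 <= by < h_b and 0 <= bx < w_b:
                total += sq_dist(FA[ay][ax], FB[by][bx])    # A vs. B term
                total += sq_dist(FAp[ay][ax], FBp[by][bx])  # A' vs. B' term
    return total
```

The nearest neighbor of p is then the q that minimizes this value; a randomized search such as PatchMatch only approximates that minimum.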
3.3 Latent Image Reconstruction
NNFs and feature maps obtained from the NNF search will serve as inputs for reconstructing features of latent images (A’/B) at the next CNN layer.
As illustrated in Figure 6, the reconstruction of a latent image consists of feature map warp at the current layer, followed by de-convolution in the next layer, and then a fusion operation is done to reconstruct the image. Moreover, Figure 6 also shows how we can recover the latent image A’. Basically, the ideal A’ should inherit the content structure from the input A, while exhibiting the corresponding visual content from B’, which can be selected by a weighted mask that constructs a linearly weighted combination of the structure from A and the visual information from B’.
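The warp and fusion steps for A’ can be sketched in a few lines. This simplified version assumes the warped B’ features are already at the resolution of A's features, so the de-convolution step between layers is left out. The names `warp`, `blend` and the per-position weight map `W` are hypothetical.

```python
def warp(F_src, nnf):
    """Rearrange source features so that cell p holds F_src at position nnf[p]."""
    return [[F_src[by][bx] for (by, bx) in row] for row in nnf]

def blend(F_A, R_Bp, W):
    """Per-position linear blend: W * F_A + (1 - W) * R_Bp.

    F_A  : features of input A (content structure)
    R_Bp : warped features of B' (visual details)
    W    : weights in [0, 1], one per spatial position
    """
    out = []
    for row_a, row_r, row_w in zip(F_A, R_Bp, W):
        out.append([[w * a + (1.0 - w) * r for a, r in zip(fa, fr)]
                    for fa, fr, w in zip(row_a, row_r, row_w)])
    return out

# Tiny example: a 1x2 feature map with one channel.
F_A = [[[1.0], [2.0]]]
F_Bp = [[[10.0], [20.0]]]
nnf = [[(0, 1), (0, 0)]]          # A's two cells map to B' cells in swapped order
R_Bp = warp(F_Bp, nnf)            # [[[20.0], [10.0]]]
F_Ap = blend(F_A, R_Bp, [[0.5, 0.5]])
```

A weight close to 1 keeps the structure of A at that position, while a weight close to 0 takes over the appearance of B’.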
3.4 Nearest-neighbor Field Upsampling
The NNFs are computed in a coarse-to-fine manner: at the coarsest layer, the mappings are randomly initialized. As for other layers, the NNFs obtained at each layer will be further upsampled to the next layer, serving as their initialization.
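One simple way to carry a field to a finer layer is nearest-neighbor upsampling, sketched below. It assumes each finer layer has twice the resolution of the coarser one. The interpolation scheme, the function name `upsample_nnf` and its parameters are assumptions for illustration rather than the authors' exact procedure.

```python
def upsample_nnf(coarse, fine_h_a, fine_w_a, fine_h_b, fine_w_b, scale=2):
    """Initialize a fine-layer NNF from a coarse-layer NNF.

    Each fine cell inherits the match of its coarse parent, scaled up and
    shifted by the cell's offset inside the parent, then clamped to B's size.
    """
    ch, cw = len(coarse), len(coarse[0])
    fine = []
    for y in range(fine_h_a):
        row = []
        for x in range(fine_w_a):
            cy, cx = min(y // scale, ch - 1), min(x // scale, cw - 1)
            by, bx = coarse[cy][cx]
            fy = min(by * scale + y % scale, fine_h_b - 1)
            fx = min(bx * scale + x % scale, fine_w_b - 1)
            row.append((fy, fx))
        fine.append(row)
    return fine

# A 1x1 coarse field pointing at B position (1, 0) becomes a 2x2 fine field.
fine = upsample_nnf([[(1, 0)]], 2, 2, 4, 4)
```

The upsampled field only serves as a starting point; the search at the finer layer is still free to move each match.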
Figure 8 shows how the mappings between A and B’ are gradually refined from coarse to fine. The proposed deep image analogy method achieves better matching results (middle rows) than matching each layer independently (bottom rows).
The first rows demonstrate how the mappings from A to B’ are done in a hierarchical way, while the remaining rows illustrate the mappings from B’ to A using the same method.
At each layer, the three steps of NNF search, latent image reconstruction, and NNF upsampling are repeated, refining deep correspondences between images from coarse to fine.
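The overall control flow can be summarized as a short skeleton. The three steps are passed in as callables, so this block shows only the order of operations, not how each step works. The names `coarse_to_fine`, `nnf_search`, `reconstruct` and `upsample` are placeholders, and the choice to skip reconstruction and upsampling at the finest layer is an assumption of this sketch.

```python
def coarse_to_fine(num_layers, nnf_search, reconstruct, upsample):
    """Run the three steps from the coarsest layer (num_layers) down to layer 1.

    nnf_search(layer, init)  -> NNFs for this layer (init is None at the coarsest layer)
    reconstruct(layer, nnfs) -> estimates latent features for the next finer layer
    upsample(layer, nnfs)    -> initialization for the next finer layer's search
    """
    init = None   # nothing to upsample yet: the coarsest layer starts from random matches
    nnfs = {}
    for layer in range(num_layers, 0, -1):
        nnfs[layer] = nnf_search(layer, init)      # step 1: NNF search
        if layer > 1:
            reconstruct(layer, nnfs[layer])        # step 2: latent image reconstruction
            init = upsample(layer, nnfs[layer])    # step 3: NNF upsampling
    return nnfs
```

With `num_layers=5` the loop visits layers 5, 4, 3, 2 and 1, and each search after the first starts from the upsampled result of the layer above it.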
After extracting the NNFs at the lowest layer, the latent image can be reconstructed by patch aggregation in the pixel layer of the image. As for the latent image A’ (structure from A and visual content from B’), the aggregation will be performed on the extracted NNFs in B’.
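Patch aggregation can be illustrated on a grayscale image. For every target pixel, the sketch copies the source patch around its matched position and averages all patches that overlap a given pixel. The function name `aggregate_patches`, the patch `radius`, and the plain averaging are assumptions of this sketch.

```python
def aggregate_patches(src, nnf, radius=1):
    """Reconstruct an image by averaging source patches placed at matched positions.

    src : source image (e.g. B') as a list of rows of floats
    nnf : for each target pixel (y, x), the matched source position (sy, sx)
    """
    h, w = len(nnf), len(nnf[0])
    sh, sw = len(src), len(src[0])
    acc = [[0.0] * w for _ in range(h)]   # running sum of contributions
    cnt = [[0] * w for _ in range(h)]     # number of contributions per pixel
    for y in range(h):
        for x in range(w):
            sy, sx = nnf[y][x]
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    ty, tx = y + dy, x + dx      # target pixel covered by this patch
                    py, px = sy + dy, sx + dx    # corresponding source pixel
                    if 0 <= ty < h and 0 <= tx < w and 0 <= py < sh and 0 <= px < sw:
                        acc[ty][tx] += src[py][px]
                        cnt[ty][tx] += 1
    return [[acc[y][x] / cnt[y][x] if cnt[y][x] else 0.0 for x in range(w)]
            for y in range(h)]
```

With an identity field the output equals the source image, which is a quick sanity check for the indexing.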
The pseudocode for deep image analogy is listed in Algorithm 1.
The authors show results of applying their deep image analogy approach to four different tasks in visual attribute transfer: photo-to-style, style-to-style, style-to-photo and photo-to-photo.
Transferring visual attributes between photos and styled artworks allows users to transfer styles between images. For example, a photo of a male portrait can “borrow” the style of a sketch of another individual, yielding a sketch of the original portrait. The photo-to-style transfer results are shown in Figure 13.
Figure 16 provides impressive transfer results using deep image analogy. As seen from the results, a photo can transform into an oil painting (or the other way around).
Style-to-photo transfer can be considered the inverse problem of photo-to-style, but is in fact more difficult. This is because artworks tend to have fewer details than photos, trading detail for creative expression. Results of turning artworks into photos are shown in Figure 17.
Photo-to-photo transfer applies only to the color and tone attributes of images.
A very creative application of deep image analogy is generating time-lapse sequences using reference images of another semantically related scene, as shown in Figure 20. Although the two images show different scenes, semantic correspondences can still be identified, such as tree-to-tree and mountain-to-mountain.
5. Concluding Remarks
The authors introduced a new technique called Deep Image Analogy for transferring visual attributes between semantically similar images. Several transfer results demonstrate the applicability of this technique to various tasks, including photo-to-style and style-to-style transfer.
In addition, the authors provided examples of failure cases, illustrated in Figure 22, where the technique does not achieve satisfactory transfer results, for instance when there are large variations in scale or viewpoint.
According to the authors, several improvements can be made, such as relaxing the assumption that the transfer should maximally preserve the content structure, or pre-training the CNN model on a domain-specific dataset.
HERTZMANN, A., JACOBS, C. E., OLIVER, N., CURLESS, B., AND SALESIN, D. H. 2001. Image analogies. In Proc. ACM SIGGRAPH.
SIMONYAN, K., AND ZISSERMAN, A. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
RUSSAKOVSKY, O., DENG, J., SU, H., KRAUSE, J., SATHEESH, S., MA, S., HUANG, Z., KARPATHY, A., KHOSLA, A., BERNSTEIN, M., ET AL. 2015. ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115, 3, 211–252.
BARNES, C., SHECHTMAN, E., FINKELSTEIN, A., AND GOLDMAN, D. B. 2009. PatchMatch: A randomized correspondence algorithm for structural image editing. ACM Trans. Graph. (Proc. of SIGGRAPH) 28, 3.
Author: Olli Huang | Localized by Synced Global Team: Junpei Zhong