The celebrated romance movie trilogy Before Sunrise, Before Sunset, Before Midnight captures a human relationship as it begins, starts over and deepens over almost two decades. Film directors often use daylight conditions as moody metaphors with proven viewer resonance. For computer vision researchers, modelling realistic dawn-to-dusk illumination is a challenging and ongoing image manipulation task.
In a bid to generate high-resolution images showing realistic daytime changes while keeping accurate scene semantics, a team of researchers from Samsung AI Center, National Research University Higher School of Economics, and Skolkovo Institute of Science and Technology have proposed a novel image-to-image translation model, HiDT (High Resolution Daytime Translation).
The team took the modelling of various daytime appearances for given landscape images as their main task. The resulting HiDT model learns from fully unsupervised data and uses an upscaling technique to generate high-resolution images while safeguarding scene semantics. The work is introduced in the paper High-Resolution Daytime Translation Without Domain Labels.
Image-to-image translation involves automatically transforming an image from its original form to synthetic forms (style, partial content, etc.) while maintaining original structures and semantics. Image-to-image translation methods have been successfully used for converting images across domains, e.g. converting a Monet painting to a landscape photo, or turning zebras into horses. So it was a no-brainer for researchers to use the technique to model daytime changes.
Although current state-of-the-art methods such as NVIDIA’s FUNIT (Few-Shot Unsupervised Image-to-Image Translation) model use several images from the target domain as guidance for translation, this few-shot setting still requires domain annotations during training. In this specific task, domains correspond to various times of day and different lighting, but the domain labels are hard to define. Weakly supervised data is available in timelapse videos, but not all such videos are sufficiently high resolution.
One of the authors, Samsung AI Center Researcher Denis Korzhenkov, told Synced that learning on fully unsupervised data without any domain labels promises significant improvements over the state-of-the-art approaches.
Although the HiDT model doesn’t use labels, the researchers did manually label a small part of the 20,000 landscape photos they collected from the Internet. They defined four classes (night, sunset/sunrise, morning/evening, noon) to balance the training data, and used the labels to train the baselines since both FUNIT and DRIT++ pipelines require labels during training. Also, labels were required to calculate metrics for the quantitative evaluation section. The team did not use the labels in any way during the training of the HiDT model.
Korzhenkov explains that while modern image-to-image translation models rely on the decomposition of images into content and style codes, this decomposition can be ambiguous due to a lack of prior knowledge. Previous studies including the state-of-the-art methods used labels to create a correspondence between domain label and style code. For instance, in the FUNIT study, NVIDIA researchers used a discriminator conditioned on the target label to ensure the target style corresponded to the target label.
Similarly, in HiDT’s encoder-decoder architecture, the encoder decomposes an image into style and content, and the decoder generates a new image from both components. The researchers say the adaptive instance normalization (AdaIN)-based architecture can understand the desired meaning of the style code in the training data, while the neural model manages to learn the correct content-style decomposition.
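For readers unfamiliar with AdaIN, the following is a minimal NumPy sketch of the general operation, not the HiDT implementation. It normalizes each channel of a content feature map and then rescales it with per-channel style statistics. The function name `adain`, the tensor shapes, and the choice to pass the style scale and shift in directly are assumptions made for illustration; in a real model these values would be predicted from the style code by learned layers.

```python
import numpy as np

def adain(content_features, style_scale, style_shift, eps=1e-5):
    """Adaptive instance normalization on a single feature map.

    content_features: array of shape (C, H, W) from a content encoder.
    style_scale, style_shift: arrays of shape (C,), assumed here to be
    derived from a style code (hypothetical; passed in directly).
    """
    # Per-channel statistics over the spatial dimensions.
    mean = content_features.mean(axis=(1, 2), keepdims=True)
    std = content_features.std(axis=(1, 2), keepdims=True)
    # Remove the content image's own per-channel statistics.
    normalized = (content_features - mean) / (std + eps)
    # Re-inject the statistics dictated by the style.
    return style_scale[:, None, None] * normalized + style_shift[:, None, None]

# Toy example with random features standing in for encoder output.
rng = np.random.default_rng(0)
content = rng.normal(loc=3.0, scale=2.0, size=(4, 8, 8))
scale = np.array([1.0, 0.5, 2.0, 1.5])
shift = np.array([0.0, 1.0, -1.0, 0.5])
out = adain(content, scale, shift)
```

After the call, each output channel has approximately the mean and standard deviation given by `shift` and `scale`, which is how a style code can change global appearance such as lighting while the spatial layout carried by the content features is preserved.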
The experimental results show the HiDT model successfully learns daytime translation for high-resolution landscape images at a level on par with state-of-the-art baselines that require labels at least at training time. The novel approach can be extended to other domains such as artistic style transfer and presents a potential future research direction for solving image-to-image problems without manual labelling.
The paper High-Resolution Daytime Translation Without Domain Labels is on arXiv, and the code will soon be released on GitHub.
Journalist: Fangyu Cai | Editor: Michael Sarazen