NVIDIA has opened a fun online AI platform that can swap pet faces onto other animals. Simply upload a photo of your Spot or Sylvester, draw a rectangle around its head, click on “Translate” and voila! The AI will generate images of Russian Wolfhounds, French Bulldogs, and even American Black Bears matching the style and pose of your precious pet.
The tech behind the effect is NVIDIA’s new FUNIT, an image translation AI algorithm which has demonstrated compelling performance in transforming images from one domain to another using relatively small image datasets. Image-to-image translation is an increasingly popular machine learning research area that promises a wide range of applications in style transfer, object transfiguration, and photo enhancement.
When looking at images of an unfamiliar animal, humans can use their imagination and/or prior knowledge to guess how the animal might appear in different poses. For example, on seeing an image of a gazelle for the first time, a human might liken it to a deer and imagine how it would sit, stand, or run. Current machine learning techniques lack this human imagination-transfer ability, and so require large-scale training datasets covering all classes of animals.
UC Berkeley’s 2017 model CycleGAN (Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks) was trained on 939 images from the “wild horse” class and 1177 images from the “zebra” class of the ImageNet dataset in order to achieve effective horse-zebra translation. NVIDIA researchers sought to reduce the quantity of required data with this new method, which draws inspiration from the human capacity for generalization.
The result was the few-shot unsupervised image-to-image translation framework FUNIT. Researchers trained FUNIT with an image dataset of various animal species, then introduced images from object classes the model had not seen during training. The goal was to translate an input image from a seen source class into one that resembles images in the novel class, given only a handful of example images of that class.
Based on generative adversarial networks (GANs), the FUNIT framework comprises a few-shot image translator, which consists of a content encoder, a class encoder, and a decoder; and a multi-task adversarial discriminator. The translator and discriminator challenge each other to optimize their weights until the generated images are indistinguishable from real ones.
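To make the data flow concrete, here is a minimal toy sketch, not NVIDIA's implementation: the content encoder extracts pose/layout from the source image, the class encoder averages codes over the K few-shot examples of the target class, and the decoder combines the two. All dimensions and the use of random linear maps in place of learned convolutional networks are assumptions for illustration; the real decoder injects the class code via AdaIN-style conditioning.

```python
import numpy as np

rng = np.random.default_rng(0)

D_IMG, D_CONTENT, D_CLASS = 64, 16, 8  # toy dimensions (assumed)

# Toy "networks": random linear maps standing in for learned conv nets.
W_content = rng.standard_normal((D_IMG, D_CONTENT))
W_class = rng.standard_normal((D_IMG, D_CLASS))
W_decode = rng.standard_normal((D_CONTENT + D_CLASS, D_IMG))

def content_encoder(x):
    """Extract class-invariant content (pose, layout) from the input image."""
    return x @ W_content

def class_encoder(class_images):
    """Encode each few-shot example of the target class, then average the
    codes; this averaging is what lets the model condition on K examples."""
    codes = class_images @ W_class   # (K, D_CLASS)
    return codes.mean(axis=0)        # (D_CLASS,)

def decoder(content_code, class_code):
    """Combine content and class codes into a translated image. The real
    decoder injects the class code via AdaIN; here we simply concatenate."""
    return np.concatenate([content_code, class_code]) @ W_decode

# One translation: a source image plus K=5 images of an unseen target class.
source_image = rng.standard_normal(D_IMG)
target_class_images = rng.standard_normal((5, D_IMG))

translated = decoder(content_encoder(source_image),
                     class_encoder(target_class_images))
print(translated.shape)  # (64,): one toy "image" in the target class's style
```

During adversarial training, a discriminator would score `translated` against real images of the target class, pushing the translator to produce outputs indistinguishable from them.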
Researchers added birds, flowers, and food to the scope of the image translation training dataset. Results showed the FUNIT framework outperforms baselines for few-shot unsupervised image-to-image translation on both the Animal Faces and North American Birds datasets, and can successfully translate images from source classes into analogous images of novel classes.
Researchers found, however, that the model fails to generate realistic images when the appearance of a new object class is dramatically different from the training set. That means it can't, for example, translate a kitten into a pizza pie.
The paper Few-Shot Unsupervised Image-to-Image Translation is on arXiv.
Journalist: Tony Peng | Editor: Michael Sarazen