If you find it hard to imagine what to make with flour, sugar, milk, eggs, vanilla extract and lemon peel; or what sort of dish might develop by combining shrimp, red pepper, red onion and carrots, then have no fear — the new Cook Generative Adversarial Network (CookGAN) can picture the possibilities for you.
Proposed by researchers from Rutgers University and the Samsung AI Center in the UK, CookGAN uses an attention-based ingredients-image association model to condition a generative neural network tasked with synthesizing meal images. The framework enables the model to generate realistic — and even appetizing — meal images from an ingredients list alone.
Previous work on synthesizing images from text generally relies on pretrained text models to extract text features; GANs are then used to generate realistic images conditioned on those features. But such models mainly focus on generating well-structured singular objects with consistent appearance, such as birds and flowers.
Meal images are significantly more complicated, consisting of multiple ingredients whose appearance and spatial qualities are usually further modified by various cooking methods. Unlike previous studies, which generated either low-resolution (e.g. 128×128 pixel) images or images of only certain food types (e.g. pizza), the new model can generate meal images across many food types and ingredients.
The researchers first trained an attention-based association model to match an ingredient list and its corresponding image in a joint latent space, then used the latent representation of the ingredient list to train a GAN to synthesize the meal image conditioned on the list.
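The first stage — matching an ingredient list to its image in a joint latent space — can be sketched roughly as attention-weighted pooling of ingredient embeddings followed by a similarity score against the image embedding. The function names, the learned query vector, and the embedding dimension below are all illustrative stand-ins, not the paper's actual architecture:

```python
import numpy as np

def attention_pool(ingredient_vecs, query):
    # Hypothetical attention step: weight each ingredient embedding by its
    # affinity to a learned query vector, then take the weighted sum.
    scores = ingredient_vecs @ query
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ ingredient_vecs

def cosine_similarity(a, b):
    # Similarity of the pooled ingredient embedding and the image embedding
    # in the shared latent space.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
ingredients = rng.normal(size=(4, 8))  # 4 ingredient embeddings, dim 8 (toy)
query = rng.normal(size=8)             # stand-in for a learned query vector
image_embedding = rng.normal(size=8)   # stand-in for an image encoder output

recipe_embedding = attention_pool(ingredients, query)
score = cosine_similarity(recipe_embedding, image_embedding)
```

In training, a score like this would be pushed up for matching ingredient-image pairs and down for mismatched ones; the resulting `recipe_embedding` is what conditions the GAN in the second stage.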
The researchers drew data from the Recipe1M dataset, which includes about a million recipes with titles, instructions, ingredients, and images. They built a subset of 402,760 images from recipes with at least one ingredient and one instruction but no more than 20 of either. The data was split into 70 percent for training, 15 percent for validation, and 15 percent for testing, using at most five images per recipe.
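A minimal sketch of that data preparation, assuming a recipe-level shuffle-and-slice split (the paper's actual splitting code is not given, so the helpers below are illustrative):

```python
import random

def split_recipes(recipe_ids, seed=0):
    # Hypothetical 70/15/15 split at the recipe level, so no recipe's
    # images leak across train/validation/test.
    ids = list(recipe_ids)
    random.Random(seed).shuffle(ids)
    n = len(ids)
    n_train = int(0.70 * n)
    n_val = int(0.15 * n)
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

def cap_images(images_by_recipe, max_images=5):
    # Keep at most five images per recipe, mirroring the subset described.
    return {rid: imgs[:max_images] for rid, imgs in images_by_recipe.items()}

train, val, test = split_recipes(range(100))
capped = cap_images({"recipe_a": list(range(9)), "recipe_b": [0, 1]})
```

Splitting by recipe rather than by image is the safer choice here, since multiple images of the same dish in both train and test would inflate evaluation scores.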
Computational food analysis has become one of the pivotal areas for the computer vision community in recent years due partly to real-world implications for nutritional health. In the future the researchers say they plan to add recipe instructions and titles for further contextualization, and integrate ingredient amounts so the generated images will better reflect relative ingredient quantities.
The paper CookGAN: Meal Image Synthesis from Ingredients is on arXiv.
Journalist: Yuan Yuan | Editor: Michael Sarazen