Generative Adversarial Networks (GANs) have enabled end-to-end trainable, photorealistic text-to-image generation. Researchers have also developed methods that increase user control over the process, such as dialogue-based methods that let users specify the relative positions of objects in a generated scene. However, the language that can be used in these processes is restricted, and the generated images are limited to synthetic 3D visualizations or cartoons.
A team from Google Research has targeted these text-to-image shortcomings with a new system called Tag-Retrieve-Compose-Synthesize (TReCS), which exploits both user text and mouse traces. The method is proposed in the recent paper Text-to-Image Generation Grounded by Fine-Grained User Attention.
Synced invited Dr. Linchao Zhu, a lecturer at the ReLER lab, University of Technology Sydney whose works focus on video representation learning, to share his thoughts on the paper Text-to-Image Generation Grounded by Fine-Grained User Attention.
How would you describe the TReCS system?
The TReCS system is a new text-to-image generation framework. It leverages controllable mouse traces as fine-grained visual grounding to generate high-quality images from user narratives. One highlight of the system is its alignment of mouse traces with text descriptions to produce visual labels for phrases. Another key component is the introduction of intermediate semantic masks for better text-object alignment. These semantic masks are retrieved from an external dataset, and the retrieved masks are composited to form a complete semantic mask. Finally, the composed mask is translated into a realistic image.
The system outperforms state-of-the-art text-to-image generation methods under both automatic and human evaluations. It demonstrates the feasibility of producing realistic and controllable images from complicated narratives.
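The four-stage pipeline described above (tag, retrieve, compose, synthesize) can be sketched in miniature. Everything below is an illustrative stand-in, not the authors' implementation: the keyword vocabulary, the mask bank, the colour palette, and every function body are invented toy logic that only mirrors the flow of data between stages.

```python
# Toy sketch of a TReCS-style pipeline: tag phrases, retrieve masks,
# compose them on a canvas guided by mouse traces, then synthesize an image.
from dataclasses import dataclass
from typing import List, Tuple

Trace = List[Tuple[float, float]]  # mouse-trace points in [0, 1] x [0, 1]

@dataclass
class Phrase:
    text: str
    trace: Trace  # the trace segment aligned with this phrase

def tag(phrases):
    """Stage 1: predict a visual label for each phrase (toy keyword match)."""
    vocab = {"dog": "dog", "grass": "grass", "sky": "sky"}
    return [
        next((v for k, v in vocab.items() if k in p.text.lower()), "background")
        for p in phrases
    ]

def retrieve(label, mask_bank):
    """Stage 2: retrieve a segmentation mask (here, cell offsets) for a label."""
    return mask_bank.get(label, mask_bank["background"])

def compose(labels, phrases, mask_bank, canvas_size=8):
    """Stage 3: place retrieved masks on a canvas at each trace's centroid."""
    canvas = [["background"] * canvas_size for _ in range(canvas_size)]
    for label, phrase in zip(labels, phrases):
        if not phrase.trace:
            continue
        cx = sum(x for x, _ in phrase.trace) / len(phrase.trace)
        cy = sum(y for _, y in phrase.trace) / len(phrase.trace)
        col = min(int(cx * canvas_size), canvas_size - 1)
        row = min(int(cy * canvas_size), canvas_size - 1)
        for dr, dc in retrieve(label, mask_bank):
            r, c = row + dr, col + dc
            if 0 <= r < canvas_size and 0 <= c < canvas_size:
                canvas[r][c] = label
    return canvas

def synthesize(canvas):
    """Stage 4: mask-to-image generation, reduced to a per-label colour fill."""
    palette = {"dog": "brown", "grass": "green", "sky": "blue",
               "background": "grey"}
    return [[palette[cell] for cell in row] for row in canvas]

phrases = [
    Phrase("a dog", trace=[(0.5, 0.6), (0.55, 0.65)]),
    Phrase("on the grass", trace=[(0.5, 0.9)]),
]
mask_bank = {"dog": [(0, 0), (0, 1)], "grass": [(0, 0)], "background": [(0, 0)]}
image = synthesize(compose(tag(phrases), phrases, mask_bank))
```

The real system replaces each stub with a learned or retrieval-based component, but the data flow is the same: trace-aligned phrase labels select masks, the masks fill a semantic canvas, and a mask-to-image generator produces the final photo.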
Why does this research matter?
The TReCS system tackles a very challenging text-to-image generation problem where the text descriptions are long and complicated narratives. The proposed system proves that mouse traces can serve as a useful grounding source for realistic text-to-image generation (other sources include object bounding boxes, scene graphs, and dialogue-based interactions). The system is evaluated on noisy narratives transcribed from everyday speech, which demonstrates the practicality of state-of-the-art text-to-image generation systems in modelling real-world free-form data.
What are potential impacts of this research?
The proposed idea could potentially offer a user-friendly human-machine interface in many industries. It could help artists create prototypes and draw insights from machine-generated images. It may also be used to generate creative and artistic content for social platforms.
What are some bottlenecks related to this research?
There is increasing interest in generating realistic images from text data. One of the bottlenecks is the lack of suitable evaluation metrics to quantitatively measure the quality of the generated images. The existing metrics may not well reflect the semantic similarity between the ground-truth image and the machine-generated one. Besides, the model's generalization capabilities are less studied on out-of-distribution text-image pairs.
What is the potential future in this field?
In the future, larger datasets might be used to enable the creation of more diverse and realistic images. Another potential trend might be designing a human-in-the-loop evaluation system to interactively optimize the network.
The paper Text-to-Image Generation Grounded by Fine-Grained User Attention is on arXiv.
About Dr. Linchao Zhu
Dr. Linchao Zhu is a lecturer at the ReLER lab, University of Technology Sydney. His research interests include video representation learning, unsupervised learning, self-supervised learning, few-shot learning, and transfer learning. He has published papers in TPAMI, IJCV, CVPR, ICCV, ECCV, and AAAI. He has participated in multiple international competitions and ranked 1st in THUMOS 2015, TRECVID LOC 2016, and EPIC-Kitchens 2019.
Synced Insight Partner Program
The Synced Insight Partner Program is an invitation-only program that brings together influential organizations, companies, academic experts and industry leaders to share professional experiences and insights through interviews, public speaking engagements, and more. Synced invites all industry experts, professionals, analysts, and others working in AI technologies and machine learning to participate.
Simply apply for the Synced Insight Partner Program and let us know about yourself and your focus in AI. We will respond once your application has been reviewed.