Is there a way to efficiently build image-specific inductive biases into models while retaining all the flexibility of transformers? “Yes,” say researchers from Germany’s Heidelberg University. In a new paper, the team proposes a novel approach that combines the effectiveness of the inductive bias in convolutional neural networks (CNNs) with the expressivity of transformers to model and synthesize high-resolution images.
Synced recently showcased ten NeurIPS 2020 papers illustrating trends in transformers — ranging from extended use of the neural architecture to innovative advancements in technique, architectural design changes and more — in NeurIPS 2020 | Teaching Transformers New Tricks. Transformer architectures excel in tasks with sequential data; they have revolutionized natural language processing and have recently also been applied to reinforcement learning, computer vision and symbolic mathematics.
A transformer trade-off, however, is that they contain no inductive bias to prioritize local interactions, which makes them computationally infeasible for long sequences such as high-resolution images. The Heidelberg University team’s proposed method addresses this by first using CNNs to learn a context-rich vocabulary of image constituents, then utilizing transformers to efficiently model their composition within the images. The team explains that rather than representing images with pixels, the proposed approach represents them as a composition of perceptually rich image constituents drawn from a codebook of context-rich visual parts.
This significantly reduces the description length of compositions, enabling researchers to efficiently model the global interrelations within images using a transformer architecture. The resulting model generates realistic, consistent high-resolution images in both unconditional and conditional settings.
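The two-stage idea described above can be sketched in a few lines. The sketch below is purely illustrative, not the paper’s actual VQGAN code: the codebook, random “patch features” (standing in for a CNN encoder’s output), and the `quantize` helper are all hypothetical stand-ins used to show how an image becomes a short sequence of discrete tokens.

```python
import numpy as np

rng = np.random.default_rng(0)

K, D = 16, 8                        # codebook size, embedding dimension
codebook = rng.normal(size=(K, D))  # learned "visual vocabulary" (stage 1)

def quantize(features):
    """Map each feature vector to the index of its nearest codebook entry."""
    # features: (N, D) -> squared distances to all K entries: (N, K)
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)     # discrete token indices: (N,)

# An 8x8 grid of patch features stands in for the CNN encoder's output.
features = rng.normal(size=(64, D))
indices = quantize(features)        # the image as a 64-token sequence

# Stage 2 (not shown): an autoregressive transformer models the distribution
# over these short index sequences, which is far cheaper than modeling the
# global interactions among raw pixels directly.
print(indices.shape)
```

A 256x256 image would otherwise be a sequence of tens of thousands of pixels; compressing it to a small grid of codebook indices is what makes transformer-based modeling of global structure tractable.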
In evaluations, the proposed method outperformed SOTA codebook-based approaches based on convolutional architectures. The team says the CNN and transformer combination “taps into the full potential of their complementary strengths” and represents the first high-resolution image synthesis results with a transformer-based architecture.
The paper Taming Transformers for High-Resolution Image Synthesis is on arXiv. This project is also on GitHub.
Analyst: Yuqing Li | Editor: Michael Sarazen; Fangyu Cai