Since their introduction in 2017, transformers have become the go-to machine learning architecture for natural language processing (NLP) and computer vision. Although they have achieved state-of-the-art performance in these fields, the theoretical framework underlying transformers remains relatively underexplored.
In the new paper A Probabilistic Interpretation of Transformers, ML Collective researcher Alexander Shim provides a probabilistic explanation of transformers’ exponential dot product attention and contrastive learning based on distributions of the exponential family.
An oft-proposed explanation for transformers’ power and performance is their attention mechanisms’ superior ability to model dependencies in long input sequences. But this doesn’t directly address how and why transformer architecture choices such as exponential dot product attention outperform the alternatives.
To address this question, Shim develops a probabilistic analysis based on distributions of the exponential family, one that favours statistical sampling and Sequential Monte Carlo over hybrid distributions. The study offers insights into attention and contrastive probabilities, along with a deeper understanding of transformer architectures.
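For readers less familiar with the mechanism under discussion, exponential dot product attention can be sketched in a few lines. This is a minimal illustration of the standard softmax attention formulation, not code from Shim's paper. The function name `exp_dot_product_attention` is hypothetical, and the 1/sqrt(d) scaling is an assumption borrowed from the usual transformer convention.

```python
import math


def exp_dot_product_attention(queries, keys, values):
    """Illustrative sketch: softmax (exponential dot product) attention.

    Inputs are plain lists of equal-length float vectors. The name and the
    1/sqrt(d) scaling are assumptions, not taken from the paper.
    """
    d = len(keys[0])
    outputs = []
    for q in queries:
        # Scaled dot product between the query and every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        # Subtract the max score for numerical stability; the weights are unchanged.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        # Exponentiated scores, normalised to sum to 1, act as a probability
        # distribution over the keys.
        weights = [e / total for e in exps]
        # The output is the expectation of the values under that distribution.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs


if __name__ == "__main__":
    keys = [[1.0, 0.0], [0.0, 1.0]]
    values = [[2.0], [4.0]]
    print(exp_dot_product_attention([[1.0, 0.0]], keys, values))
```

Because the normalised exponentials sum to one, each output is a weighted average of the value vectors, which is what makes a probabilistic reading of attention natural.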
Overall, this work presents a detailed probabilistic interpretation of transformer architectures along with proofs for attention updates over several continuous distributions, laying the foundation for a theoretical framework for transformer architectures.
Shim suggests future research in this area could sample from an initial distribution to determine how distributions change with each layer, and test various contractive mappings to see if they generate substantially different embeddings and layer behaviour.
The ML Collective is an independent, nonprofit organization that aims to make research opportunities accessible and free by supporting open collaboration in machine learning research.
The paper A Probabilistic Interpretation of Transformers was accepted by the International Conference on Machine Learning (ICML 2021) and is on arXiv.
Author: Hecate He | Editor: Michael Sarazen