Attention architectures are pushing the frontier in many machine learning (ML) tasks and have become a building block of modern neural networks. Our conceptual and theoretical understanding of their power and inherent limitations, however, remains nascent. Researchers from Microsoft and Université de Montréal set out to capture the essential mathematical properties of attention, proposing a new mathematical framework that uses measure theory and integral operators to model attention and quantify the regularity of attention operations.
Measure theory is the branch of mathematics that generalizes notions such as length, area and volume from classical Euclidean spaces to abstract sets. It underpins the Kolmogorov axioms, which have formed the foundation of probability theory since their introduction in 1933. The recent rapid development of attention mechanisms for deep learning inspired the Microsoft and Université de Montréal researchers to apply measure theory to the mathematical properties of attention in their study On the Regularity of Attention.
The researchers introduce a mathematical framework that leverages measure theory and integral operators to model attention and capture its essential properties. They demonstrate that the attention operation is Lipschitz continuous on compact domains and provide an estimate of the Lipschitz constant. They then extend the results to non-compact domains.
While the training and generalization of ML models are vital to their practical use, a key prerequisite for improved model design is a better understanding of their training and stability. Regularity is one of the basic mathematical properties of attention operations. It describes how “close” the outputs of an attention operation are, expressed in terms of the closeness of the inputs and of the parameters of the attention block, and so in effect measures the degree of continuity or smoothness of the function.
To quantify the regularity of the attention operation, the researchers first formulated attention in terms of measure theory and integral operators, then used this framework to study regularity in terms of Lipschitz continuity, a strong form of continuity for functions. A function is said to be Lipschitz continuous if changing its input by a certain amount never changes its output by more than K times that amount; the constant K is thus a hard bound on how rapidly the function’s output can vary.
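The Lipschitz idea can be probed numerically with a toy sketch. The code below is an illustration under stated assumptions, not the paper's measure-theoretic construction or its estimate of the constant: it uses plain dot-product softmax attention, holds a small random set of keys and values fixed, perturbs only the query, and samples queries from the compact box [-1, 1]^2. All function names (`attend`, `empirical_lipschitz_ratio`) are hypothetical. The largest observed ratio of output distance to input distance is only a lower bound on any valid K.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Dot-product attention of a single query over fixed keys and values."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def empirical_lipschitz_ratio(keys, values, trials=2000, seed=0):
    """Largest observed ||f(q1) - f(q2)|| / ||q1 - q2|| over random query pairs.

    Queries are drawn from the compact box [-1, 1]^d (an assumption of this
    sketch). The result is a lower bound on the true Lipschitz constant.
    """
    rng = random.Random(seed)
    dim = len(keys[0])
    worst = 0.0
    for _ in range(trials):
        q1 = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        q2 = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        gap = distance(q1, q2)
        if gap == 0.0:
            continue  # identical queries carry no information about K
        out_gap = distance(attend(q1, keys, values), attend(q2, keys, values))
        worst = max(worst, out_gap / gap)
    return worst


# Hypothetical toy data: five keys and values in [-1, 1]^2.
_rng = random.Random(1)
KEYS = [[_rng.uniform(-1.0, 1.0) for _ in range(2)] for _ in range(5)]
VALUES = [[_rng.uniform(-1.0, 1.0) for _ in range(2)] for _ in range(5)]

RATIO = empirical_lipschitz_ratio(KEYS, VALUES)
print(f"largest observed output/input distance ratio: {RATIO:.3f}")
```

A finite ratio on sampled pairs does not prove Lipschitz continuity. It only shows what the definition measures; the guarantee itself comes from the analysis in the paper.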
To assess the impact of these regularity results, the researchers explored scenarios such as cross-attention; robustness and token-level perturbations in natural language processing (NLP); and sophisticated extensions to the transformer architecture. The results showed that:
- Within the framework, the resulting representation is Lipschitz continuous with respect to the output semantic space.
- The modelling could potentially be used to derive predictions of the distance between a self-attention network’s contextual embeddings as a function of the context to test this hypothesis.
- The modelling could also potentially be used to design better model components that reduce this “regularity mismatch” for specific perturbations that are highly irregular.
- The results provide sufficient conditions for deep self-attention transformers to be invertible.
- For infinitely-deep attention models, the results shed light on the importance of input injection, which produces a data-dependent fixed point.
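The last point, on input injection, can be illustrated with a toy sketch of a weight-tied "infinitely deep" model. This is an assumption-laden illustration rather than the paper's model: tokens are scalars, the layer is unparameterized softmax self-attention, and the output is scaled by a hypothetical factor of 0.1 so that repeated application settles down. The names `self_attention` and `fixed_point` are invented for this example. With the input re-added at every layer, the iteration reaches a limit that depends on the data but not on the starting state; without injection, this toy model collapses to the same limit for every input.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Self-attention over scalar tokens: each token attends to all tokens."""
    out = []
    for t in tokens:
        weights = softmax([t * s for s in tokens])
        out.append(sum(w * s for w, s in zip(weights, tokens)))
    return out


def fixed_point(inputs, start, scale=0.1, steps=100):
    """Iterate z <- scale * self_attention(z) + inputs.

    Adding `inputs` at every step is the input injection. The small `scale`
    is an assumption of this sketch, chosen so the iteration converges.
    """
    z = list(start)
    for _ in range(steps):
        z = [scale * a + x for a, x in zip(self_attention(z), inputs)]
    return z


def max_gap(a, b):
    """Largest coordinate-wise absolute difference between two states."""
    return max(abs(x - y) for x, y in zip(a, b))


X_A = [0.5, -0.2, 0.9]
X_B = [-0.7, 0.1, 0.3]
ZERO = [0.0, 0.0, 0.0]

# Same input, different starting states: the limits should agree.
Z_A_FROM_ZERO = fixed_point(X_A, ZERO)
Z_A_FROM_ONES = fixed_point(X_A, [1.0, 1.0, 1.0])

# Different input: the limit should move with the data.
Z_B_FROM_ZERO = fixed_point(X_B, ZERO)

# No injection (inputs set to zero): the state shrinks toward zero.
Z_NO_INJECTION = fixed_point(ZERO, X_A)

print("gap between starts, same input:", max_gap(Z_A_FROM_ZERO, Z_A_FROM_ONES))
print("gap between inputs:", max_gap(Z_A_FROM_ZERO, Z_B_FROM_ZERO))
print("largest entry without injection:", max(abs(v) for v in Z_NO_INJECTION))
```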
The paper On the Regularity of Attention is on arXiv.
Author: Hecate He | Editor: Michael Sarazen