Title
Large-time dynamics in transformer architectures with layer normalisation
Speaker
Yury Korolev - University of Bath
Abstract
This is joint work with Martin Burger, Samira Kabri, Tim Roith, and Lukas Weigand (DESY Hamburg).
Transformers have become the backbone of many modern AI systems. A series of recent works has demonstrated that they can be understood mathematically as transformations of measures; hence they inherently have an infinite-dimensional domain and range. We focus on the special case in which the propagation of a measure through the transformer follows a gradient flow in the space of probability measures on the unit sphere under a variant of the Wasserstein metric with a non-local mobility term. This allows us to investigate the emergence of either clusters or absolutely continuous measures in the large-time limit and to characterise them as stationary points of an interaction energy. We further investigate how the stationary points depend on the parameters of the transformer, in particular on the eigenvalues and eigenvectors of the product of the key and query matrices. The rigorous framework for studying the gradient flow that we provide also suggests a possible metric geometry for studying the general case (i.e. one that is not described by a gradient flow).
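
For orientation (this is not part of the abstract): in the related mean-field literature on self-attention, under the simplifying assumptions that A = K^T Q is symmetric and the value matrix is the identity, the interaction energy whose stationary points are of interest typically takes the form

    E[mu] = -(1 / (2 beta)) \int_{S^{d-1}} \int_{S^{d-1}} exp(beta <x, A y>) dmu(x) dmu(y),

with beta > 0 an inverse-temperature parameter; a gradient flow of such an energy under a Wasserstein-type metric with non-local mobility is the regime the abstract refers to. The following minimal Python sketch simulates the corresponding particle dynamics on the circle; the choices A = V = identity, beta = 4, and explicit Euler time-stepping are purely illustrative assumptions, not taken from the talk. With these choices the tokens typically collapse towards a single point, the simplest instance of the clustering behaviour mentioned above.

    import numpy as np

    rng = np.random.default_rng(0)
    n, beta, dt, steps = 32, 4.0, 0.05, 2000

    # tokens as unit vectors on the circle S^1 (layer normalisation)
    x = rng.normal(size=(n, 2))
    x /= np.linalg.norm(x, axis=1, keepdims=True)

    for _ in range(steps):
        logits = beta * (x @ x.T)               # <x_i, A x_j> with A = identity
        w = np.exp(logits - logits.max(axis=1, keepdims=True))
        w /= w.sum(axis=1, keepdims=True)       # row-wise softmax: attention weights
        v = w @ x                               # attention output with V = identity
        v -= np.sum(v * x, axis=1, keepdims=True) * x  # project onto tangent space
        x = x + dt * v                          # explicit Euler step
        x /= np.linalg.norm(x, axis=1, keepdims=True)  # renormalise (layer norm)

    # maximal pairwise distance; a value near 0 indicates collapse to one cluster
    print(np.max(np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)))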
Bio
Yury Korolev is an Assistant Professor at the Department of Mathematical Sciences of the University of Bath. Prior to that, he worked at the universities of Cambridge, Münster, and Lübeck. His research interests are the mathematics of machine learning, inverse problems, imaging science, and non-smooth calculus of variations. He has worked on the approximation theory of neural operators, infinite-depth or -width limits of neural networks, L-infinity variational problems, regularisation theory, and applications in biomedical imaging.
When
Wednesday, October 29th, 16:00
Where
Room 322, UniGe DIBRIS/DIMA, Via Dodecaneso 35