Electrical Engineering Systems Seminar
Abstract:
The attention mechanism is a central component of the transformer architecture, which led to the phenomenal success of large language models. However, the theoretical principles underlying the attention mechanism are poorly understood, especially its nonconvex optimization dynamics. In this talk, we explore the softmax-attention model $f(X) = v^\top X^\top \operatorname{softmax}(XWp)$, where $X$ is the tokenized input, $v$ is the value weights, $W$ is the key-query weights, and $p$ is a tunable token/prompt. We prove that running gradient descent on $p$, or equivalently on $W$, converges to a max-margin solution that separates locally-optimal tokens from non-optimal ones. When optimizing $v$ and $p$ simultaneously with logistic loss, we identify conditions under which the regularization paths converge to their respective max-margin solutions, where $v$ separates the input features based on their labels. We also verify our theoretical findings through numerical experiments. These results formalize the role of attention as a token-selection mechanism and lay the groundwork for future research by connecting its dynamics to max-margin SVM.
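The following is a minimal illustrative sketch (not from the talk) of the softmax-attention model $f(X) = v^\top X^\top \operatorname{softmax}(XWp)$ and of running gradient descent on the prompt $p$ with a logistic loss; the dimensions, the random data, the identity choice of $W$, and the single-example loss are all assumptions made for illustration.

```python
# Sketch of f(X) = v^T X^T softmax(X W p) and gradient descent on p.
# All data, dimensions, and hyperparameters here are illustrative assumptions.
import jax
import jax.numpy as jnp

def f(p, X, W, v):
    # Attention weights over the T tokens induced by the tunable prompt p.
    attn = jax.nn.softmax(X @ W @ p)       # shape (T,)
    return v @ (X.T @ attn)                # scalar prediction

def loss(p, X, W, v, y):
    # Logistic loss on a single labeled example (assumption: y in {-1, +1}).
    return jnp.log1p(jnp.exp(-y * f(p, X, W, v)))

T, d = 5, 3                                # tokens, feature dimension (assumed)
X = jax.random.normal(jax.random.PRNGKey(0), (T, d))
W = jnp.eye(d)                             # key-query weights held fixed here
v = jax.random.normal(jax.random.PRNGKey(1), (d,))
y = 1.0
p = jnp.zeros(d)

grad_loss = jax.grad(loss)
for _ in range(200):
    p = p - 0.1 * grad_loss(p, X, W, v, y) # gradient descent on p only

# As p grows along its limiting direction, the softmax concentrates on a
# subset of tokens, illustrating attention acting as a token selector.
print("attention weights:", jax.nn.softmax(X @ W @ p))
```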
Bio:
Samet Oymak is an assistant professor of Electrical and Computer Engineering at the University of California, Riverside, and will be joining the University of Michigan in the fall. Prior to UCR, he spent a few years in industry and did a postdoc at UC Berkeley as a Simons Fellow. He obtained his PhD from Caltech in 2015, for which he received the Charles Wilts Prize for the best departmental thesis. At UCR, he received an NSF CAREER award as well as a Research Scholar award from Google.
Website: sodalab.engin.umich.edu
This talk is part of the Electrical Engineering Systems Seminar Series, sponsored by the Division of Engineering and Applied Science.