Math description (Gemini)

1. Introduction to Attention

At its core, attention in machine learning can be conceptualized as a mechanism that allows a model to dynamically weigh the importance of different parts of an input sequence (or set of features) when producing an output. Instead of relying on a fixed-size hidden state to summarize all past information (as in traditional RNNs), attention provides a way to "look back" at the entire input sequence and selectively focus on the most relevant parts.

The general idea is that when producing an output at a certain position, the model assigns varying scores (attention weights) to all input positions. These weights determine how much influence each input position has on the output.

2. Mathematical Definition of Attention

2.1. Abstract Formulation

At its core, an attention mechanism computes an output representation for a given "query" by selectively aggregating information from a set of "value" representations. The selection and weighting process is guided by the interaction between the query and a corresponding set of "key" representations.

So assume we have a query vector qq and a set of key-value pairs {(k1,v1),(k2,v2),,(kn,vn)}\{(k_1, v_1), (k_2, v_2), \dots, (k_n, v_n)\}. The output o=o(q)o=o(q) is a weighted sum of transformed value vectors vjv_j:

o(q)=j=1nαjg(vj)o(q) = \sum_{j=1}^{n} \alpha_j g(v_j)

where:

Putting everything together, we get

Theorem 1

Consider a row vector a=[a1,,an]R1×na=[a_1,\dots,a_n]\in \R^{1\times n} and mm row vectors b1,,bmR1×nb_1,\dots,b_m \in \R^{1\times n}, arranged as a matrix BRm×nB\in \R^{m\times n}, and nn row vectors c1,,cnR1×pc_1,\dots,c_n \in \R^{1\times p} arranged as matrix CRn×pC\in \R^{n\times p}.

  1. aB=[ab1,,abn]R1×naB^\prime = [a\bullet b_1,\dots,a\bullet b_n] \in R^{1\times n} where \bullet denotes the scalar product.
  2. aC=a1c1++ancnR1×paC = a_1 \cdot c_1+\dots +a_n\cdot c_n \in \R^{1\times p}
Definition 1: Softmax function

For a row vector uR1×nu \in \R^{1\times n} and a matrix URm×nU\in \R^{m\times n} we define the softmax functions as follows:

softmax(u)=[eu1i=1neui,,euni=1neui]\text{softmax}(u)=[\frac{e^{u_1}}{\sum_{i=1}^n e^{u_i}},\dots,\frac{e^{u_n}}{\sum_{i=1}^n e^{u_i}}]

and

softmax(U)Rm×n\text{softmax}(U) \in \R^{m\times n}

is the application of softmax(u)\text{softmax}(u) to each row uu of UU.

Definition 2: A general attention mechanism

Consider nn pairs of key row vectors k1,,knR1×dkk_1,\dots,k_n \in \R^{1 \times d_k} and value row vectors v1,,vnR1×dvv_1,\dots,v_n \in \R^{1 \times d_v} (think of fixed, but change with input sentence). For a given query row vector qR1×dqq \in \R^{1 \times d_q} an output vector oRdoo \in \R^{d_o} is calculated by forming the weighted sum of the nn transformed value vectors:

o=o(q,k1,..,kn,v1,...,vn)=j=1nαj(q,k1,...,kn)g(vj)\begin{array}{lll} o &=& o(q, k_1,..,k_n, v_1,...,v_n)\\[0.5em] &=& \sum_{j=1}^{n} \alpha_j(q,k_1,...,k_n) g(v_j) \end{array}

where for each j=1,,nj=1,\dots,n

αj=αj(q,k1,...,kn)=ef(q,kj)i=1nef(q,ki)\begin{array}{lll} \alpha_j &=& \alpha_j(q,k_1,...,k_n) \\[0.5em] &=& \frac{e^{f(q, k_j)}}{\sum_{i=1}^{n} e^{f(q, k_i)}}\quad \end{array}

The values f(q,k1),,f(q,kn)f(q,k_1),\dots,f(q,k_n) are the attention scores between the query vector qq and the key vectors k1,knk_1,\dots k_n. For this reason, ff is also called scoring function or alignment function. Their normalised version (the α1,,αn\alpha_1,\dots,\alpha_n) are called attention weights and determine which transformed vectors g(v1),,g(vn)g(v_1),\dots,g(v_n) are selected (or attended to) in the weighted sum. The function gg is called the value transformation function.

Think of oo as a new embedding of qq that has more semantic meaning within the input, as it selectively mixes the output values according to the importance they have for the query. The nn attention/importance scores are computed by comparing the nn key vectors with the query vector.

Often, the nn key row vectors are arranged in a matrix KRn×dkK\in \R^{n\times d_k} and the nn value row vectors in a matrix VRn×dvV\in \R^{n\times d_v}. If there are a total of mm query row vectors, they are arranged in a matrix QRm×dqQ\in \R^{m\times d_q}.

Theorem 2

The attention weights α1,,αn\alpha_1,\dots,\alpha_n form a probability distribution over the transformed value vectors g(v1),,g(vn)g(v_1),\dots,g(v_n). That is, 0αj10\leq \alpha_j\leq 1 for each j=1,,nj=1,\dots,n and

j=1nαj=1\sum_{j=1}^n \alpha_j =1
Definition 3: Self-attention mechanism.

For a self-attention mechanism, the key, value and query vectors are derived from a sequence of token embeddings (the input sequence). Assume there are ll token embeddings x1,,xlR1×dmodelx_1,\dots,x_l \in\R^{1\times d_{model}}. Then there are three matrices WKRdmodel×dk,WVRdmodel×dvW^K \in \in\R^{d_{model} \times d_k}, W^V\in\R^{d_{model} \times d_v} and WQRdmodel×dqW^Q\in\R^{d_{model} \times d_q} such that

ki=xiWKR1×dk(i=1,l)vi=xiWVR1×dv(i=1,l)qi=xiWQR1×dq(i=1,l)\begin{array}{lll} k_i &=& x_i \cdot W^K \in\R^{1\times d_k}\quad (i=1,\dots\,l)\\ v_i &=& x_i \cdot W^V \in\R^{1\times d_v}\quad (i=1,\dots\,l)\\ q_i &=& x_i \cdot W^Q \in\R^{1\times d_q}\quad (i=1,\dots\,l) \end{array}

Or, using key, value and query matrices, and the embedded vectors x1,,xlx_1,\dots,x_l embedded as rows in the matrix XRl×dmodelX \in\R^{l\times d_{model}}, we have

K=XWKRl×dkV=XWVRl×dvQ=XWQRl×dq\begin{array}{lll} K &=& X \cdot W^K \in\R^{l\times d_k}\\ V &=& X \cdot W^V \in\R^{l\times d_v}\\ Q &=& X \cdot W^Q \in\R^{l\times d_q} \end{array}
notes
  1. The token embeddings XX do not have to be the ones fed into the first encoder layer, as the output of each encoder produces other token embeddings, which are then fed into the next layer.
  2. The matrices K,VK, V and QQ represents information in the token embedding XX. The matrices WK,WVW^K, W^V and WQW^Q are learned, and once learned do not change.

The choice of the global functions f(q,k)f(q, k) and g(v)g(v) leads to different specific attention mechanism. One of the most prominent one is Scaled Dot-Product Attention, where f(q,k)f(q,k) is the dot product between the vectors qq and vv, and scaled by some content for keeping the numbers in a good range.

Definition 4: Scaled Dot-Product Attention

Consider nn pairs of key vectors k1,,knR1×dkk_1,\dots,k_n \in \R^{1\times d_k} and value vectors v1,,vnR1×dvv_1,\dots,v_n \in \R^{1\times d_v}. The query vectors are also in qR1×dkq \in \R^{1\times d_k}. An attention mechanism where the scoring function ff is the scalar product, and the value transformation function gg is the identity function, is called a scaled dot-product attention mechansim. Thus, for j=1,,nj=1,\dots,n and any qq we have

f(q,kj)=qkjdkg(vj)=vj\begin{array}{lll} f(q,k_j)&=&\frac{q\bullet k_j}{\sqrt{d_k}}\\[0.5em] g(v_j)&=& v_j \end{array}
Theorem 3

Consider nn pairs of key row vectors and row value vectors arranged in matrices KRn×dkK\in \R^{n\times d_k} and VRn×dvV\in \R^{n\times d_v}, and mm query vectors q1,,qmR1×dkq_1,\dots,q_m \in \R^{1\times d_k} arranged in the matrix QRm×dkQ\in \R^{m\times d_k}. In the case of scaled dot-product attention, we have

  1. qjKR1×nq_j \cdot K^\prime \in \R^{1\times n} are the nn attention scores between query qjq_j and the nn key vectors.
  2. α=softmax(qjK)R1×n\alpha = \text{softmax}(q_j \cdot K^\prime) \in \R^{1\times n} are the nn attention weights between query qjq_j and the nn key vectors.
  3. o=softmax(qjK)VR1×dvo=\text{softmax}(q_j \cdot K^\prime)\cdot V \in \R^{1\times d_v} is the weighted sum of the nn value vectors, weighted by the attention weights between query qjq_j and the nn key vectors.
  4. Row jj of the matrix softmax(QKdk)Rm×n\text{softmax}(\frac{QK^\prime}{\sqrt{d_k}}) \in \R^{m\times n} contains the nn the attention weights between query qjq_j and the nn key vectors.
  5. Row jj of the matrix O=Attention(Q,K,V)=softmax(QKdk)VRm×dvO=\text{Attention}(Q,K,V)=\text{softmax}(\frac{QK^\prime}{\sqrt{d_k}})\cdot V \in \R^{m\times d_v} is the weighted sum of the value vectors, weighted by the attention weights between query qjq_j and the nn key vectors.

2.2. Scaled Dot-Product Attention

The most prominent attention mechanism used in Transformers is the Scaled Dot-Product Attention. In this model:

The matrix form of Scaled Dot-Product Attention is:

Attention(Q,K,V)=softmax(QKTdk)V\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V

Here:

Self-Attention: A special case of attention is self-attention, where the queries, keys, and values are all derived from the same input sequence. For an input sequence represented by a matrix XRL×dmodelX \in \mathbb{R}^{L \times d_{\text{model}}} (where LL is sequence length, dmodeld_{\text{model}} is embedding dimension), we derive Q,K,VQ, K, V using linear transformations: Q=XWQQ = X W^Q, K=XWKK = X W^K, V=XWVV = X W^V, where WQ,WKRdmodel×dkW^Q, W^K \in \mathbb{R}^{d_{\text{model}} \times d_k} and WVRdmodel×dvW^V \in \mathbb{R}^{d_{\text{model}} \times d_v} are learnable weight matrices.

2.3. Other Attention Mechanisms

3. Transformer Architecture Components

The Transformer model ("Attention Is All You Need", Vaswani et al., 2017) is built upon the self-attention mechanism.

3.1. Input Embedding and Positional Encoding

3.2. Multi-Head Attention (MHA)

Instead of performing a single attention function with dmodeld_{\text{model}}-dimensional keys, values, and queries, MHA linearly projects the queries, keys, and values hh times with different, learned linear projections to dkd_k, dkd_k, and dvd_v dimensions, respectively (typically dk=dv=dmodel/hd_k = d_v = d_{\text{model}}/h). Attention is then performed in parallel for each of these projected versions ("heads"). The outputs of the hh heads are concatenated and once again projected, resulting in the final values.

MultiHead(Q,K,V)=Concat(head1,,headh)WO\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)W^O

where each head is:

headi=Attention(QWiQ,KWiK,VWiV)\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)

The projection matrices are WiQRdmodel×dkW_i^Q \in \mathbb{R}^{d_{\text{model}} \times d_k}, WiKRdmodel×dkW_i^K \in \mathbb{R}^{d_{\text{model}} \times d_k}, WiVRdmodel×dvW_i^V \in \mathbb{R}^{d_{\text{model}} \times d_v}, and WORhdv×dmodelW^O \in \mathbb{R}^{h d_v \times d_{\text{model}}}. MHA allows the model to jointly attend to information from different representation subspaces at different positions.

3.3. Position-wise Feed-Forward Networks (FFN)

Each encoder and decoder layer contains a fully connected feed-forward network, which is applied to each position separately and identically. This consists of two linear transformations with a ReLU activation in between:

FFN(x)=max(0,xW1+b1)W2+b2\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2

The input and output dimensionality is dmodeld_{\text{model}}, and the inner-layer dimensionality dffd_{ff} is typically larger (e.g., dff=4dmodeld_{ff} = 4 d_{\text{model}}).

3.4. Residual Connections and Layer Normalization

Each sub-layer (self-attention or FFN) in an encoder or decoder layer has a residual connection around it, followed by layer normalization. The output of a sub-layer is:

Output=LayerNorm(x+Sublayer(x))\text{Output} = \text{LayerNorm}(x + \text{Sublayer}(x))

(Note: Pre-LN variant applies LayerNorm before the sublayer: x+Sublayer(LayerNorm(x))x + \text{Sublayer}(\text{LayerNorm}(x))).

3.5. Encoder and Decoder Stacks

Masked Self-Attention: In the decoder, to maintain the autoregressive property (i.e., prediction of the current token can only depend on previous tokens), future positions are masked out in the self-attention mechanism. This is typically done by adding -\infty to the scaled dot-product scores corresponding to future positions before the softmax, effectively making their attention weights zero. If MM is the mask matrix where Mij=0M_{ij} = 0 if attention is allowed and Mij=M_{ij} = -\infty if not, then:

MaskedAttention(Q,K,V)=softmax(QKTdk+M)V\text{MaskedAttention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}} + M \right)V

4. Mathematical Properties

4.1. Computational Complexity

The overall FLOPs for training a Transformer are often approximated as 6×(Number of Tokens)×(Number of Parameters)6 \times (\text{Number of Tokens}) \times (\text{Number of Parameters}), ignoring attention FLOPs for shorter contexts. The attention FLOPs become dominant when sequence length T>8DT > \approx 8D where DD is dmodeld_{\text{model}}.

4.2. Differentiability and Gradients

All components of the Transformer are differentiable, allowing end-to-end training using backpropagation. The gradients of the attention mechanism with respect to Q,K,VQ, K, V and the input XX (for self-attention) can be derived using standard matrix calculus.

Let LL be the loss function. The output of scaled dot-product attention is O=AVO = A V, where A=softmax(S)A = \text{softmax}(S) and S=QKTdkS = \frac{QK^T}{\sqrt{d_k}}. The gradients involve propagating through the softmax and matrix multiplications. For example (conceptually, actual derivations are involved):

The "Reversed Attention" perspective (Katz and Wolf) provides insights into the backward pass. For a VJP uu (gradient from upstream layers), the backward pass for the QKTQ K^T operation can be interpreted as an attention mechanism itself: VJPK=uTQK\text{VJP}_K = u^T Q \cdot K and VJPQ=uKQ\text{VJP}_Q = u K \cdot Q. The backward pass of value mixing AVA V is VJPV=ATu\text{VJP}_V = A^T u and VJPA=uVT\text{VJP}_A = u V^T.

4.3. Permutation Invariance/Equivariance

Standard QKV attention is equivariant with respect to re-ordering (permuting) the queries and invariant to re-ordering the key-value pairs. If PmP_m and PnP_n are permutation matrices: Attention(PmQ,PnK,PnV)=PmAttention(Q,K,V)\text{Attention}(P_m Q, P_n K, P_n V) = P_m \text{Attention}(Q, K, V). Self-attention on an input matrix XX is permutation equivariant: if rows of XX are permuted by PP, the rows of the output are also permuted by PP. This is why positional encodings are necessary to provide order information.

4.4. Stability and Infinite Limits

Research like "The Shaped Transformer" explores the behavior of Transformers in infinite depth and width limits.

5. Advanced Mathematical Perspectives

5.1. Connection to Kernel Methods

Attention mechanisms can be related to kernel methods. The compatibility score f(q,kj)f(q, k_j) can be seen as a learned kernel function. For scaled dot-product attention, the score is qTkjdk\frac{q^T k_j}{\sqrt{d_k}}. If queries and keys are transformations of some inputs xi,xjx_i, x_j via functions ϕQ,ϕK\phi_Q, \phi_K (e.g., q=ϕQ(xi)q = \phi_Q(x_i), k=ϕK(xj)k = \phi_K(x_j)), then the score is ϕQ(xi)TϕK(xj)dk\frac{\phi_Q(x_i)^T \phi_K(x_j)}{\sqrt{d_k}}. This is an inner product in a feature space. If ϕQ=ϕK=ϕ\phi_Q = \phi_K = \phi, it defines a Mercer kernel K(xi,xj)=ϕ(xi),ϕ(xj)K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle. The work "Approximation of relation functions and attention mechanisms" (Altabaa & Lafferty) shows that attention's inner product ϕθ(q),ψθ(xi)\langle \phi_\theta(q), \psi_\theta(x_i) \rangle can approximate general (even asymmetric) relation functions, connecting attention to reproducing kernel Hilbert/Banach spaces. The "kernel" is implicitly learned by the transformations WQ,WKW^Q, W^K.

5.2. Metric Learning and Heat Diffusion Perspective

The "Metric-Attention" mechanism (ArXiv:2412.18288) decomposes self-attention into:

  1. Learning a pseudo-metric d(xi,xj)d(x_i, x_j) between tokens.
  2. An information propagation step using this metric. The score is given by Sij=12σ2d(xiWQ,xjWK)2S_{ij} = -\frac{1}{2\sigma^2} d(x_i W^Q, x_j W^K)^2, where d(,)d(\cdot, \cdot) is typically a squared Euclidean distance. This formulation connects attention to Gaussian kernels if dd is Euclidean distance and is related to drift-diffusion processes or heat equations, where attention weights correspond to the Green's function of a heat equation on a graph.

5.3. Dynamical Systems and Gradient Flow (Conceptual)

Some research aims to interpret the layer-by-layer processing in Transformers through the lens of dynamical systems or gradient flows.

5.4. Generalized Attention Mechanisms (Conceptual)

The basic softmax(scores)V structure can be generalized.

6. Conclusion

The Transformer model, while empirically powerful, is grounded in a rich set of mathematical concepts spanning linear algebra, calculus, probability, and increasingly, connections to fields like kernel methods, optimal transport, and dynamical systems. The core attention mechanism, particularly scaled dot-product attention, provides a flexible and efficient way to model dependencies within and between sequences. Ongoing research continues to explore its mathematical properties, limitations, and generalizations, pushing the boundaries of AI capabilities. Understanding this mathematical framework is crucial for both utilizing and advancing these transformative models.