Math description (Gemini)

1. Introduction to Attention

At its core, attention in machine learning can be conceptualized as a mechanism that allows a model to dynamically weigh the importance of different parts of an input sequence (or set of features) when producing an output. Instead of relying on a fixed-size hidden state to summarize all past information (as in traditional RNNs), attention provides a way to "look back" at the entire input sequence and selectively focus on the most relevant parts.

The general idea is that when producing an output at a certain position, the model assigns varying scores (attention weights) to all input positions. These weights determine how much influence each input position has on the output.

2. Mathematical Definition of Attention

2.1. Abstract Formulation

An attention mechanism computes an output representation for a given "query" by selectively aggregating information from a set of "value" representations. The selection and weighting process is guided by the interaction between the query and a corresponding set of "key" representations.

Assume we have a query vector $q$ and a set of key-value pairs $\{(k_1, v_1), (k_2, v_2), \dots, (k_n, v_n)\}$ . The output $o=o(q)$ is a weighted sum of the transformed value vectors $g(v_j)$ :

o(q) = \sum_{j=1}^{n} \alpha_j g(v_j)

where:

$q\in \R^{d_q}$ . A query vector $q$ represents the current context for which an updated representation is sought. It poses a question to the set of keys: "Based on my current state, which pieces of information from the available values $v_1,\dots,v_n$ are most relevant?". What is relevant is determined by the (learned) attention weights $\alpha_1,\dots,\alpha_n$ . Often $d_q=d_k$ .
$k_j\in \R^{d_k}$ is a key vector associated with the $j$ -th input element (token embedding).
$v_j\in \R^{d_v}$ is a value vector associated with the $j$ -th input element.
$o \in \R^{d_o}$ is the output vector for the query $q$ . It can be read as a context-aware representation of $q$ : a mixture of the (transformed) values, weighted by their relevance to $q$ .
The Value Transformation Function $g$ . The function
$g:\R^{d_v} \rightarrow \R^{d_o}$
transforms the value vector $v_j$ into an output space of dimension $d_o$ before aggregation. In many widely used attention mechanisms, $g$ is the identity function, i.e., $g(v_j)=v_j$ . However, $g$ can also represent a learnable linear transformation or a more complex function. For instance, in the original Transformer, the value matrix $V$ itself is obtained through a linear projection of input embeddings, effectively incorporating $g$ into the generation of $V$ .
Attention weights. $\alpha_j\in \R$ is the attention weight for the $j$ -th input element, indicating its relevance to the query $q$ . The attention weights $\alpha_j$ are typically non-negative and sum to one ( $\sum_j \alpha_j = 1$ ), thus forming a probability distribution over the value vectors.
The Scoring/Alignment/Attention Function $f$ . The attention weights $\alpha_j$ are derived from alignment or attention scores $e_j$ , which quantify the compatibility or relevance between the query $q$ and each key $k_j$ . This is captured by a scoring function $f: \R^{d_q} \times \R^{d_k} \rightarrow \R$ , which maps each query-key pair to an attention score $e_j=f(q,k_j)$ . To produce the attention weights (a probability distribution), typically softmax is applied:
$\alpha_j=\alpha_j(q,k_1,...,k_n) = \frac{\exp(e_j)}{\sum_{i=1}^{n} \exp(e_i)} =\frac{\exp(f(q, k_j))}{\sum_{i=1}^{n} \exp(f(q, k_i))}$

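Putting everything together, we arrive at the formal definitions below, preceded by two auxiliary facts about row vectors and matrices.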

Theorem 1

Consider a row vector $a=[a_1,\dots,a_n]\in \R^{1\times n}$ and $m$ row vectors $b_1,\dots,b_m \in \R^{1\times n}$ , arranged as a matrix $B\in \R^{m\times n}$ , and $n$ row vectors $c_1,\dots,c_n \in \R^{1\times p}$ arranged as matrix $C\in \R^{n\times p}$ .

$aB^\prime = [a\bullet b_1,\dots,a\bullet b_m] \in \R^{1\times m}$ where $B^\prime$ denotes the transpose of $B$ and $\bullet$ denotes the scalar product.
$aC = a_1 \cdot c_1+\dots +a_n\cdot c_n \in \R^{1\times p}$
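
The two identities of Theorem 1 can be checked on a small example. The following is a minimal sketch in plain Python; the helper names (`dot`, `row_times_transpose`, `row_times_matrix`) and the numbers are illustrative assumptions, not part of the text above.

```python
def dot(u, v):
    """Scalar product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))

def row_times_transpose(a, B):
    """a B' for a row vector a (length n) and a matrix B with m rows of length n.
    The result has one entry per row of B: the scalar product of a with that row."""
    return [dot(a, b) for b in B]

def row_times_matrix(a, C):
    """a C for a row vector a (length n) and a matrix C with n rows of length p.
    The result is the linear combination a_1 c_1 + ... + a_n c_n of the rows of C."""
    p = len(C[0])
    return [sum(a[i] * C[i][j] for i in range(len(a))) for j in range(p)]

a = [1.0, 2.0, 3.0]                         # n = 3
B = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]      # m = 2 rows b_1, b_2
C = [[1.0, 1.0], [0.0, 2.0], [3.0, 0.0]]    # n = 3 rows c_1, c_2, c_3 with p = 2

scores = row_times_transpose(a, B)   # [a . b_1, a . b_2]
mix = row_times_matrix(a, C)         # 1*c_1 + 2*c_2 + 3*c_3
```

The first identity is what later turns "compare one query with all keys" into a single matrix product, and the second is what turns "weighted sum of value vectors" into one.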

Definition 1: Softmax function

For a row vector $u \in \R^{1\times n}$ and a matrix $U\in \R^{m\times n}$ we define the softmax functions as follows:

\text{softmax}(u)=[\frac{e^{u_1}}{\sum_{i=1}^n e^{u_i}},\dots,\frac{e^{u_n}}{\sum_{i=1}^n e^{u_i}}]

and

\text{softmax}(U) \in \R^{m\times n}

is the application of $\text{softmax}(u)$ to each row $u$ of $U$ .
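
As a minimal sketch of Definition 1 in plain Python: the function names are illustrative, and subtracting the row maximum before exponentiating is an implementation detail assumed here for numerical stability (it does not change the result, since a common shift cancels in the ratio).

```python
import math

def softmax(u):
    """Softmax of a row vector u."""
    shift = max(u)  # assumption: stabilising shift, softmax(u) == softmax(u - shift)
    exps = [math.exp(x - shift) for x in u]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_rows(U):
    """Row-wise softmax of a matrix U given as a list of rows."""
    return [softmax(row) for row in U]

U = [[1.0, 2.0, 3.0],
     [1000.0, 1000.0, 1000.0]]  # would overflow exp() without the shift
A = softmax_rows(U)
```

Each row of `A` is non-negative and sums to one; larger entries of `U` receive larger weights, and equal entries receive equal weights.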

Definition 2: A general attention mechanism

Consider $n$ pairs of key row vectors $k_1,\dots,k_n \in \R^{1 \times d_k}$ and value row vectors $v_1,\dots,v_n \in \R^{1 \times d_v}$ (think of them as fixed for a given input sentence; they change when the input changes). For a given query row vector $q \in \R^{1 \times d_q}$ an output vector $o \in \R^{1 \times d_o}$ is calculated by forming the weighted sum of the $n$ transformed value vectors:

\begin{array}{lll} o &=& o(q, k_1,..,k_n, v_1,...,v_n)\\[0.5em] &=& \sum_{j=1}^{n} \alpha_j(q,k_1,...,k_n) g(v_j) \end{array}

where for each $j=1,\dots,n$

\begin{array}{lll} \alpha_j &=& \alpha_j(q,k_1,...,k_n) \\[0.5em] &=& \frac{e^{f(q, k_j)}}{\sum_{i=1}^{n} e^{f(q, k_i)}}\quad \end{array}

The values $f(q,k_1),\dots,f(q,k_n)$ are the attention scores between the query vector $q$ and the key vectors $k_1,\dots,k_n$ . For this reason, $f$ is also called the scoring function or alignment function. Their normalised versions (the $\alpha_1,\dots,\alpha_n$ ) are called attention weights and determine how strongly each of the transformed vectors $g(v_1),\dots,g(v_n)$ is selected (or attended to) in the weighted sum. The function $g$ is called the value transformation function.

Think of $o$ as a new embedding of $q$ that has more semantic meaning within the input, as it selectively mixes the output values according to the importance they have for the query. The $n$ attention/importance scores are computed by comparing the $n$ key vectors with the query vector.
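
Definition 2 can be sketched directly, with the scoring function $f$ and the value transformation $g$ passed in as arguments. This is a minimal illustration in plain Python, not a reference implementation; the names `attention` and `dot_score` and the example vectors are assumptions made for the sketch.

```python
import math

def attention(q, keys, values, f, g=lambda v: v):
    """General attention for one query.

    f(q, k) -> float is the scoring function, g(v) -> list is the value
    transformation (identity by default). Returns (output, weights)."""
    scores = [f(q, k) for k in keys]                 # e_j = f(q, k_j)
    shift = max(scores)                              # stabilising shift (assumption)
    exps = [math.exp(e - shift) for e in scores]
    total = sum(exps)
    weights = [e / total for e in exps]              # alpha_j
    transformed = [g(v) for v in values]             # g(v_j)
    d_o = len(transformed[0])
    output = [sum(w * t[c] for w, t in zip(weights, transformed))
              for c in range(d_o)]                   # sum_j alpha_j g(v_j)
    return output, weights

def dot_score(q, k):
    """One possible choice of f: the plain scalar product."""
    return sum(a * b for a, b in zip(q, k))

q = [1.0, 0.0]
keys = [[10.0, 0.0], [0.0, 10.0]]     # the first key is aligned with q
values = [[1.0, 2.0], [3.0, 4.0]]
o, alpha = attention(q, keys, values, dot_score)
```

Because the first key matches the query far better than the second, almost all weight goes to the first value and `o` is close to `[1.0, 2.0]`.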

Often, the $n$ key row vectors are arranged in a matrix $K\in \R^{n\times d_k}$ and the $n$ value row vectors in a matrix $V\in \R^{n\times d_v}$ . If there are a total of $m$ query row vectors, they are arranged in a matrix $Q\in \R^{m\times d_q}$ .

Theorem 2

The attention weights $\alpha_1,\dots,\alpha_n$ form a probability distribution over the transformed value vectors $g(v_1),\dots,g(v_n)$ . That is, $0\leq \alpha_j\leq 1$ for each $j=1,\dots,n$ and

\sum_{j=1}^n \alpha_j =1

Definition 3: Self-attention mechanism.

For a self-attention mechanism, the key, value and query vectors are derived from a sequence of token embeddings (the input sequence). Assume there are $l$ token embeddings $x_1,\dots,x_l \in\R^{1\times d_{model}}$ . Then there are three matrices $W^K \in\R^{d_{model} \times d_k}, W^V\in\R^{d_{model} \times d_v}$ and $W^Q\in\R^{d_{model} \times d_q}$ such that

\begin{array}{lll} k_i &=& x_i \cdot W^K \in\R^{1\times d_k}\quad (i=1,\dots,l)\\ v_i &=& x_i \cdot W^V \in\R^{1\times d_v}\quad (i=1,\dots,l)\\ q_i &=& x_i \cdot W^Q \in\R^{1\times d_q}\quad (i=1,\dots,l) \end{array}

Or, using key, value and query matrices, and the embedded vectors $x_1,\dots,x_l$ embedded as rows in the matrix $X \in\R^{l\times d_{model}}$ , we have

\begin{array}{lll} K &=& X \cdot W^K \in\R^{l\times d_k}\\ V &=& X \cdot W^V \in\R^{l\times d_v}\\ Q &=& X \cdot W^Q \in\R^{l\times d_q} \end{array}

Notes

The token embeddings $X$ do not have to be the ones fed into the first encoder layer, as each encoder layer outputs new token embeddings, which are then fed into the next layer.
The matrices $K, V$ and $Q$ represent information in the token embeddings $X$ . The matrices $W^K, W^V$ and $W^Q$ are learned, and once learned do not change.

The choice of the functions $f(q, k)$ and $g(v)$ leads to different specific attention mechanisms. One of the most prominent is Scaled Dot-Product Attention, where $f(q,k)$ is the dot product between the vectors $q$ and $k$ , scaled by a constant to keep the numbers in a good range.

Definition 4: Scaled Dot-Product Attention

Consider $n$ pairs of key vectors $k_1,\dots,k_n \in \R^{1\times d_k}$ and value vectors $v_1,\dots,v_n \in \R^{1\times d_v}$ . The query vectors have the same dimension as the keys, $q \in \R^{1\times d_k}$ . An attention mechanism where the scoring function $f$ is the scalar product scaled by $1/\sqrt{d_k}$ , and the value transformation function $g$ is the identity function, is called a scaled dot-product attention mechanism. Thus, for $j=1,\dots,n$ and any $q$ we have

\begin{array}{lll} f(q,k_j)&=&\frac{q\bullet k_j}{\sqrt{d_k}}\\[0.5em] g(v_j)&=& v_j \end{array}

Theorem 3

Consider $n$ pairs of key and value row vectors arranged in matrices $K\in \R^{n\times d_k}$ and $V\in \R^{n\times d_v}$ , and $m$ query vectors $q_1,\dots,q_m \in \R^{1\times d_k}$ arranged in the matrix $Q\in \R^{m\times d_k}$ . In the case of scaled dot-product attention, we have

$\frac{q_j \cdot K^\prime}{\sqrt{d_k}} \in \R^{1\times n}$ are the $n$ attention scores between query $q_j$ and the $n$ key vectors.
$\alpha = \text{softmax}(\frac{q_j \cdot K^\prime}{\sqrt{d_k}}) \in \R^{1\times n}$ are the $n$ attention weights between query $q_j$ and the $n$ key vectors.
$o=\text{softmax}(\frac{q_j \cdot K^\prime}{\sqrt{d_k}})\cdot V \in \R^{1\times d_v}$ is the weighted sum of the $n$ value vectors, weighted by the attention weights between query $q_j$ and the $n$ key vectors.
Row $j$ of the matrix $\text{softmax}(\frac{QK^\prime}{\sqrt{d_k}}) \in \R^{m\times n}$ contains the $n$ attention weights between query $q_j$ and the $n$ key vectors.
Row $j$ of the matrix $O=\text{Attention}(Q,K,V)=\text{softmax}(\frac{QK^\prime}{\sqrt{d_k}})\cdot V \in \R^{m\times d_v}$ is the weighted sum of the value vectors, weighted by the attention weights between query $q_j$ and the $n$ key vectors.
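
The matrix form of Theorem 3 can be sketched in a few lines of plain Python and compared against the single-query computation. The helper names and the toy matrices are assumptions made for this illustration; a practical implementation would use an array library instead of nested lists.

```python
import math

def matmul(A, B):
    """Matrix product of two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def softmax_rows(S):
    out = []
    for row in S:
        shift = max(row)  # stabilising shift, does not change the result
        exps = [math.exp(x - shift) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def scaled_dot_product_attention(Q, K, V):
    """Returns (O, A) with A = softmax(Q K' / sqrt(d_k)) and O = A V."""
    d_k = len(K[0])
    S = [[s / math.sqrt(d_k) for s in row] for row in matmul(Q, transpose(K))]
    A = softmax_rows(S)
    return matmul(A, V), A

Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # m = 3 queries, d_k = 2
K = [[1.0, 0.0], [0.0, 1.0]]               # n = 2 keys
V = [[1.0, 0.0, 2.0], [0.0, 1.0, 4.0]]     # n = 2 values, d_v = 3

O, A = scaled_dot_product_attention(Q, K, V)
# The third query on its own, treated as a 1 x d_k matrix:
O_single, A_single = scaled_dot_product_attention([Q[2]], K, V)
```

The third query scores both keys equally, so its weights are `[0.5, 0.5]` and its output row is the plain average of the two value rows. Computing it alone or as row 3 of the batched product gives the same result.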

2.2. Scaled Dot-Product Attention

The most prominent attention mechanism used in Transformers is the Scaled Dot-Product Attention. In this model:

The queries $Q$ , keys $K$ , and values $V$ are packed into matrices. Let $Q \in \mathbb{R}^{m \times d_k}$ , $K \in \mathbb{R}^{n \times d_k}$ , and $V \in \mathbb{R}^{n \times d_v}$ , where $m$ is the number of queries, $n$ is the number of key-value pairs, $d_k$ is the dimension of queries and keys, and $d_v$ is the dimension of values.
The compatibility function $f(q, k)$ is a dot product between the query and key vectors.
A scaling factor of $1/\sqrt{d_k}$ is applied to the dot product to prevent very large values which could lead to vanishing gradients in the softmax function.
The function $g(v)$ is the identity function.

The matrix form of Scaled Dot-Product Attention is:

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V

Here:

$QK^T \in \mathbb{R}^{m \times n}$ is the matrix of attention scores (unnormalized weights). Each element $(i,j)$ is the dot product of the $i$ -th query with the $j$ -th key.
$\frac{1}{\sqrt{d_k}}$ is the scaling factor.
The $\text{softmax}$ function is applied row-wise to the scaled attention scores, ensuring that the weights for each query sum to 1.
The resulting matrix of weights (often denoted $A$ ) has dimensions $\mathbb{R}^{m \times n}$ .
This weight matrix $A$ is then multiplied by the value matrix $V \in \mathbb{R}^{n \times d_v}$ to produce the output $O \in \mathbb{R}^{m \times d_v}$ . Each row of $O$ is a weighted sum of the rows of $V$ .

Self-Attention: A special case of attention is self-attention, where the queries, keys, and values are all derived from the same input sequence. For an input sequence represented by a matrix $X \in \mathbb{R}^{L \times d_{\text{model}}}$ (where $L$ is sequence length, $d_{\text{model}}$ is embedding dimension), we derive $Q, K, V$ using linear transformations: $Q = X W^Q$ , $K = X W^K$ , $V = X W^V$ , where $W^Q, W^K \in \mathbb{R}^{d_{\text{model}} \times d_k}$ and $W^V \in \mathbb{R}^{d_{\text{model}} \times d_v}$ are learnable weight matrices.
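
A minimal sketch of self-attention in plain Python follows. The projection matrices are filled with random numbers purely for illustration (in a trained model they are learned), and all names and sizes are assumptions chosen for the example.

```python
import math
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def softmax_rows(S):
    out = []
    for row in S:
        shift = max(row)
        exps = [math.exp(x - shift) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention: Q, K and V all come from the same X."""
    Q, K, V = matmul(X, W_q), matmul(X, W_k), matmul(X, W_v)
    d_k = len(W_k[0])
    S = [[s / math.sqrt(d_k) for s in row] for row in matmul(Q, transpose(K))]
    return matmul(softmax_rows(S), V)

rng = random.Random(0)  # fixed seed so the example is reproducible

def random_matrix(rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

seq_len, d_model, d_k, d_v = 4, 6, 3, 5
X = random_matrix(seq_len, d_model)
X[3] = list(X[0])                      # make token 4 identical to token 1
W_q = random_matrix(d_model, d_k)      # stand-ins for the learned W^Q, W^K, W^V
W_k = random_matrix(d_model, d_k)
W_v = random_matrix(d_model, d_v)

O = self_attention(X, W_q, W_k, W_v)   # shape: seq_len x d_v
```

Since no positional information is added in this sketch, the two identical tokens receive identical output rows, which illustrates why the Transformer needs positional encodings.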

2.3. Other Attention Mechanisms

Additive Attention (Bahdanau et al.): The compatibility score $f(q, k_j)$ is computed by a feed-forward network with a single hidden layer:
$f(q, k_j) = w_a^T \tanh(W_q q + W_k k_j + b)$
where $W_q$ , $W_k$ are weight matrices, $b$ is a bias term, and $w_a$ is a weight vector. This is often used in sequence-to-sequence models with RNNs.
Multiplicative Attention (Luong et al.): Several forms exist. A common one is:
$f(q, k_j) = q^T W_a k_j$
or simply $f(q, k_j) = q^T k_j$ (plain dot-product attention, i.e. the scoring function of Scaled Dot-Product Attention without the $1/\sqrt{d_k}$ scaling). The "general" form $q^T W_a k_j$ introduces a learnable matrix $W_a$ .
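
The three scoring functions above can be written side by side. This is a minimal sketch in plain Python using column-vector conventions as in the formulas; the function names and the hand-picked parameter values are assumptions for illustration (in practice $W_q$, $W_k$, $b$, $w_a$ and $W_a$ are learned).

```python
import math

def matvec(W, x):
    """Matrix-vector product W x, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def additive_score(q, k, W_q, W_k, b, w_a):
    """f(q, k) = w_a^T tanh(W_q q + W_k k + b)."""
    hidden = [math.tanh(a + c + bias)
              for a, c, bias in zip(matvec(W_q, q), matvec(W_k, k), b)]
    return sum(w * h for w, h in zip(w_a, hidden))

def general_score(q, k, W_a):
    """f(q, k) = q^T W_a k."""
    return sum(qi * wk for qi, wk in zip(q, matvec(W_a, k)))

def dot_score(q, k):
    """f(q, k) = q^T k."""
    return sum(qi * ki for qi, ki in zip(q, k))

q, k = [1.0, 2.0], [3.0, 4.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]

s_dot = dot_score(q, k)
s_general = general_score(q, k, identity)   # W_a = I reduces to the dot product
s_additive = additive_score(q, k, identity, zeros, [0.0, 0.0], [1.0, 1.0])
```

Any of these functions can serve as $f$ in the general attention definition; only the way the query-key compatibility is measured changes.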

3. Transformer Architecture Components

The Transformer model ("Attention Is All You Need", Vaswani et al., 2017) is built upon the self-attention mechanism.

3.1. Input Embedding and Positional Encoding

Token Embedding: Input tokens (words or sub-words) are converted into dense vectors of dimension $d_{\text{model}}$ using an embedding layer. For a token $t_i$ , its embedding is $x_i = \text{Embed}(t_i)$ .
Positional Encoding (PE): Since self-attention is permutation-equivariant (i.e., it doesn't inherently know the order of tokens), positional information must be injected. The original Transformer uses fixed sinusoidal positional encodings: $PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$ and $PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$ , where $pos$ is the position of the token in the sequence, and $i$ indexes the sine/cosine pairs ( $0 \le 2i < d_{\text{model}}$ ). The vector $PE_{pos}$ has the same dimension $d_{\text{model}}$ as the embeddings and is added to the token embedding: $x_{\text{input}} = x_{\text{embedding}} + x_{\text{positional}}$ with $x_{\text{positional}} = PE_{pos}$ . An important property is that for any fixed offset $k$ , $PE_{pos+k}$ can be represented as a linear function of $PE_{pos}$ : by the angle-addition formulas, each (sine, cosine) pair is rotated by an angle that depends only on $k$ .
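
The sinusoidal encoding and its linearity property can be sketched as follows. This is an illustrative plain-Python version that assumes an even $d_{\text{model}}$ ; the function names are made up for the example. The second function applies the angle-addition formulas to each (sine, cosine) pair, which is a linear map whose coefficients depend only on the offset $k$ .

```python
import math

def positional_encoding(pos, d_model):
    """Sinusoidal encoding of one position (assumes d_model is even)."""
    pe = [0.0] * d_model
    for two_i in range(0, d_model, 2):
        angle = pos / (10000 ** (two_i / d_model))
        pe[two_i] = math.sin(angle)
        pe[two_i + 1] = math.cos(angle)
    return pe

def shift_by_offset(pe, k):
    """Map PE_pos to PE_{pos+k} using only pe and k (not pos itself).

    Each (sin, cos) pair is rotated by the angle k * omega, where omega is
    the frequency of that pair. The map is linear in pe."""
    d_model = len(pe)
    out = [0.0] * d_model
    for two_i in range(0, d_model, 2):
        omega = 1.0 / (10000 ** (two_i / d_model))
        c, s = math.cos(k * omega), math.sin(k * omega)
        sin_p, cos_p = pe[two_i], pe[two_i + 1]
        out[two_i] = sin_p * c + cos_p * s       # sin(a + b)
        out[two_i + 1] = cos_p * c - sin_p * s   # cos(a + b)
    return out

d_model = 8
pe_5 = positional_encoding(5, d_model)
pe_8 = positional_encoding(8, d_model)
pe_5_shifted = shift_by_offset(pe_5, 3)   # should reproduce pe_8
```

Position 0 is encoded as alternating zeros and ones, and shifting the encoding of position 5 by an offset of 3 reproduces the encoding of position 8 up to rounding.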

3.2. Multi-Head Attention (MHA)

Instead of performing a single attention function with $d_{\text{model}}$ -dimensional keys, values, and queries, MHA linearly projects the queries, keys, and values $h$ times with different, learned linear projections to $d_k$ , $d_k$ , and $d_v$ dimensions, respectively (typically $d_k = d_v = d_{\text{model}}/h$ ). Attention is then performed in parallel for each of these projected versions ("heads"). The outputs of the $h$ heads are concatenated and once again projected, resulting in the final values.

\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)W^O

where each head is:

\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)

The projection matrices are $W_i^Q \in \mathbb{R}^{d_{\text{model}} \times d_k}$ , $W_i^K \in \mathbb{R}^{d_{\text{model}} \times d_k}$ , $W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_v}$ , and $W^O \in \mathbb{R}^{h d_v \times d_{\text{model}}}$ . MHA allows the model to jointly attend to information from different representation subspaces at different positions.
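
The multi-head formulas can be sketched in plain Python as below. Random matrices stand in for the learned projections, and all names and sizes are assumptions chosen to keep the example small; this is an illustration of the data flow, not an efficient implementation.

```python
import math
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def attention(Q, K, V):
    """Scaled dot-product attention, softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        shift = max(scores)
        exps = [math.exp(s - shift) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * v[c] for w, v in zip(weights, V))
                    for c in range(len(V[0]))])
    return out

def multi_head_attention(Q, K, V, heads, W_o):
    """heads is a list of (W_q, W_k, W_v) triples, one triple per head."""
    head_outputs = [attention(matmul(Q, W_q), matmul(K, W_k), matmul(V, W_v))
                    for W_q, W_k, W_v in heads]
    # Concat(head_1, ..., head_h): join the head outputs along the feature axis.
    concat = [[x for head in head_outputs for x in head[r]] for r in range(len(Q))]
    return matmul(concat, W_o)

rng = random.Random(0)

def random_matrix(rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

seq_len, d_model, h = 5, 8, 2
d_k = d_v = d_model // h
X = random_matrix(seq_len, d_model)
heads = [(random_matrix(d_model, d_k), random_matrix(d_model, d_k),
          random_matrix(d_model, d_v)) for _ in range(h)]
W_o = random_matrix(h * d_v, d_model)
out = multi_head_attention(X, X, X, heads, W_o)   # seq_len x d_model

# Sanity check: one head with identity projections is ordinary attention.
identity = [[1.0 if i == j else 0.0 for j in range(d_model)] for i in range(d_model)]
single = multi_head_attention(X, X, X, [(identity, identity, identity)], identity)
plain = attention(X, X, X)
```

With $d_k = d_v = d_{\text{model}}/h$ the concatenated heads have width $h d_v = d_{\text{model}}$ , so the output has the same shape as the input and layers can be stacked.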

3.3. Position-wise Feed-Forward Networks (FFN)

Each encoder and decoder layer contains a fully connected feed-forward network, which is applied to each position separately and identically. This consists of two linear transformations with a ReLU activation in between:

\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2

The input and output dimensionality is $d_{\text{model}}$ , and the inner-layer dimensionality $d_{ff}$ is typically larger (e.g., $d_{ff} = 4 d_{\text{model}}$ ).
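
As a minimal sketch of the FFN formula in plain Python: the weights below are hand-picked toy values (with $d_{\text{model}} = 2$ and $d_{ff} = 3$ rather than a realistic $4 d_{\text{model}}$ ), and the function names are assumptions for illustration.

```python
def ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2 for a single position (row vector x)."""
    hidden = [max(0.0, sum(xi * W1[i][j] for i, xi in enumerate(x)) + b1[j])
              for j in range(len(b1))]                      # ReLU(x W1 + b1)
    return [sum(hj * W2[j][c] for j, hj in enumerate(hidden)) + b2[c]
            for c in range(len(b2))]                        # hidden W2 + b2

def position_wise_ffn(X, W1, b1, W2, b2):
    """Apply the same FFN, with the same weights, to every position (row of X)."""
    return [ffn(x, W1, b1, W2, b2) for x in X]

W1 = [[1.0, 0.0, -1.0],
      [0.0, 1.0, 1.0]]            # d_model x d_ff
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 2.0],
      [3.0, 4.0],
      [5.0, 6.0]]                 # d_ff x d_model
b2 = [0.5, -0.5]

y = ffn([1.0, -2.0], W1, b1, W2, b2)
Y = position_wise_ffn([[1.0, -2.0], [0.0, 0.0], [1.0, -2.0]], W1, b1, W2, b2)
```

For the input `[1.0, -2.0]` the hidden pre-activations are `[1, -2, -3]`, the ReLU keeps only the first, and the output is `[1.5, 1.5]`. Rows of `Y` with equal inputs are equal, since positions do not interact inside the FFN.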

3.4. Residual Connections and Layer Normalization

Each sub-layer (self-attention or FFN) in an encoder or decoder layer has a residual connection around it, followed by layer normalization. The output of a sub-layer is:

\text{Output} = \text{LayerNorm}(x + \text{Sublayer}(x))

Residual Connection ( $x + \text{Sublayer}(x)$ ): Helps prevent vanishing gradients in deep networks and allows information to propagate more easily.
Layer Normalization (LayerNorm): Normalizes the activations of a layer across the feature dimension for each sample independently. For a vector $x$ (representing activations for a single position), LayerNorm is: $\text{LayerNorm}(x) = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} \odot \gamma + \beta$ where $\mu$ and $\sigma^2$ are the mean and variance of the elements in $x$ , $\epsilon$ is a small constant for numerical stability, and $\gamma$ (scale) and $\beta$ (shift) are learnable parameters.

(Note: Pre-LN variant applies LayerNorm before the sublayer: $x + \text{Sublayer}(\text{LayerNorm}(x))$ ).
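
The LayerNorm formula and the two residual arrangements can be sketched as follows. This is an illustrative plain-Python version; it assumes the variance is the population variance (dividing by the number of elements), and the names `post_ln` and `pre_ln` are made up for the example.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """LayerNorm over the features of a single position."""
    mu = sum(x) / len(x)
    var = sum((xi - mu) ** 2 for xi in x) / len(x)   # population variance (assumption)
    return [(xi - mu) / math.sqrt(var + eps) * g + b
            for xi, g, b in zip(x, gamma, beta)]

def post_ln(x, sublayer, gamma, beta):
    """Original arrangement: LayerNorm(x + Sublayer(x))."""
    residual = [a + b for a, b in zip(x, sublayer(x))]
    return layer_norm(residual, gamma, beta)

def pre_ln(x, sublayer, gamma, beta):
    """Pre-LN arrangement: x + Sublayer(LayerNorm(x))."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x, gamma, beta)))]

x = [1.0, 2.0, 3.0, 4.0]
gamma, beta = [1.0] * 4, [0.0] * 4       # scale and shift at their neutral values

y = layer_norm(x, gamma, beta)
y_mean = sum(y) / len(y)
y_var = sum((v - y_mean) ** 2 for v in y) / len(y)

zero_sublayer = lambda v: [0.0] * len(v)  # a sublayer that contributes nothing
post = post_ln(x, zero_sublayer, gamma, beta)
pre = pre_ln(x, zero_sublayer, gamma, beta)
```

With neutral $\gamma$ and $\beta$ the normalized vector has mean zero and variance close to one. The zero sublayer makes the difference between the variants visible: Pre-LN passes `x` through unchanged along the residual path, while the original arrangement still normalizes it.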

3.5. Encoder and Decoder Stacks

Encoder: Composed of $N$ identical layers. Each layer has two sub-layers: a multi-head self-attention mechanism and a position-wise FFN.
Decoder: Also composed of $N$ identical layers. In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack (cross-attention). The self-attention sub-layer in the decoder is modified (masked self-attention) to prevent positions from attending to subsequent positions, ensuring autoregressive generation.

Masked Self-Attention: In the decoder, to maintain the autoregressive property (i.e., prediction of the current token can only depend on previous tokens), future positions are masked out in the self-attention mechanism. This is typically done by adding $-\infty$ to the scaled dot-product scores corresponding to future positions before the softmax, effectively making their attention weights zero. If $M$ is the mask matrix where $M_{ij} = 0$ if attention is allowed and $M_{ij} = -\infty$ if not, then:

\text{MaskedAttention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}} + M \right)V
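
A minimal sketch of masked attention in plain Python, using `float("-inf")` for the masked entries. The helper names and toy matrices are assumptions for illustration. The final lines check the autoregressive property: changing the last token must not affect the outputs at earlier positions.

```python
import math

def causal_mask(n):
    """M[i][j] = 0 if position i may attend to j (j <= i), else -inf."""
    return [[0.0 if j <= i else float("-inf") for j in range(n)] for i in range(n)]

def masked_attention(Q, K, V, M):
    """softmax(Q K^T / sqrt(d_k) + M) V, returning (O, A)."""
    d_k = len(K[0])
    O, A = [], []
    for q, mask_row in zip(Q, M):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) + m
                  for k, m in zip(K, mask_row)]
        shift = max(scores)                          # finite: the diagonal is allowed
        exps = [math.exp(s - shift) for s in scores] # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights = [e / total for e in exps]
        A.append(weights)
        O.append([sum(w * v[c] for w, v in zip(weights, V))
                  for c in range(len(V[0]))])
    return O, A

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
M = causal_mask(3)
O, A = masked_attention(X, X, V, M)

# Change only the last token's key and value.
K_changed = [X[0], X[1], [9.0, -9.0]]
V_changed = [V[0], V[1], [100.0, 100.0]]
O_changed, _ = masked_attention(X, K_changed, V_changed, M)
```

The weight matrix `A` is lower triangular, the first position can only attend to itself, and the first two output rows are unaffected by the change to the third token.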

4. Mathematical Properties

4.1. Computational Complexity

Self-Attention: For a sequence of length $n$ and representation dimension $d$ , the dominant computation in self-attention is the $QK^T$ matrix multiplication, which is $O(n^2 d)$ . The multiplication with $V$ is $O(n^2 d_v)$ , which is of the same order when $d_v \approx d$ . This quadratic dependence on sequence length is a major computational bottleneck for long sequences.
Position-wise FFN: Applied to $n$ positions, with dimensions $d \to d_{ff} \to d$ . This is $O(n \cdot (d \cdot d_{ff} + d_{ff} \cdot d)) = O(n \cdot d \cdot d_{ff})$ .
Multi-Head Attention: The complexity remains similar to single-head attention because the $h$ heads work with reduced dimensions ( $d_k = d/h$ ), so their combined cost matches that of one head of dimension $d$ . The projection costs are $O(n d^2)$ . The total is $O(n^2 d + n d^2)$ , which is dominated by the $O(n^2 d)$ term once $n$ exceeds $d$ .
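
The scaling behaviour can be made concrete by counting multiply-add operations for one layer. The following sketch is a rough count under stated assumptions: $d_k = d_v = d/h$ , biases, softmax and normalization are ignored, and the function names are illustrative.

```python
def attention_macs(n, d):
    """Multiply-adds for Q K^T (n * n * d) plus A V (n * n * d), over all heads."""
    return 2 * n * n * d

def projection_macs(n, d):
    """Multiply-adds for the query, key, value and output projections."""
    return 4 * n * d * d

def ffn_macs(n, d, d_ff):
    """Multiply-adds for the two linear maps d -> d_ff -> d at n positions."""
    return 2 * n * d * d_ff

d = 512
for n in (1024, 2048, 4096):
    print(n, attention_macs(n, d), projection_macs(n, d), ffn_macs(n, d, 4 * d))
```

Doubling the sequence length quadruples the attention term but only doubles the projection and FFN terms, which is the quadratic bottleneck described above.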

The overall FLOPs for training a Transformer are often approximated as $6 \times (\text{Number of Tokens}) \times (\text{Number of Parameters})$ , ignoring attention FLOPs for shorter contexts. The attention FLOPs become dominant when the sequence length $T \gtrsim 8D$ , where $D$ is $d_{\text{model}}$ .

4.2. Differentiability and Gradients

All components of the Transformer are differentiable, allowing end-to-end training using backpropagation. The gradients of the attention mechanism with respect to $Q, K, V$ and the input $X$ (for self-attention) can be derived using standard matrix calculus.

Let $L$ be the loss function. The output of scaled dot-product attention is $O = A V$ , where $A = \text{softmax}(S)$ and $S = \frac{QK^T}{\sqrt{d_k}}$ . The gradients involve propagating through the softmax and matrix multiplications. For example (conceptually, actual derivations are involved):

$\frac{\partial L}{\partial V} = A^T \frac{\partial L}{\partial O}$
$\frac{\partial L}{\partial A} = \frac{\partial L}{\partial O} V^T$
The gradient $\frac{\partial L}{\partial S}$ involves the Jacobian of the softmax function.
$\frac{\partial L}{\partial Q} = \frac{\partial L}{\partial S} K / \sqrt{d_k}$
$\frac{\partial L}{\partial K} = \left(\frac{\partial L}{\partial S}\right)^T Q / \sqrt{d_k}$ (transposition due to $K^T$ )
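
The gradient formulas above can be checked numerically. The sketch below, in plain Python, implements the backward pass and compares a few entries against central finite differences. It assumes a simple linear loss $L = \sum_{ij} G_{ij} O_{ij}$ with a fixed matrix $G$ (so that $\partial L / \partial O = G$ ), and it writes out the softmax Jacobian row by row as $\partial L/\partial S_{ij} = A_{ij}\,(\partial L/\partial A_{ij} - \sum_k A_{ik}\, \partial L/\partial A_{ik})$ . All names and numbers are illustrative.

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def forward(Q, K, V):
    d_k = len(K[0])
    S = [[s / math.sqrt(d_k) for s in row] for row in matmul(Q, transpose(K))]
    A = []
    for row in S:
        shift = max(row)
        exps = [math.exp(x - shift) for x in row]
        total = sum(exps)
        A.append([e / total for e in exps])
    return matmul(A, V), A

def backward(Q, K, V, A, dO):
    scale = 1.0 / math.sqrt(len(K[0]))
    dV = matmul(transpose(A), dO)                 # dL/dV = A^T dL/dO
    dA = matmul(dO, transpose(V))                 # dL/dA = dL/dO V^T
    dS = []
    for a_row, da_row in zip(A, dA):              # softmax Jacobian, per row
        inner = sum(a * da for a, da in zip(a_row, da_row))
        dS.append([a * (da - inner) for a, da in zip(a_row, da_row)])
    dQ = [[scale * x for x in row] for row in matmul(dS, K)]             # dL/dS K / sqrt(d_k)
    dK = [[scale * x for x in row] for row in matmul(transpose(dS), Q)]  # dL/dS^T Q / sqrt(d_k)
    return dQ, dK, dV

Q = [[0.2, -0.5], [1.0, 0.3]]
K = [[0.1, 0.4], [-0.7, 0.2], [0.5, -0.3]]
V = [[1.0, -1.0], [0.5, 2.0], [-0.3, 0.8]]
G = [[1.0, -2.0], [0.5, 0.25]]                    # fixed upstream gradient dL/dO

def loss():
    O, _ = forward(Q, K, V)
    return sum(g * o for g_row, o_row in zip(G, O) for g, o in zip(g_row, o_row))

def numeric_grad(M, i, j, eps=1e-6):
    """Central finite difference of the loss w.r.t. M[i][j] (M is Q, K or V)."""
    original = M[i][j]
    M[i][j] = original + eps
    up = loss()
    M[i][j] = original - eps
    down = loss()
    M[i][j] = original
    return (up - down) / (2 * eps)

_, A = forward(Q, K, V)
dQ, dK, dV = backward(Q, K, V, A, G)
```

Comparing, for example, `dQ[0][1]` with `numeric_grad(Q, 0, 1)` gives agreement up to finite-difference error, and likewise for entries of `dK` and `dV`.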

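The "Reversed Attention" perspective (Katz and Wolf) provides insights into the backward pass. Let $u$ denote the gradient arriving from upstream layers, i.e. the vector in a vector-Jacobian product (VJP). For the $QK^T$ operation, with $u \in \R^{m\times n}$ , the backward pass is $\text{VJP}_Q = u K$ and $\text{VJP}_K = u^T Q$ , which has the same "weights times vectors" structure as attention itself. For the value mixing $AV$ , with $u \in \R^{m\times d_v}$ , the backward pass is $\text{VJP}_V = A^T u$ and $\text{VJP}_A = u V^T$ . These agree with the gradient expressions above, up to the factor $1/\sqrt{d_k}$ .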

4.3. Permutation Invariance/Equivariance

Standard QKV attention is equivariant with respect to re-ordering (permuting) the queries and invariant to re-ordering the key-value pairs. If $P_m$ and $P_n$ are permutation matrices: $\text{Attention}(P_m Q, P_n K, P_n V) = P_m \text{Attention}(Q, K, V)$ . Self-attention on an input matrix $X$ is permutation equivariant: if rows of $X$ are permuted by $P$ , the rows of the output are also permuted by $P$ . This is why positional encodings are necessary to provide order information.
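
Both statements can be verified numerically on a toy example. In the sketch below, permutations are represented as index lists rather than permutation matrices, and the self-attention check uses identity projections ( $Q = K = V = X$ ) for brevity; these choices and all names are assumptions made for illustration.

```python
import math

def attention(Q, K, V):
    """Scaled dot-product attention, softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        shift = max(scores)
        exps = [math.exp(s - shift) for s in scores]
        total = sum(exps)
        out.append([sum((e / total) * v[c] for e, v in zip(exps, V))
                    for c in range(len(V[0]))])
    return out

def permute_rows(M, perm):
    """Row permutation: row r of the result is row perm[r] of M."""
    return [M[p] for p in perm]

def max_abs_diff(A, B):
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))

Q = [[0.3, -1.0], [1.2, 0.4], [-0.5, 0.9]]
K = [[1.0, 0.0], [0.2, -0.7], [-1.1, 0.6], [0.4, 0.4]]
V = [[1.0, 2.0], [3.0, -1.0], [0.5, 0.5], [-2.0, 4.0]]
perm_q = [2, 0, 1]
perm_kv = [3, 1, 0, 2]

# Attention(P_m Q, P_n K, P_n V) versus P_m Attention(Q, K, V)
lhs = attention(permute_rows(Q, perm_q),
                permute_rows(K, perm_kv), permute_rows(V, perm_kv))
rhs = permute_rows(attention(Q, K, V), perm_q)

# Self-attention with identity projections: permuting X permutes the output.
X = K
self_lhs = attention(*[permute_rows(X, perm_kv)] * 3)
self_rhs = permute_rows(attention(X, X, X), perm_kv)
```

The two sides agree up to floating-point rounding (the order of summation changes when the key-value pairs are reordered).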

4.4. Stability and Infinite Limits

Research like "The Shaped Transformer" explores the behavior of Transformers in infinite depth and width limits.

Signal Propagation: Proper initialization (like scaled He/Glorot) and LayerNorm are crucial for stable signal propagation.
Rank Collapse: Deep Transformers can suffer from rank collapse in attention matrices, where attention patterns become overly uniform or sparse.
Connection to SDEs: In infinite depth limits, the evolution of token representations can sometimes be described by Stochastic Differential Equations (SDEs), providing a continuous-time perspective.

5. Advanced Mathematical Perspectives

5.1. Connection to Kernel Methods

Attention mechanisms can be related to kernel methods. The compatibility score $f(q, k_j)$ can be seen as a learned kernel function. For scaled dot-product attention, the score is $\frac{q^T k_j}{\sqrt{d_k}}$ . If queries and keys are transformations of some inputs $x_i, x_j$ via functions $\phi_Q, \phi_K$ (e.g., $q = \phi_Q(x_i)$ , $k = \phi_K(x_j)$ ), then the score is $\frac{\phi_Q(x_i)^T \phi_K(x_j)}{\sqrt{d_k}}$ . This is an inner product in a feature space. If $\phi_Q = \phi_K = \phi$ , it defines a Mercer kernel $K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$ . The work "Approximation of relation functions and attention mechanisms" (Altabaa & Lafferty) shows that attention's inner product $\langle \phi_\theta(q), \psi_\theta(x_i) \rangle$ can approximate general (even asymmetric) relation functions, connecting attention to reproducing kernel Hilbert/Banach spaces. The "kernel" is implicitly learned by the transformations $W^Q, W^K$ .

5.2. Metric Learning and Heat Diffusion Perspective

The "Metric-Attention" mechanism (ArXiv:2412.18288) decomposes self-attention into:

Learning a pseudo-metric $d(x_i, x_j)$ between tokens.
An information propagation step using this metric.

The score is given by $S_{ij} = -\frac{1}{2\sigma^2} d(x_i W^Q, x_j W^K)^2$ , where $d(\cdot, \cdot)$ is typically the Euclidean distance. With this choice the unnormalized weights $\exp(S_{ij})$ are Gaussian kernels, and the formulation is related to drift-diffusion processes or heat equations, where attention weights correspond to the Green's function of a heat equation on a graph.
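
To see how a distance-based score relates to the dot-product score, note the elementary identity $-\frac{\|q-k\|^2}{2\sigma^2} = \frac{q\cdot k}{\sigma^2} - \frac{\|q\|^2}{2\sigma^2} - \frac{\|k\|^2}{2\sigma^2}$ . The $\|q\|^2$ term is the same for all keys and cancels in the softmax; if all keys have the same norm, the $\|k\|^2$ term cancels as well. The sketch below checks this numerically in plain Python. It illustrates this identity only, under the assumption of equal-norm keys and Euclidean $d$ , and is not taken from the cited paper; all names are made up.

```python
import math

def softmax(u):
    shift = max(u)
    exps = [math.exp(x - shift) for x in u]
    total = sum(exps)
    return [e / total for e in exps]

def gaussian_scores(q, keys, sigma):
    """S_j = -||q - k_j||^2 / (2 sigma^2), i.e. the log of a Gaussian kernel."""
    return [-sum((a - b) ** 2 for a, b in zip(q, k)) / (2 * sigma ** 2) for k in keys]

def dot_scores(q, keys, sigma):
    """S_j = (q . k_j) / sigma^2."""
    return [sum(a * b for a, b in zip(q, k)) / sigma ** 2 for k in keys]

q = [0.3, -1.2]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.6, 0.8]]   # all of norm 1
sigma = 0.7

w_gauss = softmax(gaussian_scores(q, keys, sigma))
w_dot = softmax(dot_scores(q, keys, sigma))
```

For equal-norm keys the two weight vectors coincide, so dot-product attention with scale $1/\sigma^2$ acts like normalized Gaussian-kernel weighting in this special case.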

5.3. Dynamical Systems and Gradient Flow (Conceptual)

Some research aims to interpret the layer-by-layer processing in Transformers through the lens of dynamical systems or gradient flows.

Dynamical Systems: Each layer transforms the representation of tokens, and the sequence of layers can be seen as discretizing a continuous-time dynamical system $dx_i(t)/dt = F(x_1(t), \dots, x_n(t))$ . The MIT paper "A Mathematical Perspective on Transformers" (ArXiv:2312.10794) models Transformers as interacting particle systems where token embeddings evolve on a sphere.
Gradient Flow: Self-attention dynamics, particularly with layer normalization, have been analyzed as gradient flows of an energy function in spaces of probability measures (e.g., ArXiv:2501.03096, Burger et al.). This suggests that attention layers might be implicitly minimizing some objective. (Details are complex and depend on specific assumptions).

5.4. Generalized Attention Mechanisms (Conceptual)

The basic softmax(scores)V structure can be generalized.

Generalized Probabilistic Attention Mechanism (GPAM) (ArXiv:2410.15578, Heo & Choi) aimed to allow negative attention weights while preserving the total sum, potentially alleviating rank collapse. This would fit the $o = \sum_j \alpha_j g(v_j)$ form where the $\alpha_j$ are not restricted to $(0,1)$ but still satisfy $\sum_j \alpha_j = 1$ .
Equivalence to FFNs: Some research (e.g., ArXiv:2501.00823 on Generalized Cross-Attention) suggests that FFNs can be viewed as a specialized case of attention, or that generalized attention forms can subsume FFN-like computations, hinting at deeper architectural unifications.

6. Conclusion

The Transformer model, while empirically powerful, is grounded in a rich set of mathematical concepts spanning linear algebra, calculus, probability, and increasingly, connections to fields like kernel methods, optimal transport, and dynamical systems. The core attention mechanism, particularly scaled dot-product attention, provides a flexible and efficient way to model dependencies within and between sequences. Ongoing research continues to explore its mathematical properties, limitations, and generalizations, pushing the boundaries of AI capabilities. Understanding this mathematical framework is crucial for both utilizing and advancing these transformative models.