Math description (Gemini)
1. Introduction to Attention
At its core, attention in machine learning can be conceptualized as a mechanism that allows a model to dynamically weigh the importance of different parts of an input sequence (or set of features) when producing an output. Instead of relying on a fixed-size hidden state to summarize all past information (as in traditional RNNs), attention provides a way to "look back" at the entire input sequence and selectively focus on the most relevant parts.
The general idea is that when producing an output at a certain position, the model assigns varying scores (attention weights) to all input positions. These weights determine how much influence each input position has on the output.
2. Mathematical Definition of Attention
2.1. Abstract Formulation
At its core, an attention mechanism computes an output representation for a given "query" by selectively aggregating information from a set of "value" representations. The selection and weighting process is guided by the interaction between the query and a corresponding set of "key" representations.
So assume we have a query vector and a set of key-value pairs . The output is a weighted sum of transformed value vectors :
where:
-
. A query vector represents the current context for which an updated representation is sought. It poses a question to the set of keys: "Based on my current state, which pieces of information from the available values are most relevant?". What is relevant is determined by the (learned) attention scores . Often .
-
is a key vector associated with the -th input element (token embedding).
-
is a value vector associated with the -th input element.
-
is the output vector of (input) vector . It contains more semantic information.
-
The Value Transformation Function . The function
transforms the value vector into an output space of dimension before aggregation. In many widely used attention mechanisms, is the identity function, i.e., . However, can also represent a learnable linear transformation or a more complex function. For instance, in the original Transformer, the value matrix itself is obtained through a linear projection of input embeddings, effectively incorporating into the generation of .
-
Attention weights. _ is the attention weight for the -th input element, indicating its relevance to the query . The attention weights are typically non-negative and sum to one (), thus forming a probability distribution over the value vectors.
-
The Scoring/Alignment/Attention Function . The attention weights are derived from alignment or attention scores , which quantify the compatibility or relevance between the query and each key . This is captured by a scoring function , which maps each query to an attention score . To produce the attention weights (probability distribution), typically softmax is applied:
Putting everything together, we get
Consider a row vector and row vectors , arranged as a matrix , and row vectors arranged as matrix .
- where denotes the scalar product.
For a row vector and a matrix we define the softmax functions as follows:
and
is the application of to each row of .
Consider pairs of key row vectors and value row vectors (think of fixed, but change with input sentence). For a given query row vector an output vector is calculated by forming the weighted sum of the transformed value vectors:
where for each
The values are the attention scores between the query vector and the key vectors . For this reason, is also called scoring function or alignment function. Their normalised version (the ) are called attention weights and determine which transformed vectors are selected (or attended to) in the weighted sum. The function is called the value transformation function.
Think of as a new embedding of that has more semantic meaning within the input, as it selectively mixes the output values according to the importance they have for the query. The attention/importance scores are computed by comparing the key vectors with the query vector.
Often, the key row vectors are arranged in a matrix and the value row vectors in a matrix . If there are a total of query row vectors, they are arranged in a matrix .
The attention weights form a probability distribution over the transformed value vectors . That is, for each and
For a self-attention mechanism, the key, value and query vectors are derived from a sequence of token embeddings (the input sequence). Assume there are token embeddings . Then there are three matrices and such that
Or, using key, value and query matrices, and the embedded vectors embedded as rows in the matrix , we have
- The token embeddings do not have to be the ones fed into the first encoder layer, as the output of each encoder produces other token embeddings, which are then fed into the next layer.
- The matrices and represents information in the token embedding . The matrices and are learned, and once learned do not change.
The choice of the global functions and leads to different specific attention mechanism. One of the most prominent one is Scaled Dot-Product Attention, where is the dot product between the vectors and , and scaled by some content for keeping the numbers in a good range.
Consider pairs of key vectors and value vectors . The query vectors are also in . An attention mechanism where the scoring function is the scalar product, and the value transformation function is the identity function, is called a scaled dot-product attention mechansim. Thus, for and any we have
Consider pairs of key row vectors and row value vectors arranged in matrices and , and query vectors arranged in the matrix . In the case of scaled dot-product attention, we have
- are the attention scores between query and the key vectors.
- are the attention weights between query and the key vectors.
- is the weighted sum of the value vectors, weighted by the attention weights between query and the key vectors.
- Row of the matrix contains the the attention weights between query and the key vectors.
- Row of the matrix is the weighted sum of the value vectors, weighted by the attention weights between query and the key vectors.
2.2. Scaled Dot-Product Attention
The most prominent attention mechanism used in Transformers is the Scaled Dot-Product Attention. In this model:
- The queries , keys , and values are packed into matrices. Let , , and , where is the number of queries, is the number of key-value pairs, is the dimension of queries and keys, and is the dimension of values.
- The compatibility function is a dot product between the query and key vectors.
- A scaling factor of is applied to the dot product to prevent very large values which could lead to vanishing gradients in the softmax function.
- The function is the identity function.
The matrix form of Scaled Dot-Product Attention is:
Here:
- is the matrix of attention scores (unnormalized weights). Each element is the dot product of the -th query with the -th key.
- is the scaling factor.
- The function is applied row-wise to the scaled attention scores, ensuring that the weights for each query sum to 1.
- The resulting matrix of weights (often denoted ) has dimensions .
- This weight matrix is then multiplied by the value matrix to produce the output . Each row of is a weighted sum of the rows of .
Self-Attention: A special case of attention is self-attention, where the queries, keys, and values are all derived from the same input sequence. For an input sequence represented by a matrix (where is sequence length, is embedding dimension), we derive using linear transformations: , , , where and are learnable weight matrices.
2.3. Other Attention Mechanisms
-
Additive Attention (Bahdanau et al.): The compatibility score is computed by a feed-forward network with a single hidden layer:
where , are weight matrices, is a bias term, and is a weight vector. This is often used in sequence-to-sequence models with RNNs.
-
Multiplicative Attention (Luong et al.): Several forms exist. A common one is:
or simply (dot-product, similar to Scaled Dot-Product but without scaling or specific matrix formulation for Transformers initially). The "general" form introduces a learnable matrix .
3. Transformer Architecture Components
The Transformer model ("Attention Is All You Need", Vaswani et al., 2017) is built upon the self-attention mechanism.
3.1. Input Embedding and Positional Encoding
- Token Embedding: Input tokens (words or sub-words) are converted into dense vectors of dimension using an embedding layer. For a token , its embedding is .
- Positional Encoding (PE): Since self-attention is permutation-equivariant (i.e., it doesn't inherently know the order of tokens), positional information must be injected. The original Transformer uses fixed sinusoidal positional encodings: where is the position of the token in the sequence, and is the dimension index (). This vector has the same dimension as the embeddings and is added to the token embeddings: . An important property is that for any fixed offset , can be represented as a linear function of .
3.2. Multi-Head Attention (MHA)
Instead of performing a single attention function with -dimensional keys, values, and queries, MHA linearly projects the queries, keys, and values times with different, learned linear projections to , , and dimensions, respectively (typically ). Attention is then performed in parallel for each of these projected versions ("heads"). The outputs of the heads are concatenated and once again projected, resulting in the final values.
where each head is:
The projection matrices are , , , and . MHA allows the model to jointly attend to information from different representation subspaces at different positions.
3.3. Position-wise Feed-Forward Networks (FFN)
Each encoder and decoder layer contains a fully connected feed-forward network, which is applied to each position separately and identically. This consists of two linear transformations with a ReLU activation in between:
The input and output dimensionality is , and the inner-layer dimensionality is typically larger (e.g., ).
3.4. Residual Connections and Layer Normalization
Each sub-layer (self-attention or FFN) in an encoder or decoder layer has a residual connection around it, followed by layer normalization. The output of a sub-layer is:
- Residual Connection (): Helps prevent vanishing gradients in deep networks and allows information to propagate more easily.
- Layer Normalization (LayerNorm): Normalizes the activations of a layer across the feature dimension for each sample independently. For a vector (representing activations for a single position), LayerNorm is: where and are the mean and variance of the elements in , is a small constant for numerical stability, and (scale) and (shift) are learnable parameters.
(Note: Pre-LN variant applies LayerNorm before the sublayer: ).
3.5. Encoder and Decoder Stacks
- Encoder: Composed of identical layers. Each layer has two sub-layers: a multi-head self-attention mechanism and a position-wise FFN.
- Decoder: Also composed of identical layers. In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack (cross-attention). The self-attention sub-layer in the decoder is modified (masked self-attention) to prevent positions from attending to subsequent positions, ensuring autoregressive generation.
Masked Self-Attention: In the decoder, to maintain the autoregressive property (i.e., prediction of the current token can only depend on previous tokens), future positions are masked out in the self-attention mechanism. This is typically done by adding to the scaled dot-product scores corresponding to future positions before the softmax, effectively making their attention weights zero. If is the mask matrix where if attention is allowed and if not, then:
4. Mathematical Properties
4.1. Computational Complexity
- Self-Attention: For a sequence of length and representation dimension , the dominant computation in self-attention is the matrix multiplication, which is . The multiplication with is also (if ). This quadratic dependence on sequence length is a major computational bottleneck for long sequences.
- Position-wise FFN: Applied to positions, with dimensions . This is .
- Multi-Head Attention: The complexity remains similar to single-head attention because computations are done in parallel for heads, but with reduced dimensions (). The projection costs are . The total is still dominated by .
The overall FLOPs for training a Transformer are often approximated as , ignoring attention FLOPs for shorter contexts. The attention FLOPs become dominant when sequence length where is .
4.2. Differentiability and Gradients
All components of the Transformer are differentiable, allowing end-to-end training using backpropagation. The gradients of the attention mechanism with respect to and the input (for self-attention) can be derived using standard matrix calculus.
Let be the loss function. The output of scaled dot-product attention is , where and . The gradients involve propagating through the softmax and matrix multiplications. For example (conceptually, actual derivations are involved):
- The gradient involves the Jacobian of the softmax function.
- (transposition due to )
The "Reversed Attention" perspective (Katz and Wolf) provides insights into the backward pass. For a VJP (gradient from upstream layers), the backward pass for the operation can be interpreted as an attention mechanism itself: and . The backward pass of value mixing is and .
4.3. Permutation Invariance/Equivariance
Standard QKV attention is equivariant with respect to re-ordering (permuting) the queries and invariant to re-ordering the key-value pairs. If and are permutation matrices: . Self-attention on an input matrix is permutation equivariant: if rows of are permuted by , the rows of the output are also permuted by . This is why positional encodings are necessary to provide order information.
4.4. Stability and Infinite Limits
Research like "The Shaped Transformer" explores the behavior of Transformers in infinite depth and width limits.
- Signal Propagation: Proper initialization (like scaled He/Glorot) and LayerNorm are crucial for stable signal propagation.
- Rank Collapse: Deep Transformers can suffer from rank collapse in attention matrices, where attention patterns become overly uniform or sparse.
- Connection to SDEs: In infinite depth limits, the evolution of token representations can sometimes be described by Stochastic Differential Equations (SDEs), providing a continuous-time perspective.
5. Advanced Mathematical Perspectives
5.1. Connection to Kernel Methods
Attention mechanisms can be related to kernel methods. The compatibility score can be seen as a learned kernel function. For scaled dot-product attention, the score is . If queries and keys are transformations of some inputs via functions (e.g., , ), then the score is . This is an inner product in a feature space. If , it defines a Mercer kernel . The work "Approximation of relation functions and attention mechanisms" (Altabaa & Lafferty) shows that attention's inner product can approximate general (even asymmetric) relation functions, connecting attention to reproducing kernel Hilbert/Banach spaces. The "kernel" is implicitly learned by the transformations .
5.2. Metric Learning and Heat Diffusion Perspective
The "Metric-Attention" mechanism (ArXiv:2412.18288) decomposes self-attention into:
- Learning a pseudo-metric between tokens.
- An information propagation step using this metric. The score is given by , where is typically a squared Euclidean distance. This formulation connects attention to Gaussian kernels if is Euclidean distance and is related to drift-diffusion processes or heat equations, where attention weights correspond to the Green's function of a heat equation on a graph.
5.3. Dynamical Systems and Gradient Flow (Conceptual)
Some research aims to interpret the layer-by-layer processing in Transformers through the lens of dynamical systems or gradient flows.
- Dynamical Systems: Each layer transforms the representation of tokens, and the sequence of layers can be seen as discretizing a continuous-time dynamical system . The MIT paper "A Mathematical Perspective on Transformers" (ArXiv:2312.10794) models Transformers as interacting particle systems where token embeddings evolve on a sphere.
- Gradient Flow: Self-attention dynamics, particularly with layer normalization, have been analyzed as gradient flows of an energy function in spaces of probability measures (e.g., ArXiv:2501.03096, Burger et al.). This suggests that attention layers might be implicitly minimizing some objective. (Details are complex and depend on specific assumptions).
5.4. Generalized Attention Mechanisms (Conceptual)
The basic softmax(scores)V structure can be generalized.
- Generalized Probabilistic Attention Mechanism (GPAM) (ArXiv:2410.15578, Heo & Choi) aimed to allow negative attention scores while preserving the total sum, potentially alleviating rank collapse. This would fit the form where are not restricted to but .
- Equivalence to FFNs: Some research (e.g., ArXiv:2501.00823 on Generalized Cross-Attention) suggests that FFNs can be viewed as a specialized case of attention, or that generalized attention forms can subsume FFN-like computations, hinting at deeper architectural unifications.
6. Conclusion
The Transformer model, while empirically powerful, is grounded in a rich set of mathematical concepts spanning linear algebra, calculus, probability, and increasingly, connections to fields like kernel methods, optimal transport, and dynamical systems. The core attention mechanism, particularly scaled dot-product attention, provides a flexible and efficient way to model dependencies within and between sequences. Ongoing research continues to explore its mathematical properties, limitations, and generalizations, pushing the boundaries of AI capabilities. Understanding this mathematical framework is crucial for both utilizing and advancing these transformative models.