Math description (ChatGPT)
Mathematical Formulation of the Transformer Architecture
The Transformer architecture can be described in abstract mathematical terms by focusing on its core mechanism: the attention operation. At a high level, an attention function maps a given query vector and a set of key-value vector pairs to an output vector. The output is computed as a weighted combination (typically a convex combination) of the value vectors, where each weight is determined by a compatibility function comparing the query with the corresponding key. This mechanism allows the model to selectively focus on different parts of an input sequence (or another sequence) when processing information.
The Attention Mechanism: Abstract Formulation
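Let $\mathcal{Q}$, $\mathcal{K}$, and $\mathcal{V}$ be vector spaces for queries, keys, and values respectively. Given a query $q \in \mathcal{Q}$, keys $k_1, \dots, k_n \in \mathcal{K}$, and associated values $v_1, \dots, v_n \in \mathcal{V}$, the attention mechanism defines the output as:
$$\mathrm{Attn}(q, \{k_i\}, \{v_i\}) = \sum_{i=1}^{n} \alpha(q, k_i)\, f(v_i),$$
where $\alpha$ is a weighting function and $f$ is often the identity.
Typically, $\alpha$ is a normalized scoring function such as:
$$\alpha(q, k_i) = \frac{\exp\big(s(q, k_i)\big)}{\sum_{j=1}^{n} \exp\big(s(q, k_j)\big)},$$
where $s : \mathcal{Q} \times \mathcal{K} \to \mathbb{R}$ measures compatibility. The weights are then non-negative and sum to one, so with $f$ the identity the output is a weighted expectation of the values:
$$\mathrm{Attn}(q, \{k_i\}, \{v_i\}) = \mathbb{E}_{i \sim p(\cdot \mid q)}[v_i], \qquad p(i \mid q) = \alpha(q, k_i).$$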
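The abstract definition can be sketched directly in plain Python. This is a minimal illustration, not a reference implementation: the function names (`softmax_weights`, `attention`, `dot`) and the toy keys and values are hypothetical, the value map is taken to be the identity, and a dot product stands in for the compatibility function.

```python
import math

def softmax_weights(scores):
    """Turn raw compatibility scores into weights that are >= 0 and sum to 1."""
    m = max(scores)  # subtracting the max leaves the result unchanged but avoids overflow
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values, score):
    """Weighted combination of the values; weights come from score(query, key)."""
    weights = softmax_weights([score(query, k) for k in keys])
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return output, weights

def dot(a, b):
    """Assumed compatibility function: plain dot product."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical toy data: the query matches the first key better than the second.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
output, weights = attention([1.0, 0.0], keys, values, dot)
```

Because the weights form a probability distribution over the keys, `output` always lies in the convex hull of the value vectors, and it is pulled toward the value whose key best matches the query.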
Scaled Dot-Product Attention
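The Transformer uses scaled dot-product attention, which scores each query-key pair with the compatibility function:
$$s(q, k) = \frac{q^\top k}{\sqrt{d_k}},$$
where $d_k$ is the dimension of the query and key vectors. The output is computed as:
$$\mathrm{Attention}(q, \{k_i\}, \{v_i\}) = \sum_{i=1}^{n} \frac{\exp\big(q^\top k_i / \sqrt{d_k}\big)}{\sum_{j=1}^{n} \exp\big(q^\top k_j / \sqrt{d_k}\big)}\, v_i.$$
Matrix formulation, with queries stacked as the rows of $Q \in \mathbb{R}^{m \times d_k}$, keys as the rows of $K \in \mathbb{R}^{n \times d_k}$, values as the rows of $V \in \mathbb{R}^{n \times d_v}$, and the softmax applied row-wise:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V.$$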
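The matrix form maps almost one-to-one onto NumPy. The sketch below assumes queries, keys, and values are stored as rows; the helper names and the random toy shapes are illustrative choices, not part of the original text.

```python
import numpy as np

def softmax(x, axis=-1):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for Q: (m, d_k), K: (n, d_k), V: (n, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (m, n) compatibility scores
    weights = softmax(scores, axis=-1)   # each row is a distribution over the n keys
    return weights @ V, weights          # output has shape (m, d_v)

# Hypothetical toy shapes: 3 queries, 5 key-value pairs, d_k = 4, d_v = 2.
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(5, 4))
V = rng.normal(size=(5, 2))
out, weights = scaled_dot_product_attention(Q, K, V)
```

Each row of `out` equals the per-query weighted sum of the rows of `V`, so the matrix product is just the single-query formula evaluated for all queries at once.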
Self-Attention and Cross-Attention
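- Self-attention: queries, keys, and values all come from the same sequence $X$:
  $$\mathrm{SelfAttn}(X) = \mathrm{Attention}(X, X, X)$$
- Cross-attention (encoder-decoder attention): queries come from one source (the decoder states $D$), keys and values from another (the encoder outputs $H$):
  $$\mathrm{CrossAttn}(D, H) = \mathrm{Attention}(D, H, H)$$
Masked self-attention preserves causality in the decoder by adding a mask $M$ to the scores, so that position $i$ can only attend to positions $j \le i$:
$$\mathrm{MaskedAttention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}} + M\right) V, \qquad M_{ij} = \begin{cases} 0 & j \le i, \\ -\infty & j > i. \end{cases}$$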
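The three variants differ only in where the inputs come from and whether a mask is applied. A minimal NumPy sketch, assuming no learned projections and using hypothetical names (`attention`, `causal_mask`) and random toy data:

```python
import numpy as np

def attention(Q, K, V, mask=None):
    """Scaled dot-product attention; `mask` is a boolean array, True = blocked."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    if mask is not None:
        scores = np.where(mask, -np.inf, scores)  # blocked entries end up with weight 0
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

def causal_mask(n):
    """True where position i would look at a later position j > i."""
    return np.triu(np.ones((n, n), dtype=bool), k=1)

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))  # a sequence attending to itself (toy data)
D = rng.normal(size=(4, 8))  # decoder states (toy data)
H = rng.normal(size=(6, 8))  # encoder outputs (toy data)

self_out, self_w = attention(X, X, X)                            # self-attention
cross_out, cross_w = attention(D, H, H)                          # cross-attention
masked_out, masked_w = attention(D, D, D, mask=causal_mask(4))   # masked self-attention
```

With the causal mask, the weight matrix is lower triangular, and the first position can only attend to itself, so its output is its own value vector.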
Multi-Head Attention
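Multi-head attention runs $h$ attention operations in parallel, each on its own learned linear projection of the inputs. For each head $i = 1, \dots, h$:
$$Q_i = Q W_i^Q, \qquad K_i = K W_i^K, \qquad V_i = V W_i^V,$$
and
$$\mathrm{head}_i = \mathrm{Attention}(Q_i, K_i, V_i).$$
The head outputs are concatenated and projected back to the model dimension:
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^O,$$
where $W_i^Q, W_i^K \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_v}$, and $W^O \in \mathbb{R}^{h d_v \times d_{\text{model}}}$.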
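The project, attend, concatenate, project pattern can be sketched as follows. The weights here are random placeholders rather than trained parameters, and the sizes ($d_{\text{model}} = 8$, $h = 2$, $d_k = d_v = 4$) are arbitrary assumptions chosen to keep the example small.

```python
import numpy as np

def softmax(x):
    """Row-wise softmax with the usual max shift."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention for row-stacked inputs."""
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o):
    """W_q, W_k, W_v hold one projection matrix per head along their first axis."""
    heads = []
    for W_q_i, W_k_i, W_v_i in zip(W_q, W_k, W_v):
        heads.append(attention(Q @ W_q_i, K @ W_k_i, V @ W_v_i))  # head_i: (m, d_v)
    return np.concatenate(heads, axis=-1) @ W_o                    # (m, d_model)

rng = np.random.default_rng(0)
d_model, h = 8, 2
d_k = d_v = d_model // h
W_q = rng.normal(size=(h, d_model, d_k))
W_k = rng.normal(size=(h, d_model, d_k))
W_v = rng.normal(size=(h, d_model, d_v))
W_o = rng.normal(size=(h * d_v, d_model))

X = rng.normal(size=(5, d_model))  # 5 positions (toy data)
Y = multi_head_attention(X, X, X, W_q, W_k, W_v, W_o)
```

Choosing $d_k = d_v = d_{\text{model}} / h$ keeps the concatenated width equal to $d_{\text{model}}$, so the total cost is comparable to a single full-width head.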
Transformer Encoder: Layer Structure
Each encoder layer comprises:
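- Multi-head self-attention with a residual connection:
  $$Z_1 = X + \mathrm{MultiHead}(X, X, X)$$
- A position-wise feed-forward network (FFN), applied to each position independently, with a residual connection:
  $$Z_2 = Z_1 + \mathrm{FFN}(Z_1), \qquad \mathrm{FFN}(z) = \max(0,\, z W_1 + b_1)\, W_2 + b_2$$
resulting in:
$$\mathrm{EncoderLayer}(X) = Z_2.$$
(The original Transformer additionally wraps each residual sum in layer normalization, $\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$, which this formulation leaves out.)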
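The two sublayers above can be composed into one function. This is a sketch under stated assumptions: weights are random placeholders stored in a hypothetical `params` dictionary, the FFN uses a ReLU, and layer normalization, dropout, and positional encodings are left out.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, p):
    """Multi-head self-attention: queries, keys, and values are all projections of X."""
    heads = []
    for W_q, W_k, W_v in zip(p["W_q"], p["W_k"], p["W_v"]):
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        heads.append(softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V)
    return np.concatenate(heads, axis=-1) @ p["W_o"]

def ffn(Z, p):
    """Position-wise FFN: the same two-layer map is applied to every row of Z."""
    return np.maximum(0.0, Z @ p["W_1"] + p["b_1"]) @ p["W_2"] + p["b_2"]

def encoder_layer(X, p):
    Z1 = X + self_attention(X, p)   # attention sublayer + residual
    return Z1 + ffn(Z1, p)          # FFN sublayer + residual

rng = np.random.default_rng(0)
d_model, d_ff, h = 8, 16, 2
d_k = d_model // h
params = {
    "W_q": 0.1 * rng.normal(size=(h, d_model, d_k)),
    "W_k": 0.1 * rng.normal(size=(h, d_model, d_k)),
    "W_v": 0.1 * rng.normal(size=(h, d_model, d_k)),
    "W_o": 0.1 * rng.normal(size=(h * d_k, d_model)),
    "W_1": 0.1 * rng.normal(size=(d_model, d_ff)),
    "b_1": np.zeros(d_ff),
    "W_2": 0.1 * rng.normal(size=(d_ff, d_model)),
    "b_2": np.zeros(d_model),
}
X = rng.normal(size=(4, d_model))  # 4 positions (toy data)
Y = encoder_layer(X, params)
```

Without positional information, this layer is permutation-equivariant: reordering the rows of `X` reorders the rows of `Y` in the same way, which is why Transformers add positional encodings to their inputs.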
Transformer Decoder: Layer Structure and Cross-Attention
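Each decoder layer has three sublayers, each with a residual connection, where $D$ denotes the decoder input and $H$ the encoder output:
- Masked self-attention:
  $$D_1 = D + \mathrm{MaskedMultiHead}(D, D, D)$$
- Cross-attention to encoder outputs:
  $$D_2 = D_1 + \mathrm{MultiHead}(D_1, H, H)$$
- Position-wise FFN:
  $$D_3 = D_2 + \mathrm{FFN}(D_2)$$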
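The three decoder sublayers can be sketched in the same style. As before, this is an illustration under assumptions rather than a definitive implementation: all weights are random placeholders, the helper names (`attend`, `init_attention`, `decoder_layer`) are hypothetical, and layer normalization and dropout are omitted.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend(Q_in, KV_in, p, causal=False):
    """Multi-head attention with queries from Q_in and keys/values from KV_in."""
    heads = []
    for W_q, W_k, W_v in zip(p["W_q"], p["W_k"], p["W_v"]):
        Q, K, V = Q_in @ W_q, KV_in @ W_k, KV_in @ W_v
        scores = Q @ K.T / np.sqrt(Q.shape[-1])
        if causal:  # block attention to later positions (j > i)
            future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
            scores = np.where(future, -np.inf, scores)
        heads.append(softmax(scores) @ V)
    return np.concatenate(heads, axis=-1) @ p["W_o"]

def ffn(Z, p):
    return np.maximum(0.0, Z @ p["W_1"] + p["b_1"]) @ p["W_2"] + p["b_2"]

def decoder_layer(D, H, self_p, cross_p, ffn_p):
    D1 = D + attend(D, D, self_p, causal=True)  # masked self-attention
    D2 = D1 + attend(D1, H, cross_p)            # cross-attention to encoder outputs
    return D2 + ffn(D2, ffn_p)                  # position-wise FFN

def init_attention(rng, d_model, h):
    d_k = d_model // h
    return {
        "W_q": 0.1 * rng.normal(size=(h, d_model, d_k)),
        "W_k": 0.1 * rng.normal(size=(h, d_model, d_k)),
        "W_v": 0.1 * rng.normal(size=(h, d_model, d_k)),
        "W_o": 0.1 * rng.normal(size=(h * d_k, d_model)),
    }

rng = np.random.default_rng(0)
d_model, d_ff, h = 8, 16, 2
self_p = init_attention(rng, d_model, h)
cross_p = init_attention(rng, d_model, h)
ffn_p = {
    "W_1": 0.1 * rng.normal(size=(d_model, d_ff)), "b_1": np.zeros(d_ff),
    "W_2": 0.1 * rng.normal(size=(d_ff, d_model)), "b_2": np.zeros(d_model),
}
D = rng.normal(size=(4, d_model))  # decoder input, 4 positions (toy data)
H = rng.normal(size=(6, d_model))  # encoder output, 6 positions (toy data)
out = decoder_layer(D, H, self_p, cross_p, ffn_p)
```

Because only the self-attention sublayer mixes decoder positions, and it is masked, changing a later decoder input leaves the outputs at all earlier positions unchanged.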
Conclusion
The Transformer is mathematically a layered composition of attention operations, linear transformations, and nonlinear activations acting on vector spaces. Each layer combines probability-weighted sums over positions (attention) with position-wise linear-algebraic operations, which gives a compact, abstract framework for sequence modeling.