Math description (ChatGPT)

Mathematical Formulation of the Transformer Architecture

The Transformer architecture can be described in abstract mathematical terms by focusing on its core mechanism: the attention operation. At a high level, an attention function maps a given query vector and a set of key-value vector pairs to an output vector. The output is computed as a weighted combination (typically a convex combination) of the value vectors, where each weight is determined by a compatibility function comparing the query with the corresponding key. This mechanism allows the model to selectively focus on different parts of an input sequence (or another sequence) when processing information.

The Attention Mechanism: Abstract Formulation

Let $Q$, $K$, and $V$ be vector spaces for queries, keys, and values respectively. Given a query $q \in Q$, keys $k_i \in K$, and associated values $v_i \in V$ for $i = 1, \dots, N$, the attention mechanism defines the output $o$ as:

o = \sum_{i=1}^N f(q, k_i)\cdot g(v_i),

where $f: Q \times K \to \mathbb{R}$ is a weighting function and $g: V \to V$ is a transformation of the values, often the identity map.

Typically, $f(q,k)$ is a normalized scoring function such as:

f(q, k_i) = \frac{\exp(a(q,k_i))}{\sum_{j=1}^N \exp(a(q,k_j))},

where $a(q,k) \in \mathbb{R}$ measures compatibility, leading to a weighted expectation:

o = \mathbb{E}_{i \sim \text{softmax}(a(q,K))}[v_i].
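The abstract definition above can be sketched in a few lines. This is a minimal illustration, assuming NumPy, with $g$ taken to be the identity and a plain dot product standing in for the compatibility function $a(q,k)$; the function names and the toy data are hypothetical and not part of the original formulation.

```python
import numpy as np

def attention_weights(q, keys, score):
    """Softmax-normalized weights f(q, k_i) built from a compatibility function a(q, k)."""
    scores = np.array([score(q, k) for k in keys], dtype=float)
    scores -= scores.max()  # shift for numerical stability; the softmax is unchanged
    w = np.exp(scores)
    return w / w.sum()

def attend(q, keys, values, score):
    """Output o = sum_i f(q, k_i) * v_i, with g assumed to be the identity."""
    w = attention_weights(q, keys, score)
    return w @ values

# Hypothetical toy data: three key-value pairs in two dimensions.
q = np.array([1.0, 0.0])
keys = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
values = np.array([[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]])

def dot(q, k):
    """One possible choice of a(q, k): the plain inner product."""
    return float(q @ k)

weights = attention_weights(q, keys, dot)
output = attend(q, keys, values, dot)
```

Because the weights are positive and sum to one, `output` is a convex combination of the rows of `values`, which is the expectation written above.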

Scaled Dot-Product Attention

The Transformer uses the scaled dot-product attention, defined as:

a(q, k) = \frac{\langle q, k\rangle}{\sqrt{d}},

where $d$ is the common dimension of the query and key vectors. Dividing by $\sqrt{d}$ keeps the scores from growing with the dimension. The output is computed as:

o = \sum_{i=1}^N \frac{\exp(\frac{\langle q, k_i\rangle}{\sqrt{d}})}{\sum_{j=1}^N \exp(\frac{\langle q, k_j\rangle}{\sqrt{d}})} v_i.

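Matrix formulation, where $Q$, $K$, and $V$ now denote matrices whose rows are the query, key, and value vectors, and the softmax is applied to each row: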

\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{Q K^T}{\sqrt{d}}\right) V.
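The matrix form maps directly onto array operations. The following is a minimal sketch, assuming NumPy and row-major layout (one query, key, or value per row); the shapes and random inputs are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d)) V, with Q: (M, d), K: (N, d), V: (N, d_v)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)        # (M, N) compatibility scores
    weights = softmax(scores, axis=-1)   # each row is a distribution over the N keys
    return weights @ V, weights

# Hypothetical sizes: 4 queries, 6 key-value pairs, d = 8, value dimension 5.
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(6, 8))
V = rng.normal(size=(6, 5))
out, weights = scaled_dot_product_attention(Q, K, V)
```

Row `m` of `out` is the per-query sum from the previous formula evaluated at the query `Q[m]`.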

Self-Attention and Cross-Attention

Self-attention: queries, keys, and values all come from the same sequence $X$:

Y = \text{Attention}(X,X,X).

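Cross-attention (encoder-decoder attention): the query comes from one source (a decoder vector, written $d$ here and not to be confused with the dimension $d$ above), while the keys and values come from another (the encoder vectors $e_i$, which serve as both):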

\text{CrossAttn}(d,E) = \sum_{i=1}^{N_E} f(d,e_i)e_i.
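The difference between the two cases is only in where the arguments come from. Here is a small sketch, assuming NumPy and scaled dot-product weights for $f$; the arrays `X`, `D`, and `E` are hypothetical stand-ins for a sequence, decoder states, and encoder states.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention with one row per vector."""
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))   # one sequence of 5 positions
D = rng.normal(size=(3, 4))   # hypothetical decoder states, 3 positions
E = rng.normal(size=(7, 4))   # hypothetical encoder states, 7 positions

Y_self = attention(X, X, X)    # self-attention: everything comes from X
Y_cross = attention(D, E, E)   # cross-attention: queries from D, keys and values from E

# Reordering the encoder positions (keys and values together) leaves the result unchanged.
perm = rng.permutation(len(E))
Y_cross_perm = attention(D, E[perm], E[perm])
```

The output always has one row per query, so `Y_cross` has 3 rows even though it attends over 7 encoder positions.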

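Masked self-attention preserves causality in the decoder: the query $q_j$ at position $j$ places zero weight on every later position $i$, which is typically achieved by setting $a(q_j,k_i) = -\infty$ for $i > j$ before the softmax:

f(q_j,k_i) = 0 \quad \text{if } i > j.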
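The masking rule can be sketched as follows, assuming NumPy and the common approach of setting disallowed scores to negative infinity before the softmax, so that they receive exactly zero weight. The function name and test data are illustrative.

```python
import numpy as np

def causal_attention(Q, K, V):
    """Scaled dot-product self-attention where position j only sees positions i <= j."""
    n = Q.shape[0]
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    allowed = np.tril(np.ones((n, n), dtype=bool))   # allowed[j, i] is True iff i <= j
    scores = np.where(allowed, scores, -np.inf)      # masked entries get weight exp(-inf) = 0
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))
Y, W = causal_attention(X, X, X)

# Changing the last position must not affect the outputs at earlier positions.
X_changed = X.copy()
X_changed[-1] += 1.0
Y_changed, _ = causal_attention(X_changed, X_changed, X_changed)
```

The diagonal is always allowed, so every row keeps at least one finite score and the normalization is well defined.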

Multi-Head Attention

Multi-head attention runs $h$ attention functions in parallel, each with its own learned projection matrices. For each head $j = 1, \dots, h$:

Q_j = XW_j^Q,\quad K_j = XW_j^K,\quad V_j = XW_j^V,

and

O_j = \text{Attention}(Q_j,K_j,V_j).

The head outputs are concatenated and projected back to the model dimension:

\text{MultiHead}(X) = [O_1 \oplus O_2 \oplus \dots \oplus O_h]W^O.
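As an illustration of the per-head projections and the final concatenation, here is a minimal sketch assuming NumPy. The sizes, the choice `d_head = d_model // h`, and the random weights are assumptions made for the example, not values given in the text.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def multi_head(X, W_Q, W_K, W_V, W_O):
    """W_Q, W_K, W_V: (h, d_model, d_head); W_O: (h * d_head, d_model)."""
    heads = []
    for j in range(W_Q.shape[0]):
        Q_j, K_j, V_j = X @ W_Q[j], X @ W_K[j], X @ W_V[j]
        heads.append(attention(Q_j, K_j, V_j))     # O_j
    return np.concatenate(heads, axis=-1) @ W_O    # [O_1 ... O_h] W^O

# Hypothetical sizes: 6 positions, model width 16, 4 heads of width 4.
n, d_model, h = 6, 16, 4
d_head = d_model // h
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d_model))
W_Q, W_K, W_V = (rng.normal(size=(h, d_model, d_head)) for _ in range(3))
W_O = rng.normal(size=(h * d_head, d_model))
out = multi_head(X, W_Q, W_K, W_V, W_O)
```

Concatenating and then multiplying by $W^O$ is equivalent to summing each head's output times its own block of rows of $W^O$, which is one way to read the output projection.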

Transformer Encoder: Layer Structure

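Each encoder layer comprises the following two sublayers (layer normalization, which the full architecture applies around each sublayer, is omitted here):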

Multi-head self-attention with residual connection:

Y = \text{MultiHead}(X^{(\ell)}),\quad X^{(\ell)} \mapsto X^{(\ell)} + Y.

Position-wise feed-forward network (FFN) applied per position with residual:

\text{FFN}(z) = W_2\sigma(W_1 z + b_1) + b_2,

where $\sigma$ is a pointwise nonlinearity (ReLU in the original Transformer), resulting in:

X^{(\ell+1)} = X^{(\ell)} + Y + \text{FFN}(X^{(\ell)} + Y).
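Putting the two sublayers together, an encoder layer in this simplified form can be sketched as below. This assumes NumPy, ReLU for $\sigma$, row-vector layout (so the FFN is written as `Z @ W1 + b1`), and no layer normalization; the parameter dictionary and `init_params` helper are hypothetical conveniences for the example.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head(Xq, Xkv, heads, W_O):
    """Queries projected from Xq, keys and values from Xkv; heads is a list of (W_Q, W_K, W_V)."""
    outs = []
    for W_Q, W_K, W_V in heads:
        Q, K, V = Xq @ W_Q, Xkv @ W_K, Xkv @ W_V
        outs.append(softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V)
    return np.concatenate(outs, axis=-1) @ W_O

def ffn(Z, W1, b1, W2, b2):
    """Row-vector form of FFN(z) = W_2 sigma(W_1 z + b_1) + b_2, with sigma assumed to be ReLU."""
    return np.maximum(Z @ W1 + b1, 0.0) @ W2 + b2

def encoder_layer(X, p):
    Y = multi_head(X, X, p["heads"], p["W_O"])   # self-attention sublayer
    Z = X + Y                                    # first residual connection
    return Z + ffn(Z, p["W1"], p["b1"], p["W2"], p["b2"])   # X + Y + FFN(X + Y)

def init_params(rng, d_model=8, h=2, d_ff=32):
    """Hypothetical random initialization, only to make the sketch runnable."""
    d_head = d_model // h
    return {
        "heads": [tuple(rng.normal(scale=0.1, size=(d_model, d_head)) for _ in range(3))
                  for _ in range(h)],
        "W_O": rng.normal(scale=0.1, size=(h * d_head, d_model)),
        "W1": rng.normal(scale=0.1, size=(d_model, d_ff)), "b1": np.zeros(d_ff),
        "W2": rng.normal(scale=0.1, size=(d_ff, d_model)), "b2": np.zeros(d_model),
    }

rng = np.random.default_rng(0)
params = init_params(rng)
X = rng.normal(size=(5, 8))
X_next = encoder_layer(X, params)
```

The layer preserves the shape of its input, which is what allows layers to be stacked, and the residual connections mean that a layer with zero output weights reduces to the identity.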

Transformer Decoder: Layer Structure and Cross-Attention

Decoder layers have:

Masked self-attention:

U = \text{MultiHead}_{\text{masked}}(D^{(\ell)}),\quad D^{(\ell)} \mapsto D^{(\ell)} + U.

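Cross-attention to encoder outputs, with queries projected from the decoder stream $D^{(\ell)} + U$ and keys and values projected from the final encoder output $X^{(L)}$ (here $V$ denotes the cross-attention output, not a value matrix):

V = \text{MultiHead}(D^{(\ell)} + U,X^{(L)},X^{(L)}),\quad D^{(\ell)} + U \mapsto D^{(\ell)} + U + V.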

Position-wise FFN:

D^{(\ell+1)} = D^{(\ell)} + U + V + \text{FFN}(D^{(\ell)} + U + V).
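The three decoder sublayers can be sketched in the same style. This is a simplified illustration assuming NumPy, ReLU in the FFN, no layer normalization, and cross-attention queries taken from the residual stream $D^{(\ell)} + U$ as in the standard decoder; the parameter names and the `make_attn` helper are hypothetical.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head(Xq, Xkv, heads, W_O, causal=False):
    """Queries from Xq, keys and values from Xkv; optional causal mask for self-attention."""
    outs = []
    for W_Q, W_K, W_V in heads:
        Q, K, V = Xq @ W_Q, Xkv @ W_K, Xkv @ W_V
        scores = Q @ K.T / np.sqrt(Q.shape[-1])
        if causal:
            allowed = np.tril(np.ones(scores.shape, dtype=bool))
            scores = np.where(allowed, scores, -np.inf)   # zero weight on later positions
        outs.append(softmax(scores) @ V)
    return np.concatenate(outs, axis=-1) @ W_O

def ffn(Z, W1, b1, W2, b2):
    return np.maximum(Z @ W1 + b1, 0.0) @ W2 + b2   # ReLU assumed for sigma

def decoder_layer(D, E, p):
    U = multi_head(D, D, p["self_heads"], p["self_W_O"], causal=True)   # masked self-attention
    Z = D + U
    V = multi_head(Z, E, p["cross_heads"], p["cross_W_O"])              # cross-attention to E
    Z = Z + V
    return Z + ffn(Z, p["W1"], p["b1"], p["W2"], p["b2"])   # D + U + V + FFN(D + U + V)

def make_attn(rng, d_model, h):
    """Hypothetical random attention parameters: per-head projections and output matrix."""
    d_head = d_model // h
    heads = [tuple(rng.normal(scale=0.1, size=(d_model, d_head)) for _ in range(3))
             for _ in range(h)]
    return heads, rng.normal(scale=0.1, size=(h * d_head, d_model))

rng = np.random.default_rng(0)
d_model, h, d_ff = 8, 2, 32
self_heads, self_W_O = make_attn(rng, d_model, h)
cross_heads, cross_W_O = make_attn(rng, d_model, h)
params = {
    "self_heads": self_heads, "self_W_O": self_W_O,
    "cross_heads": cross_heads, "cross_W_O": cross_W_O,
    "W1": rng.normal(scale=0.1, size=(d_model, d_ff)), "b1": np.zeros(d_ff),
    "W2": rng.normal(scale=0.1, size=(d_ff, d_model)), "b2": np.zeros(d_model),
}
D = rng.normal(size=(4, d_model))   # decoder input, 4 positions
E = rng.normal(size=(6, d_model))   # stand-in for the final encoder output X^(L)
D_next = decoder_layer(D, E, params)

# Causality check: perturbing the last decoder position leaves earlier outputs unchanged.
D_changed = D.copy()
D_changed[-1] += 1.0
D_next_changed = decoder_layer(D_changed, E, params)
```

Only the self-attention sublayer mixes decoder positions, and it is masked, so the output at position $t$ depends on decoder positions up to $t$ and on the whole encoder output.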

Conclusion

The Transformer is mathematically a layered composition of attention operations, linear maps, and pointwise nonlinearities, tied together by residual connections. Each attention step is a probability-weighted sum of value vectors, with weights given by a softmax over query-key compatibilities, which yields a compact algebraic framework for sequence modeling.