Math description (ChatGPT)

Mathematical Formulation of the Transformer Architecture

The Transformer architecture can be described in abstract mathematical terms by focusing on its core mechanism: the attention operation. At a high level, an attention function maps a given query vector and a set of key-value vector pairs to an output vector. The output is computed as a weighted combination (typically a convex combination) of the value vectors, where each weight is determined by a compatibility function comparing the query with the corresponding key. This mechanism allows the model to selectively focus on different parts of an input sequence (or another sequence) when processing information.

The Attention Mechanism: Abstract Formulation

Let $Q$, $K$, and $V$ be vector spaces for queries, keys, and values respectively. Given a query $q \in Q$, keys $k_i \in K$, and associated values $v_i \in V$, the attention mechanism defines the output $o$ as:

$$o = \sum_{i=1}^N f(q, k_i)\cdot g(v_i),$$

where $f: Q \times K \to \mathbb{R}$ is a weighting function and $g: V \to V$ is often the identity map.

Typically, $f(q,k)$ is a normalized scoring function such as:

$$f(q, k_i) = \frac{\exp(a(q,k_i))}{\sum_{j=1}^N \exp(a(q,k_j))},$$

where $a(q,k) \in \mathbb{R}$ measures compatibility, leading to a weighted expectation:

$$o = \mathbb{E}_{i \sim \text{softmax}(a(q,K))}[v_i].$$
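
The abstract formulation above can be sketched in a few lines of Python. This is a minimal illustration, assuming $g$ is the identity and using a plain inner product as the compatibility function $a$; the names `softmax`, `attention`, and `dot` and the toy vectors are hypothetical choices made for this example.

```python
import math

def softmax(scores):
    """Turn real-valued scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, keys, values, a):
    """o = sum_i f(q, k_i) * v_i, with f the softmax of a(q, k_i) and g the identity."""
    weights = softmax([a(q, k) for k in keys])
    dim = len(values[0])
    output = [sum(w * v[c] for w, v in zip(weights, values)) for c in range(dim)]
    return output, weights

def dot(q, k):
    # Assumed compatibility function for this sketch: plain inner product.
    return sum(x * y for x, y in zip(q, k))

# Toy data (hypothetical): three key-value pairs in two dimensions.
q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0, 0.0], [0.0, 10.0], [-10.0, 0.0]]
o, w = attention(q, keys, values, dot)
```

Because the weights are positive and sum to one, `o` is a convex combination of the value vectors, and the key most aligned with `q` receives the largest weight.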

Scaled Dot-Product Attention

The Transformer uses scaled dot-product attention, whose compatibility function is defined as:

$$a(q, k) = \frac{\langle q, k\rangle}{\sqrt{d}},$$

where $d$ is the dimension of the query and key vectors. The output is computed as:

$$o = \sum_{i=1}^N \frac{\exp\left(\frac{\langle q, k_i\rangle}{\sqrt{d}}\right)}{\sum_{j=1}^N \exp\left(\frac{\langle q, k_j\rangle}{\sqrt{d}}\right)} v_i.$$

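Matrix formulation, with the queries, keys, and values stacked as the rows of matrices $Q$, $K$, and $V$ (the symbols now denote matrices rather than the spaces above) and the softmax applied to each row:

$$\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{Q K^T}{\sqrt{d}}\right) V.$$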
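
The matrix form can be sketched with matrices stored as lists of rows. This is an illustrative sketch rather than an optimized implementation; it assumes the softmax is taken over each row of the score matrix, and the helper names (`matmul`, `transpose`, `scaled_dot_product_attention`) and toy matrices are hypothetical.

```python
import math

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d)) V, with the softmax taken over each row."""
    d = len(K[0])  # dimension of the query and key vectors
    scores = matmul(Q, transpose(K))
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, V), weights

# Toy data (hypothetical): 2 queries, 3 key-value pairs, dimension 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
O, W = scaled_dot_product_attention(Q, K, V)
```

Each row of `W` holds the weights for one query, so `O` has one output row per query and one column per value dimension.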

Self-Attention and Cross-Attention

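Self-attention draws queries, keys, and values from the same sequence $X$:

$$Y = \text{Attention}(X,X,X).$$

Cross-attention takes a query $d$ from one sequence and uses the elements $e_i$ of another sequence $E$ of length $N_E$ as both keys and values (here $d$ denotes a query vector, not the dimension):

$$\text{CrossAttn}(d,E) = \sum_{i=1}^{N_E} f(d,e_i)\,e_i.$$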
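
The difference between the two variants is only where the queries, keys, and values come from. The sketch below makes that explicit, assuming scaled dot-product weights and no learned projections; `attend`, `self_attention`, `cross_attention`, and the toy sequences are hypothetical names for this example.

```python
import math

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    d = len(query)
    scores = [sum(x * y for x, y in zip(query, k)) / math.sqrt(d) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))]

def self_attention(X):
    # Queries, keys, and values all come from the same sequence X.
    return [attend(x, X, X) for x in X]

def cross_attention(D, E):
    # Queries come from D; keys and values both come from E.
    return [attend(dq, E, E) for dq in D]

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical 3-element sequence
D = [[0.5, 0.5]]                          # hypothetical single query state
Y = self_attention(X)       # one output per element of X
C = cross_attention(D, X)   # one output per element of D
```

The output length always follows the query sequence, which is why cross-attention over an encoder sequence of any length still returns one vector per decoder position.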

Masked self-attention preserves causality in the decoder: the query $q_j$ at position $j$ places zero weight on every later position,

$$f(q_j,k_i) = 0 \quad \text{if } i > j.$$
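
One common way to obtain exactly zero weights is to set the masked scores to negative infinity before the softmax. The sketch below assumes that approach; `causal_weights` and the score matrix are hypothetical, with row $j$ holding the scores of query $j$ against every key $i$.

```python
import math

def causal_weights(scores):
    """Row j holds the weights f(q_j, k_i); entries with i > j are forced to 0."""
    n = len(scores)
    weights = []
    for j in range(n):
        # Keep positions i <= j, replace later positions by -inf.
        masked = [scores[j][i] if i <= j else float("-inf") for i in range(n)]
        m = max(masked)  # finite, because position j always sees itself
        exps = [math.exp(s - m) for s in masked]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Hypothetical 3x3 score matrix.
scores = [[0.0, 1.0, 2.0],
          [1.0, 0.0, 1.0],
          [2.0, 1.0, 0.0]]
W = causal_weights(scores)
```

The resulting weight matrix is lower triangular, and each row still sums to one because the normalization runs only over the visible positions.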

Multi-Head Attention

Multi-head attention operates several attention layers in parallel. For each head $j$:

$$Q_j = XW_j^Q,\quad K_j = XW_j^K,\quad V_j = XW_j^V,$$

and

$$O_j = \text{Attention}(Q_j,K_j,V_j).$$

The head outputs are concatenated along the feature dimension and projected back:

$$\text{MultiHead}(X) = [O_1 \oplus O_2 \oplus \dots \oplus O_h]W^O.$$
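
The per-head projections, attention, concatenation, and output projection can be sketched as follows. This is a minimal sketch with matrices as lists of rows; the projection matrices here are hand-picked selection matrices (an assumption for illustration, where a trained model would learn them), and all names are hypothetical.

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d)) V with a row-wise softmax."""
    d = len(K[0])
    KT = [list(col) for col in zip(*K)]
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in matmul(Q, KT)]
    return matmul(weights, V)

def multi_head(X, heads, W_O):
    """heads is a list of (W_Q, W_K, W_V) triples, one per head."""
    outputs = [attention(matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V))
               for W_Q, W_K, W_V in heads]
    # Concatenate O_1 ... O_h along the feature axis, position by position.
    concat = [[x for O in outputs for x in O[t]] for t in range(len(X))]
    return matmul(concat, W_O)

# Toy setup (hypothetical): 3 positions, model dimension 4, two heads of size 2.
X = [[1.0, 0.0, 0.0, 1.0],
     [0.0, 1.0, 1.0, 0.0],
     [1.0, 1.0, 0.0, 0.0]]
select_first = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]  # keeps coordinates 0-1
select_last = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]   # keeps coordinates 2-3
heads = [(select_first,) * 3, (select_last,) * 3]
W_O = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]  # identity
M = multi_head(X, heads, W_O)
```

With these selection matrices each head attends within its own two-coordinate subspace, which shows how heads can compute different weightings over the same positions.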

Transformer Encoder: Layer Structure

Each encoder layer comprises:

  1. Multi-head self-attention with residual connection:
$$Y = \text{MultiHead}(X^{(\ell)}),\quad X^{(\ell)} \mapsto X^{(\ell)} + Y.$$
  2. Position-wise feed-forward network (FFN) applied per position with residual:
$$\text{FFN}(z) = W_2\sigma(W_1 z + b_1) + b_2,$$

resulting in:

$$X^{(\ell+1)} = X^{(\ell)} + Y + \text{FFN}(X^{(\ell)} + Y).$$
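
The two residual steps can be sketched as a single function. To keep the example short, the attention sub-layer is passed in as a function, and a uniform-averaging stand-in is used in place of multi-head attention; that stand-in, the choice of ReLU for $\sigma$, and the identity FFN weights are all assumptions made for illustration.

```python
def ffn(z, W1, b1, W2, b2):
    """FFN(z) = W2 sigma(W1 z + b1) + b2 for one position, with sigma = ReLU (assumed)."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, z)) + b)
              for row, b in zip(W1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b for row, b in zip(W2, b2)]

def encoder_layer(X, attention_sublayer, W1, b1, W2, b2):
    Y = attention_sublayer(X)
    H = [[x + y for x, y in zip(xr, yr)] for xr, yr in zip(X, Y)]  # X + Y
    # X^(l+1) = H + FFN(H), applied to each position independently.
    return [[h + f for h, f in zip(hr, ffn(hr, W1, b1, W2, b2))] for hr in H]

def uniform_attention(X):
    # Hypothetical stand-in for MultiHead: every position receives the mean of all positions.
    n = len(X)
    mean = [sum(col) / n for col in zip(*X)]
    return [mean[:] for _ in X]

identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [0.0, 0.0]
X = [[1.0, -1.0], [3.0, 1.0]]
X_next = encoder_layer(X, uniform_attention, identity, zeros, identity, zeros)
```

With identity weights and zero biases the FFN reduces to a ReLU, so negative coordinates of $X^{(\ell)} + Y$ pass through the residual path unchanged while positive ones are doubled.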

Transformer Decoder: Layer Structure and Cross-Attention

Decoder layers have:

  1. Masked self-attention:
$$U = \text{MultiHead}_{\text{masked}}(D^{(\ell)}),\quad D^{(\ell)} \mapsto D^{(\ell)} + U.$$
  2. Cross-attention to encoder outputs, with queries taken from the updated state $D^{(\ell)} + U$ and keys and values from the final encoder output $X^{(L)}$:
$$V = \text{MultiHead}(D^{(\ell)} + U,X^{(L)},X^{(L)}),\quad D^{(\ell)} + U \mapsto D^{(\ell)} + U + V.$$
  3. Position-wise FFN:
$$D^{(\ell+1)} = D^{(\ell)} + U + V + \text{FFN}(D^{(\ell)} + U + V).$$
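
The three decoder steps can be sketched end to end. This sketch makes several simplifying assumptions: single-head scaled dot-product attention stands in for MultiHead, the cross-attention queries are the residual sum $D^{(\ell)} + U$ so that the state matches the final update formula, and the FFN is a bare ReLU. All function names and toy inputs are hypothetical.

```python
import math

def attend(queries, keys, values, causal=False):
    """Single-head scaled dot-product attention, a stand-in for MultiHead."""
    d = len(keys[0])
    out = []
    for j, q in enumerate(queries):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        if causal:
            # Hide positions i > j from the query at position j.
            scores = [s if i <= j else float("-inf") for i, s in enumerate(scores)]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        out.append([sum(e / total * v[c] for e, v in zip(exps, values))
                    for c in range(len(values[0]))])
    return out

def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def relu_ffn(z):
    # Assumed FFN with identity weights and zero biases: FFN(z) = ReLU(z).
    return [max(0.0, x) for x in z]

def decoder_layer(D, X_enc):
    U = attend(D, D, D, causal=True)          # masked self-attention
    S = add(D, U)                             # D + U
    V = attend(S, X_enc, X_enc)               # cross-attention to encoder outputs
    T = add(S, V)                             # D + U + V
    return add(T, [relu_ffn(t) for t in T])   # T + FFN(T)

X_enc = [[1.0, 1.0], [0.0, 2.0]]   # hypothetical encoder output
D_a = [[1.0, 0.0], [0.0, 1.0]]
D_b = [[1.0, 0.0], [5.0, -3.0]]    # same first position, different second position
out_a = decoder_layer(D_a, X_enc)
out_b = decoder_layer(D_b, X_enc)
```

Comparing `out_a[0]` with `out_b[0]` illustrates causality: the first output position does not depend on the later decoder input, while every position may attend to the full encoder output.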

Conclusion

The Transformer is mathematically a layered composition of attention-based operations, linear transformations, and nonlinear activations, defined in vector spaces. Each layer uses probability-weighted sums (attention) and linear algebraic operations, providing a rich, abstract framework for sequence modeling.