High dimensional vector spaces

Semantic vector embeddings (layer 0) in the transformer, as well as the FFN step in each transformer layer, use high-dimensional vector spaces (512+ dimensions) in which the (transformed) tokens live. Some counter-intuitive things happen when we move from 3D space to these high-dimensional spaces.

The empty space phenomenon ("the curse of dimensionality")

Uniformly (homogeneously) distributed data points in a high-dimensional space $\R^d$ become increasingly sparse as the dimension $d$ grows, because the volume of such a space is so large. This is tied to the fact that the volume of a shape in high dimensions concentrates near its surface (boundary). To see this, consider a hypercube of side length 1 and a hypersphere of radius 0.5, both centered at the origin. Their volumes are $V_{cube}=1^d=1$ and $V_{sphere}=\frac{\pi^{d/2}}{\Gamma(\frac{d}{2}+1)}\cdot 0.5^d$, and thus

\lim_{d\rightarrow \infty}\frac{V_{sphere}}{V_{cube}}=0

The ratio vanishes because $\Gamma(\frac{d}{2}+1)$ grows faster than any exponential in $d$. This implies that the sphere occupies almost none of the cube's volume: most of the space lies between the sphere and the cube, that is, in the corners of the hypercube, even though the sphere still touches the cube at the center of each face. In other words, if you sample points uniformly from the hypercube, the probability of drawing a point inside the sphere is almost zero:

P(\|x\| \le 0.5)\approx 0
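The shrinking ratio can be checked numerically. The following is a minimal Python sketch, not part of the original notes: the function names `ball_to_cube_volume_ratio` and `estimate_hit_probability` are made up for illustration, and the sample count and seed are arbitrary assumptions. It evaluates the closed-form ratio in log space and compares it with a Monte Carlo estimate obtained by sampling uniformly from the cube $[-0.5, 0.5]^d$.

```python
import math
import random


def ball_to_cube_volume_ratio(d: int, radius: float = 0.5) -> float:
    """Volume of the d-dimensional ball of the given radius, divided by
    the volume of the unit cube (which is 1 in every dimension)."""
    # Work in log space so large d does not overflow the Gamma function:
    # log V = (d/2) * log(pi) + d * log(r) - log Gamma(d/2 + 1)
    log_volume = (
        (d / 2) * math.log(math.pi)
        + d * math.log(radius)
        - math.lgamma(d / 2 + 1)
    )
    return math.exp(log_volume)


def estimate_hit_probability(d: int, n_samples: int = 20_000, seed: int = 0) -> float:
    """Monte Carlo estimate of P(||x|| <= 0.5) for x uniform in [-0.5, 0.5]^d."""
    rng = random.Random(seed)  # fixed seed (an arbitrary choice) for repeatability
    hits = 0
    for _ in range(n_samples):
        squared_norm = sum((rng.random() - 0.5) ** 2 for _ in range(d))
        if squared_norm <= 0.25:  # 0.25 = 0.5 ** 2, avoids a square root
            hits += 1
    return hits / n_samples


if __name__ == "__main__":
    for d in (2, 3, 5, 10, 20):
        exact = ball_to_cube_volume_ratio(d)
        sampled = estimate_hit_probability(d)
        print(f"d={d:3d}  exact ratio={exact:.3e}  sampled={sampled:.4f}")
```

For $d=2$ the exact ratio is $\pi/4$ and for $d=3$ it is $\pi/6$. Beyond a few dozen dimensions the sampled estimate is typically exactly zero, because no sample lands inside the sphere.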

This fact sharpens our intuition. Now consider density: how large must a hypercube (or hypersphere, but we stick to the cube here) be to capture a fraction $f$ of the data points? Let $l$ be the side length of this hypercube, placed inside the unit cube. Because the data is uniform, the captured fraction equals the volume of the small cube, so

l^d \approx f
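Solving this relation for the side length gives $l \approx f^{1/d}$. The short Python sketch below tabulates it, treating $f$ as a fraction between 0 and 1. The helper name `side_length_for_fraction` and the example values $f = 0.01$ and the listed dimensions are illustrative assumptions, not taken from the notes.

```python
def side_length_for_fraction(f: float, d: int) -> float:
    """Side length l of a sub-cube of the unit cube that holds a fraction f
    of uniformly distributed points, obtained from l ** d = f."""
    if not 0.0 < f <= 1.0:
        raise ValueError("f must be a fraction in (0, 1]")
    if d < 1:
        raise ValueError("d must be a positive integer")
    return f ** (1.0 / d)


if __name__ == "__main__":
    fraction = 0.01  # capture 1% of the data (illustrative choice)
    for d in (1, 2, 3, 10, 100, 512):
        l = side_length_for_fraction(fraction, d)
        print(f"d={d:4d}  side length l={l:.3f}")
```

With $f = 0.01$, the required side length is about 0.63 at $d = 10$ and about 0.99 at $d = 512$: to capture just 1% of the points, the sub-cube must span almost the full range of every coordinate.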

The "Gaussian Annulus" Theorem

Distance concentration

Orthogonality of random vectors

Diagonals dominate coordinates