High dimensional vector spaces

The semantic token embeddings (layer 0) of a transformer, as well as the FFN step in each transformer layer, operate in high-dimensional vector spaces (512+ dimensions) in which the (transformed) tokens live. Several counter-intuitive things happen when we move from 3D space to these high-dimensional spaces.

The empty space phenomenon ("the curse of dimensionality")

Uniformly (homogeneously) distributed data points in a high-dimensional space $\R^d$ become increasingly sparse as the dimension $d$ grows, because the volume of such spaces is so huge. This follows from the fact that the volume of a shape in high dimensions concentrates near its surface (boundary). To see this, observe that for a hypercube of side length 1 and a hypersphere of radius 0.5, both centered at the origin, we have $V_{cube}=1^d$ and $V_{sphere}=\frac{\pi^{d/2}}{\Gamma(\frac{d}{2}+1)}\cdot 0.5^d$, and thus

$$\lim_{d\rightarrow \infty}\frac{V_{sphere}}{V_{cube}}=0$$
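To get a feel for how quickly this ratio collapses, the two volume formulas can be evaluated numerically. The following is a minimal sketch, not part of the original derivation: the function name `sphere_to_cube_volume_ratio` and the chosen dimensions are illustrative assumptions. It works in log space because $\pi^{d/2}$ and $\Gamma(\frac{d}{2}+1)$ overflow ordinary floats for large $d$.

```python
import math


def sphere_to_cube_volume_ratio(d: int, radius: float = 0.5) -> float:
    """Volume of a d-dimensional ball of the given radius divided by the
    volume of the unit cube (which is 1^d = 1).

    Hypothetical helper for illustration only.
    """
    # log V_sphere = (d/2) * log(pi) - log Gamma(d/2 + 1) + d * log(radius)
    log_volume = (
        (d / 2) * math.log(math.pi)
        - math.lgamma(d / 2 + 1)
        + d * math.log(radius)
    )
    # For very large d the true value is smaller than the smallest
    # representable float, so exp() underflows to 0.0.
    return math.exp(log_volume)


for d in (1, 2, 3, 10, 100, 512):
    ratio = sphere_to_cube_volume_ratio(d)
    print(f"d={d:>3}  V_sphere / V_cube = {ratio:.3e}")
```

For $d=1$ the ratio is exactly 1 (the "sphere" is the same interval as the cube) and for $d=2$ it is $\pi/4$. It then shrinks faster than exponentially as $d$ grows.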

This implies that the sphere itself occupies almost no volume: most of the space lies between the sphere and the cube, that is, in the corners of the hypercube, since the sphere still touches the cube at the center of each face. In other words, if you sample a point uniformly from within the hypercube, the probability that it falls inside the sphere is almost zero:

$$P(||x||<0.5)\approx 0$$
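This probability can also be estimated empirically by drawing uniform points from the cube $[-0.5, 0.5]^d$ and counting how many land inside the inscribed sphere of radius 0.5. The snippet below is a rough Monte Carlo sketch under assumed settings: the function name, sample count, and seed are illustrative choices. It uses only the Python standard library.

```python
import random


def fraction_inside_sphere(
    d: int, n_samples: int = 20_000, radius: float = 0.5, seed: int = 0
) -> float:
    """Estimate P(||x|| < radius) for x uniform in the cube [-0.5, 0.5]^d.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs give the same estimate
    inside = 0
    for _ in range(n_samples):
        # Compare squared norms to avoid computing a square root per sample.
        squared_norm = sum(rng.uniform(-0.5, 0.5) ** 2 for _ in range(d))
        if squared_norm < radius ** 2:
            inside += 1
    return inside / n_samples


for d in (2, 5, 10, 20):
    print(f"d={d:>2}  estimated P(||x|| < 0.5) = {fraction_inside_sphere(d):.4f}")
```

In two dimensions the estimate should land near $\pi/4$. By $d=20$ the true probability is on the order of $10^{-8}$, so a run of this size will typically see no hits at all.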

This fact sharpens our intuition. Now let us turn to density by asking how big a hypercube (or hypersphere, but we stick to the cube here) has to be to capture a fraction $f$ of the data points. Let $l$ be the side length of this hypercube; then, because the data is uniform, we have

$$l^d \approx f$$
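Solving this relation for the side length gives $l \approx f^{1/d}$. The short sketch below tabulates it for a few dimensions, assuming the data is uniform in the unit cube. The function name `side_length_for_fraction` and the example fraction $f = 0.1$ are illustrative assumptions.

```python
def side_length_for_fraction(f: float, d: int) -> float:
    """Side length l of a sub-cube of the unit cube that captures a
    fraction f of uniformly distributed points, from l^d = f.

    Hypothetical helper for illustration only.
    """
    if not 0.0 < f <= 1.0:
        raise ValueError("f must be a fraction in (0, 1]")
    if d < 1:
        raise ValueError("d must be a positive integer")
    return f ** (1.0 / d)


f = 0.1  # capture 10% of the data
for d in (1, 2, 3, 10, 100, 512):
    print(f"d={d:>3}  required side length l = {side_length_for_fraction(f, d):.4f}")
```

In one dimension, capturing 10% of the data needs an interval covering 10% of the range. In 10 dimensions the sub-cube must already span roughly 79% of each axis, and at $d = 512$ more than 99%. A "local" neighborhood therefore stops being local.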

The "Gaussian Annulus" Theorem

Distance concentration

Orthogonality of random vectors

Diagonals dominate coordinates