Embeddings

7680-Dimensional Embeddings: More Dimensions, Better Retrieval

Embedding dimensions have standardized around a handful of conventional sizes: 768, 1536, occasionally 4096. We asked a simple question: what happens if we go bigger? The answer surprised us.

Background: Why Dimensions Matter

Text embeddings map variable-length sequences to fixed-dimensional vectors. These vectors enable semantic similarity search, clustering, and retrieval. The dimension count determines the vector space's capacity.

Lower dimensions mean:

- Smaller storage requirements
- Faster similarity computations
- Potential information loss

Higher dimensions mean:...

December 5, 2022 · 3 min · 507 words · Zach Kelling
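The storage and compute tradeoffs listed in the excerpt above scale linearly with dimension. A back-of-envelope sketch, assuming a hypothetical corpus of one million float32 vectors (the corpus size and precision are our assumptions, not figures from the post):

```python
# Cost model for the dimension tradeoff: storage and brute-force
# query cost both grow linearly with embedding dimension d.
# N_DOCS and float32 precision are illustrative assumptions.

N_DOCS = 1_000_000  # hypothetical corpus size

def index_costs(d: int) -> tuple[float, float]:
    """Return (storage in GB, GFLOPs per brute-force query) for float32 vectors."""
    storage_gb = N_DOCS * d * 4 / 1e9    # 4 bytes per float32 component
    gflops = N_DOCS * (2 * d - 1) / 1e9  # d multiplies + (d - 1) adds per dot product
    return storage_gb, gflops

for d in (768, 1536, 4096, 7680):
    gb, gf = index_costs(d)
    print(f"d={d:>4}: {gb:5.2f} GB storage, {gf:5.2f} GFLOPs per query")
```

At this corpus size, going from 768 to 7680 dimensions multiplies both the index size and the brute-force query cost by ten, which is the core of the tradeoff the post examines.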

Embedding Spaces at 7680 Dimensions

The Dimension Question

How many dimensions does a text embedding need? The field has settled on conventions: 768 for BERT-scale models, 1536 for OpenAI's ada-002, 4096 for some recent models. But these choices reflect architectural constraints, not fundamental requirements. We investigate what happens when we scale embedding dimensions to 7680, ten times the BERT baseline.

Why Higher Dimensions?

Capacity Arguments

A $d$-dimensional embedding space can represent $\mathcal{O}(e^d)$ nearly-orthogonal vectors. For semantic search, we want documents with different meanings to map to different regions....

December 5, 2022 · 4 min · 647 words · Zach Kelling
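The capacity argument in the excerpt above can be sanity-checked empirically: random unit vectors in high dimensions are nearly orthogonal, and they concentrate more tightly as $d$ grows. A minimal sketch (the sample count and seed are arbitrary choices of ours):

```python
# Empirical check of the near-orthogonality claim: cosine similarity
# between random unit vectors in R^d concentrates around 0 as d grows.
import numpy as np

rng = np.random.default_rng(0)

def max_abs_cosine(d: int, n: int = 200) -> float:
    """Largest |cosine| between any pair of n random unit vectors in R^d."""
    v = rng.standard_normal((n, d))
    v /= np.linalg.norm(v, axis=1, keepdims=True)  # project onto unit sphere
    cos = v @ v.T                                  # all pairwise cosines
    np.fill_diagonal(cos, 0.0)                     # ignore self-similarity
    return float(np.abs(cos).max())

for d in (768, 7680):
    print(f"d={d}: max |cos| among 200 random vectors = {max_abs_cosine(d):.3f}")
```

The worst-case overlap among random vectors shrinks as the dimension rises, which is the intuition behind fitting exponentially many nearly-orthogonal directions into a $d$-dimensional space.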