The Dimension Question

How many dimensions does a text embedding need?

The field has settled on conventions: 768 for BERT-scale models, 1536 for OpenAI’s ada-002, 4096 for some recent models. But these choices reflect architectural constraints, not fundamental requirements.

We investigate what happens when we scale embedding dimensions to 7680—ten times the BERT baseline.

Why Higher Dimensions?

Capacity Arguments

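A $d$-dimensional embedding space can hold exponentially many nearly orthogonal vectors: on the order of $e^{cd}$, where the constant $c$ depends on how far from orthogonal the vectors are allowed to be. For semantic search, we want documents with different meanings to map to different regions, and higher dimensions provide more room.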
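The capacity argument can be checked numerically. The sketch below draws random unit vectors and measures how close to orthogonal they are; the function name `mean_abs_cosine`, the sample size, and the seed are illustrative choices, not part of the original experiments.

```python
import numpy as np

def mean_abs_cosine(d, n=200, seed=0):
    """Mean |cosine| over distinct pairs of n random unit vectors in d dimensions."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n, d))
    x /= np.linalg.norm(x, axis=1, keepdims=True)  # project onto the unit sphere
    sims = x @ x.T
    off_diag = sims[~np.eye(n, dtype=bool)]        # drop self-similarities
    return float(np.abs(off_diag).mean())

for d in (768, 7680):
    print(d, round(mean_abs_cosine(d), 4))
```

For random directions the typical |cosine| shrinks roughly as $1/\sqrt{d}$, so the 7680-dimensional value should come out about $\sqrt{10} \approx 3.2$ times smaller than the 768-dimensional one.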

Information-Theoretic View

Consider a corpus of $N$ documents, each with entropy $H$ bits. A $d$-dimensional float32 embedding stores $32d$ bits. For lossless encoding:

$$32d \geq N \cdot H$$

For a corpus of 1 billion documents with 100 bits of effective information each, we need:

$$d \geq \frac{10^9 \cdot 100}{32} \approx 3 \times 10^9$$

Clearly, embeddings are lossy compression. But higher dimensions reduce the loss.

Empirical Observations

Retrieval quality on our internal benchmarks at each embedding size:

768 dimensions: 71% recall@10
1536 dimensions: 78% recall@10
3072 dimensions: 83% recall@10
7680 dimensions: 87% recall@10

Diminishing returns set in (7 points from the first doubling, 5 from the second, and 4 from the final 2.5x increase), but gains persist well beyond conventional embedding sizes.

Training High-Dimensional Embeddings

Architecture

We use a standard transformer encoder with a projection head:


    Input --> Transformer(L=12, H=768) --> Pool --> Linear(768, 7680) --> Normalize

The projection head maps from the transformer’s hidden dimension to the embedding space. This decouples embedding size from encoder compute: only the final linear layer grows with the embedding dimension.
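A minimal NumPy sketch of the pooling, projection, and normalization steps in the diagram follows. The diagram says only "Pool", so masked mean pooling is an assumption here, as are the function name `embed` and the randomly initialized weights standing in for trained parameters.

```python
import numpy as np

HIDDEN_DIM = 768   # transformer hidden size from the diagram
EMBED_DIM = 7680   # target embedding size

def embed(token_states, mask, weight, bias):
    """Mean-pool token states, project to EMBED_DIM, and L2-normalize.

    token_states: (batch, seq_len, HIDDEN_DIM) encoder outputs
    mask:         (batch, seq_len), 1 for real tokens and 0 for padding
    weight, bias: parameters of the Linear(768, 7680) projection head
    """
    mask = mask[..., None].astype(token_states.dtype)
    # Assumed pooling: average over non-padding tokens only.
    pooled = (token_states * mask).sum(axis=1) / np.clip(mask.sum(axis=1), 1.0, None)
    projected = pooled @ weight + bias
    return projected / np.linalg.norm(projected, axis=1, keepdims=True)

rng = np.random.default_rng(0)
states = rng.standard_normal((2, 16, HIDDEN_DIM))
mask = np.ones((2, 16))
mask[1, 10:] = 0  # second sequence ends with 6 padding tokens
W = rng.standard_normal((HIDDEN_DIM, EMBED_DIM)) / np.sqrt(HIDDEN_DIM)
b = np.zeros(EMBED_DIM)
vectors = embed(states, mask, W, b)
print(vectors.shape)  # (2, 7680)
```

With these sizes the head adds 768 × 7680 weights plus 7680 biases, about 5.9 million parameters, while the encoder itself is unchanged.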

Contrastive Learning

We train with InfoNCE loss over batches of (query, positive, negatives):

$$\mathcal{L} = -\log \frac{\exp(q \cdot p^+ / \tau)}{\sum_{i} \exp(q \cdot p_i / \tau)}$$

We set the temperature to $\tau = 0.01$, lower than is typical, to suit the high-dimensional space.
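For a single query, the loss can be sketched in NumPy as below. The function name `info_nce` and its argument layout are illustrative, and the sum over $i$ is assumed to run over the positive together with the negatives.

```python
import numpy as np

def info_nce(q, candidates, positive_idx, tau=0.01):
    """InfoNCE loss for one query.

    q:            (d,) L2-normalized query embedding
    candidates:   (n, d) L2-normalized passages: the positive plus the negatives
    positive_idx: row of `candidates` that holds the positive
    """
    logits = candidates @ q / tau
    shifted = logits - logits.max()               # stabilizes exp; the loss is unchanged
    log_denominator = np.log(np.exp(shifted).sum())
    return float(log_denominator - shifted[positive_idx])

q = np.array([1.0, 0.0])
cands = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
print(info_nce(q, cands, positive_idx=0))  # close to 0: the positive dominates
print(info_nce(q, cands, positive_idx=1))  # about 100: the gap of 1.0 is scaled by 1/tau
```

The second call shows why a low temperature is aggressive: a similarity gap of 1.0 becomes a logit gap of 100, so small ranking errors produce large losses.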

Hard Negative Mining

High-dimensional spaces require harder negatives to provide gradient signal. Our mining strategy:

Retrieve top-100 candidates via approximate nearest neighbor
Filter to exclude true positives
Sample negatives weighted by similarity (harder = more likely)

This curriculum focuses training on the decision boundary.
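The three mining steps can be sketched as follows. Exact search stands in for the approximate nearest neighbor index, and the softmax weighting with temperature `tau` is an assumed way to make harder candidates more likely; the text does not specify the weighting. The function name and defaults are hypothetical.

```python
import numpy as np

def sample_hard_negatives(query, corpus, positive_ids, k=100, n_neg=8, tau=0.05, seed=0):
    """Return indices of sampled hard negatives for one query.

    query:        (d,) L2-normalized query embedding
    corpus:       (n, d) L2-normalized passage embeddings
    positive_ids: indices of passages known to be relevant
    """
    rng = np.random.default_rng(seed)
    sims = corpus @ query
    # Step 1: top-k candidates (exact search stands in for the ANN index).
    top = np.argsort(-sims)[:k]
    # Step 2: drop known positives.
    top = top[~np.isin(top, list(positive_ids))]
    if len(top) == 0:
        return top
    # Step 3: sample without replacement, favoring more similar candidates.
    weights = np.exp((sims[top] - sims[top].max()) / tau)
    probs = weights / weights.sum()
    return rng.choice(top, size=min(n_neg, len(top)), replace=False, p=probs)
```

Sampling rather than always taking the top candidates keeps some variety in the negatives and lowers the chance of repeatedly training on unlabeled positives that rank near the top.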

Practical Considerations

Storage

Each 7680-dimensional float32 embedding takes 30,720 bytes, about 30 KB. For 1 billion documents:

$$10^9 \times 30\,\text{KB} = 30\,\text{TB}$$

This is substantial but manageable with modern storage.
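The arithmetic is easy to reproduce for other corpus sizes. The helper name `storage_bytes` is illustrative, and the figure covers raw vectors only, not index overhead.

```python
def storage_bytes(num_vectors, dim, bytes_per_component=4):
    """Raw storage for dense embeddings; float32 uses 4 bytes per component."""
    return num_vectors * dim * bytes_per_component

print(storage_bytes(1, 7680))             # 30720
print(storage_bytes(10**9, 7680) / 1e12)  # 30.72
```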

Quantization

We can reduce storage through quantization:

Precision	Bytes/Vector	Recall@10
float32	30,720	87.3%
float16	15,360	87.1%
int8	7,680	85.9%
binary	960	78.2%

int8 quantization provides 4x compression while costing 1.4 points of recall@10 (87.3% to 85.9%).
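One common way to get int8 codes is symmetric per-vector scaling, sketched below. This is an assumed scheme for illustration; the text does not say which quantizer produced the numbers in the table.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-vector int8 quantization: returns codes and one scale per vector."""
    scale = np.abs(x).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)  # avoid dividing by zero for all-zero rows
    codes = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return codes, scale.astype(np.float32)

def dequantize_int8(codes, scale):
    """Approximate reconstruction of the original float vectors."""
    return codes.astype(np.float32) * scale

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 7680)).astype(np.float32)
x /= np.linalg.norm(x, axis=1, keepdims=True)
codes, scale = quantize_int8(x)
x_hat = dequantize_int8(codes, scale)
cosine = (x * x_hat).sum(axis=1) / np.linalg.norm(x_hat, axis=1)
print(codes.nbytes // len(codes))  # 7680 bytes of codes per vector
print(cosine.min())                # similarity between original and reconstruction
```

This scheme stores one extra float32 scale per vector, so the true footprint is 7,684 bytes rather than 7,680.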

Approximate Search

Exact nearest neighbor search in 7680 dimensions is expensive. We use hierarchical navigable small world (HNSW) graphs:

Dimensions	Build Time	Query Time	Recall@10
768	1x	1x	99.2%
7680	3.2x	2.8x	98.7%

The overhead is sublinear in dimensionality: a 10x increase in dimensions raises build time 3.2x and query time 2.8x, thanks to efficient distance computations.
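The Recall@10 column here presumably measures how many of the exact top-10 neighbors the approximate index returns, which is a different quantity from retrieval recall against relevance labels. A sketch of that measurement follows, with brute-force search as the reference; both function names are hypothetical, and the approximate results would come from whichever HNSW library is in use.

```python
import numpy as np

def exact_top_k(queries, corpus, k=10):
    """Brute-force top-k by inner product; rows of both inputs are L2-normalized."""
    sims = queries @ corpus.T
    return np.argsort(-sims, axis=1)[:, :k]

def overlap_recall(approx_ids, exact_ids):
    """Mean fraction of exact top-k neighbors that the approximate search returned."""
    hits = sum(len(set(a) & set(e)) for a, e in zip(approx_ids, exact_ids))
    return hits / (exact_ids.shape[0] * exact_ids.shape[1])
```

Calling `overlap_recall(index_results, exact_top_k(queries, corpus))` on a held-out query sample gives a figure comparable to the table, under the assumption above.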

Benchmark Results

MS MARCO Passage Retrieval

Model	Dimensions	MRR@10	Recall@100
BM25	-	18.4	66.5
DPR	768	31.1	82.4
Contriever	768	32.8	84.1
Zen-Embed	7680	38.6	91.3

Natural Questions

Model	Dimensions	Top-20 Acc	Top-100 Acc
DPR	768	78.4	85.4
Contriever	768	81.3	88.1
Zen-Embed	7680	86.7	93.2

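High-dimensional embeddings provide substantial gains on both benchmarks: 5.8 points of MRR@10 over Contriever on MS MARCO and 5.4 points of top-20 accuracy on Natural Questions.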
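For readers reproducing these tables on their own data, the two ranking metrics can be computed per query as sketched below and then averaged over queries. The function names are illustrative, and the document ids in the usage lines are made up.

```python
def reciprocal_rank_at_k(ranked_ids, relevant_ids, k=10):
    """1 / rank of the first relevant id within the top k, or 0.0 if none appears."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def recall_at_k(ranked_ids, relevant_ids, k=100):
    """Fraction of the relevant ids that appear within the top k."""
    if not relevant_ids:
        return 0.0
    return len(set(ranked_ids[:k]) & set(relevant_ids)) / len(relevant_ids)

def mean_over_queries(metric, runs, **kwargs):
    """Average a per-query metric over (ranked_ids, relevant_ids) pairs."""
    return sum(metric(ranked, relevant, **kwargs) for ranked, relevant in runs) / len(runs)

runs = [(["d3", "d1", "d7"], {"d1"}), (["d2", "d5"], {"d9"})]
print(mean_over_queries(reciprocal_rank_at_k, runs, k=10))  # 0.25
```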

Analysis

What Do Extra Dimensions Encode?

We analyze the learned embedding space through probing tasks:

Property	768d Probe Acc	7680d Probe Acc
Topic	84.2%	86.1%
Sentiment	91.3%	92.8%
Entity	67.4%	78.9%
Relation	52.1%	71.3%

The largest gains are in entity (+11.5 points) and relation (+19.2 points) encoding, fine-grained semantic properties that require more capacity.

Nearest Neighbor Analysis

For the query “What causes inflation?”, nearest neighbors at different dimensions:

768 dimensions:

What is inflation? (similar query)
How does inflation work? (similar query)
Inflation rates by country (tangential)

7680 dimensions:

Inflation is caused by… (direct answer)
The primary drivers of inflation include… (direct answer)
Central bank policies affect inflation through… (relevant detail)

Higher dimensions better distinguish queries from answers.

Recommendations

Based on our experiments:

Try 3072+ dimensions if retrieval quality matters and storage is available
Use int8 quantization for production deployments
Invest in hard negative mining to realize the benefits of capacity
Benchmark on your data: gains vary by domain

Conventional embedding sizes are conventions, not laws. Question them.

Technical details in “Scaling Embedding Dimensions for Semantic Retrieval” (2022). Model weights available on Hugging Face.

Embedding Spaces at 7680 Dimensions