The Dimension Question

How many dimensions does a text embedding need?

The field has settled on conventions: 768 for BERT-scale models, 1536 for OpenAI’s ada-002, 4096 for some recent models. But these choices reflect architectural constraints, not fundamental requirements.

We investigate what happens when we scale embedding dimensions to 7680—ten times the BERT baseline.

Why Higher Dimensions?

Capacity Arguments

A $d$-dimensional embedding space can represent $\mathcal{O}(e^d)$ nearly-orthogonal vectors. For semantic search, we want documents with different meanings to map to different regions. Higher dimensions provide more room.

Information-Theoretic View

Consider a corpus of $N$ documents, each with entropy $H$ bits. A $d$-dimensional float32 embedding stores $32d$ bits. For lossless encoding:

$$32d \geq N \cdot H$$

For a corpus of 1 billion documents with 100 bits of effective information each, we need:

$$d \geq \frac{10^9 \cdot 100}{32} \approx 3 \times 10^9$$

Clearly, embeddings are lossy compression. But higher dimensions reduce the loss.

Empirical Observations

Retrieval quality on our internal benchmarks plateaued around:

  • 768 dimensions: 71% recall@10
  • 1536 dimensions: 78% recall@10
  • 3072 dimensions: 83% recall@10
  • 7680 dimensions: 87% recall@10

Diminishing returns set in, but gains persist well beyond conventional wisdom.

Training High-Dimensional Embeddings

Architecture

We use a standard transformer encoder with a projection head:

Input --> Transformer(L=12, H=768) --> Pool --> Linear(768, 7680) --> Normalize

The projection head maps from the transformer’s hidden dimension to the embedding space. This decouples representation capacity from compute requirements.

Contrastive Learning

We train with InfoNCE loss over batches of (query, positive, negatives):

$$\mathcal{L} = -\log \frac{\exp(q \cdot p^+ / \tau)}{\sum_{i} \exp(q \cdot p_i / \tau)}$$

With temperature $\tau = 0.01$ for high-dimensional spaces (lower than typical).

Hard Negative Mining

High-dimensional spaces require harder negatives to provide gradient signal. Our mining strategy:

  1. Retrieve top-100 candidates via approximate nearest neighbor
  2. Filter to exclude true positives
  3. Sample negatives weighted by similarity (harder = more likely)

This curriculum focuses training on the decision boundary.

Practical Considerations

Storage

7680-dimensional float32 embeddings require 30KB per vector. For 1 billion documents:

$$10^9 \times 30\text{KB} = 30\text{TB}$$

This is substantial but manageable with modern storage.

Quantization

We can reduce storage through quantization:

PrecisionBytes/VectorRecall@10
float3230,72087.3%
float1615,36087.1%
int87,68085.9%
binary96078.2%

int8 quantization provides 4x compression with minimal quality loss.

Exact nearest neighbor search in 7680 dimensions is expensive. We use hierarchical navigable small world (HNSW) graphs:

DimensionsBuild TimeQuery TimeRecall@10
7681x1x99.2%
76803.2x2.8x98.7%

The overhead is sublinear in dimensionality due to efficient distance computations.

Benchmark Results

MS MARCO Passage Retrieval

ModelDimensionsMRR@10Recall@100
BM25-18.466.5
DPR76831.182.4
Contriever76832.884.1
Zen-Embed768038.691.3

Natural Questions

ModelDimensionsTop-20 AccTop-100 Acc
DPR76878.485.4
Contriever76881.388.1
Zen-Embed768086.793.2

High-dimensional embeddings provide substantial gains on retrieval benchmarks.

Analysis

What Do Extra Dimensions Encode?

We analyze the learned embedding space through probing tasks:

Property768d Probe Acc7680d Probe Acc
Topic84.2%86.1%
Sentiment91.3%92.8%
Entity67.4%78.9%
Relation52.1%71.3%

The largest gains are in entity and relation encoding—fine-grained semantic properties that require more capacity.

Nearest Neighbor Analysis

For the query “What causes inflation?”, nearest neighbors at different dimensions:

768 dimensions:

  1. What is inflation? (similar query)
  2. How does inflation work? (similar query)
  3. Inflation rates by country (tangential)

7680 dimensions:

  1. Inflation is caused by… (direct answer)
  2. The primary drivers of inflation include… (direct answer)
  3. Central bank policies affect inflation through… (relevant detail)

Higher dimensions better distinguish queries from answers.

Recommendations

Based on our experiments:

  1. Try 3072+ dimensions if retrieval quality matters and storage is available
  2. Use int8 quantization for production deployments
  3. Invest in hard negative mining to realize the benefits of capacity
  4. Benchmark on your data—gains vary by domain

Conventional embedding sizes are conventions, not laws. Question them.


Technical details in “Scaling Embedding Dimensions for Semantic Retrieval” (2022). Model weights available on Hugging Face.