The Dimension Question
How many dimensions does a text embedding need?
The field has settled on conventions: 768 for BERT-scale models, 1536 for OpenAI’s ada-002, 4096 for some recent models. But these choices reflect architectural constraints, not fundamental requirements.
We investigate what happens when we scale embedding dimensions to 7680—ten times the BERT baseline.
Why Higher Dimensions?
Capacity Arguments
A $d$-dimensional embedding space can represent $\mathcal{O}(e^d)$ nearly-orthogonal vectors. For semantic search, we want documents with different meanings to map to different regions. Higher dimensions provide more room.
Information-Theoretic View
Consider a corpus of $N$ documents, each with entropy $H$ bits. A $d$-dimensional float32 embedding stores $32d$ bits. For lossless encoding:
$$32d \geq N \cdot H$$
For a corpus of 1 billion documents with 100 bits of effective information each, we need:
$$d \geq \frac{10^9 \cdot 100}{32} \approx 3 \times 10^9$$
Clearly, embeddings are lossy compression. But higher dimensions reduce the loss.
Empirical Observations
Retrieval quality on our internal benchmarks plateaued around:
- 768 dimensions: 71% recall@10
- 1536 dimensions: 78% recall@10
- 3072 dimensions: 83% recall@10
- 7680 dimensions: 87% recall@10
Diminishing returns set in, but gains persist well beyond conventional wisdom.
Training High-Dimensional Embeddings
Architecture
We use a standard transformer encoder with a projection head:
Input --> Transformer(L=12, H=768) --> Pool --> Linear(768, 7680) --> Normalize
The projection head maps from the transformer’s hidden dimension to the embedding space. This decouples representation capacity from compute requirements.
Contrastive Learning
We train with InfoNCE loss over batches of (query, positive, negatives):
$$\mathcal{L} = -\log \frac{\exp(q \cdot p^+ / \tau)}{\sum_{i} \exp(q \cdot p_i / \tau)}$$
With temperature $\tau = 0.01$ for high-dimensional spaces (lower than typical).
Hard Negative Mining
High-dimensional spaces require harder negatives to provide gradient signal. Our mining strategy:
- Retrieve top-100 candidates via approximate nearest neighbor
- Filter to exclude true positives
- Sample negatives weighted by similarity (harder = more likely)
This curriculum focuses training on the decision boundary.
Practical Considerations
Storage
7680-dimensional float32 embeddings require 30KB per vector. For 1 billion documents:
$$10^9 \times 30\text{KB} = 30\text{TB}$$
This is substantial but manageable with modern storage.
Quantization
We can reduce storage through quantization:
| Precision | Bytes/Vector | Recall@10 |
|---|---|---|
| float32 | 30,720 | 87.3% |
| float16 | 15,360 | 87.1% |
| int8 | 7,680 | 85.9% |
| binary | 960 | 78.2% |
int8 quantization provides 4x compression with minimal quality loss.
Approximate Search
Exact nearest neighbor search in 7680 dimensions is expensive. We use hierarchical navigable small world (HNSW) graphs:
| Dimensions | Build Time | Query Time | Recall@10 |
|---|---|---|---|
| 768 | 1x | 1x | 99.2% |
| 7680 | 3.2x | 2.8x | 98.7% |
The overhead is sublinear in dimensionality due to efficient distance computations.
Benchmark Results
MS MARCO Passage Retrieval
| Model | Dimensions | MRR@10 | Recall@100 |
|---|---|---|---|
| BM25 | - | 18.4 | 66.5 |
| DPR | 768 | 31.1 | 82.4 |
| Contriever | 768 | 32.8 | 84.1 |
| Zen-Embed | 7680 | 38.6 | 91.3 |
Natural Questions
| Model | Dimensions | Top-20 Acc | Top-100 Acc |
|---|---|---|---|
| DPR | 768 | 78.4 | 85.4 |
| Contriever | 768 | 81.3 | 88.1 |
| Zen-Embed | 7680 | 86.7 | 93.2 |
High-dimensional embeddings provide substantial gains on retrieval benchmarks.
Analysis
What Do Extra Dimensions Encode?
We analyze the learned embedding space through probing tasks:
| Property | 768d Probe Acc | 7680d Probe Acc |
|---|---|---|
| Topic | 84.2% | 86.1% |
| Sentiment | 91.3% | 92.8% |
| Entity | 67.4% | 78.9% |
| Relation | 52.1% | 71.3% |
The largest gains are in entity and relation encoding—fine-grained semantic properties that require more capacity.
Nearest Neighbor Analysis
For the query “What causes inflation?”, nearest neighbors at different dimensions:
768 dimensions:
- What is inflation? (similar query)
- How does inflation work? (similar query)
- Inflation rates by country (tangential)
7680 dimensions:
- Inflation is caused by… (direct answer)
- The primary drivers of inflation include… (direct answer)
- Central bank policies affect inflation through… (relevant detail)
Higher dimensions better distinguish queries from answers.
Recommendations
Based on our experiments:
- Try 3072+ dimensions if retrieval quality matters and storage is available
- Use int8 quantization for production deployments
- Invest in hard negative mining to realize the benefits of capacity
- Benchmark on your data—gains vary by domain
Conventional embedding sizes are conventions, not laws. Question them.
Technical details in “Scaling Embedding Dimensions for Semantic Retrieval” (2022). Model weights available on Hugging Face.