Embedding dimensions have standardized around a handful of familiar sizes: 768, 1536, occasionally 4096. We asked a simple question: what happens if we go bigger? The answer surprised us.

Background: Why Dimensions Matter

Text embeddings map variable-length sequences to fixed-dimensional vectors. These vectors enable semantic similarity search, clustering, and retrieval. The dimension count determines the vector space’s capacity.
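To make the setup concrete, here is a minimal sketch of what those fixed-dimensional vectors enable (the vectors are random stand-ins, not output from any real model): similarity search reduces to cosine similarity between arrays.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Two hypothetical 768-dimensional embeddings.
rng = np.random.default_rng(0)
doc = rng.normal(size=768)
query = doc + 0.1 * rng.normal(size=768)  # a slightly perturbed copy

print(cosine_similarity(query, doc))  # close to 1.0 for near-duplicates
```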

Lower dimensions mean:

  • Smaller storage requirements
  • Faster similarity computations
  • Potential information loss

Higher dimensions mean:

  • More expressive capacity
  • Larger memory footprint
  • Computational overhead

The conventional wisdom holds that returns diminish quickly past 1024-2048 dimensions. Our experiments challenge this.

Experimental Setup

We trained a series of embedding models with identical architectures except for output dimension:

| Model | Dimensions | Parameters |
|---|---|---|
| Zen-Embed-S | 768 | 110M |
| Zen-Embed-M | 1536 | 125M |
| Zen-Embed-L | 3072 | 155M |
| Zen-Embed-XL | 7680 | 230M |

Training data: 1.2B text pairs with contrastive learning objective.
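The post names a contrastive objective but not the exact loss; a standard choice for text-pair training is in-batch InfoNCE, sketched here with numpy (the temperature value is an illustrative assumption, not a released hyperparameter):

```python
import numpy as np

def info_nce_loss(q: np.ndarray, d: np.ndarray, tau: float = 0.05) -> float:
    """In-batch contrastive loss: q[i] should match d[i] against all other d[j].

    q, d: (batch, dim) L2-normalized query/document embeddings.
    """
    logits = q @ d.T / tau                       # (batch, batch) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))   # NLL of the diagonal (positive pairs)
```

With perfectly aligned pairs the loss approaches zero; mismatched pairs drive it up, which is the gradient signal that shapes the embedding space.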

Results

Retrieval Benchmarks

BEIR (Benchmarking IR) results across 15 datasets:

| Model | NDCG@10 | Recall@100 | MRR |
|---|---|---|---|
| Zen-Embed-S | 48.2 | 71.3 | 45.1 |
| Zen-Embed-M | 51.7 | 75.8 | 48.9 |
| Zen-Embed-L | 54.1 | 79.2 | 52.3 |
| Zen-Embed-XL | 57.3 | 83.6 | 55.8 |

The improvements continue well past conventional dimension counts.

Scaling Analysis

Plotting performance against log(dimensions) reveals near-linear scaling:

$$\text{NDCG@10} \approx 2.70 \cdot \log_2(d) + 22.7$$

This suggests embedding capacity remains a bottleneck even at high dimensions.
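As a sanity check, the log-linear fit can be recomputed directly from the BEIR table with ordinary least squares (a numpy sketch; the coefficients come out of the regression, not from any released script):

```python
import numpy as np

# NDCG@10 from the BEIR table above.
dims = np.array([768, 1536, 3072, 7680])
ndcg = np.array([48.2, 51.7, 54.1, 57.3])

# Least-squares fit of NDCG@10 against log2(dimensions).
slope, intercept = np.polyfit(np.log2(dims), ndcg, deg=1)
print(f"NDCG@10 ~= {slope:.2f} * log2(d) + {intercept:.2f}")
```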

Per-Domain Breakdown

The benefits are not uniform across domains:

| Domain | 768d | 7680d | Improvement |
|---|---|---|---|
| Scientific | 42.1 | 54.7 | +30% |
| Legal | 38.9 | 51.2 | +32% |
| Conversational | 52.3 | 55.1 | +5% |
| News | 49.8 | 53.4 | +7% |

Technical and specialized domains benefit most. Everyday conversational content sees smaller gains.

Interpretability

Higher dimensions don’t just improve metrics; they enable finer distinctions. Analysis of the 7680d space shows:

  • Cleaner clusters: Topic boundaries are sharper
  • Preserved nuance: Similar but distinct concepts remain separable
  • Hierarchical structure: Taxonomic relationships emerge naturally

The Efficiency Question

7680 dimensions cost more to store and search. Is it worth it?

Storage

| Dimensions | Bytes per Vector | 1M Vectors |
|---|---|---|
| 768 | 3,072 | 2.9 GiB |
| 7680 | 30,720 | 28.6 GiB |

10x storage for higher dimensions. Significant but manageable with modern hardware.
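The arithmetic behind the table is just dimensions × 4 bytes (float32) × vector count, ignoring index overhead; a quick sketch:

```python
def storage_bytes(dim: int, n_vectors: int, bytes_per_float: int = 4) -> int:
    """Raw storage for float32 vectors (ignores any index overhead)."""
    return dim * n_vectors * bytes_per_float

for dim in (768, 7680):
    gib = storage_bytes(dim, 1_000_000) / 2**30
    print(f"{dim}d: {gib:.1f} GiB per 1M vectors")
```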

Search Latency

Exact search scales linearly with dimensions. But approximate methods (HNSW, IVF) show sublinear scaling:

| Dimensions | Exact (ms) | HNSW (ms) | IVF-PQ (ms) |
|---|---|---|---|
| 768 | 12.3 | 0.8 | 0.3 |
| 7680 | 118.7 | 2.1 | 0.7 |

With appropriate indexing, 7680d search remains practical.
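Exact search is a single matrix-vector product, which is why its cost grows linearly with dimension; here is a numpy sketch of exact top-k retrieval (the latency figures above come from the authors' hardware, not from this code):

```python
import numpy as np

def exact_topk(query: np.ndarray, corpus: np.ndarray, k: int = 5) -> np.ndarray:
    """Indices of the k nearest corpus vectors by cosine similarity.

    query: (dim,), corpus: (n, dim); both assumed L2-normalized.
    """
    scores = corpus @ query          # O(n * dim): one dot product per vector
    return np.argsort(-scores)[:k]

rng = np.random.default_rng(0)
corpus = rng.normal(size=(10_000, 768))
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)
hits = exact_topk(corpus[42], corpus, k=3)
print(hits[0])  # the query vector itself ranks first
```

Approximate indexes like HNSW avoid the full scan by traversing a graph of neighbors, which is where the sublinear behavior in the table comes from.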

Compression

Quantization recovers much of the efficiency loss:

  • INT8: 4x compression, <1% quality loss
  • Binary: 32x compression, 5% quality loss
  • Product Quantization: 16x compression, 2% quality loss
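A minimal symmetric INT8 quantization sketch (per-vector scaling is one common scheme; the authors' exact method is not specified):

```python
import numpy as np

def quantize_int8(v: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-vector INT8 quantization: 4x smaller than float32."""
    scale = float(np.max(np.abs(v))) / 127.0
    return np.round(v / scale).astype(np.int8), scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
v = rng.normal(size=7680).astype(np.float32)
q, s = quantize_int8(v)
err = np.linalg.norm(dequantize(q, s) - v) / np.linalg.norm(v)
print(f"relative reconstruction error: {err:.4f}")  # well under 1%
```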

Practical Recommendations

Based on our experiments:

  1. If retrieval quality matters most: Use 7680d with HNSW indexing
  2. If storage is constrained: Use 7680d with INT8 quantization (7,680 bytes per vector, 2.5x the size of 768d float32, with markedly better quality)
  3. For conversational applications: 1536d is sufficient
  4. For technical/specialized domains: Higher dimensions provide outsized benefits
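The decision tree above can be encoded as a small helper (a hypothetical illustration; the function name and the "priority" labels are ours, not part of the release):

```python
def recommend_config(priority: str, domain: str = "general") -> str:
    """Map the recommendations above to a model/index choice (illustrative only)."""
    if priority == "quality":
        return "Zen-Embed-XL (7680d) + HNSW index"
    if priority == "storage":
        return "Zen-Embed-XL (7680d) + INT8 quantization"
    if domain == "conversational":
        return "Zen-Embed-M (1536d)"
    return "Zen-Embed-XL (7680d)"  # specialized domains benefit most from capacity
```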

Release

We’re releasing the Zen-Embed family:

  • Zen-Embed-S (768d): Free, MIT license
  • Zen-Embed-M (1536d): Free, MIT license
  • Zen-Embed-L (3072d): Free, MIT license
  • Zen-Embed-XL (7680d): Free, MIT license

All models available on Hugging Face: huggingface.co/zoo-labs

What This Means

The embedding dimension race isn’t over. There’s room to improve retrieval quality by increasing capacity. As hardware improves and indexing methods advance, higher-dimensional embeddings become increasingly practical.

More dimensions, better retrieval. Sometimes the simple approach works.


Zach Kelling is a co-founder of Zoo Labs Foundation.