Embedding-based retrieval is fast but imprecise. Cross-encoder reranking is precise but slow. Combining them gives the speed of the first with the precision of the second. Today we release the Zen Reranker, purpose-built for two-stage retrieval.

Two-Stage Retrieval

Modern retrieval pipelines typically operate in two stages:

Query -> [Embedding Retrieval] -> Top-K Candidates -> [Reranker] -> Final Results
         (fast, approximate)                          (slow, precise)

Stage 1: Bi-encoder embeddings enable fast approximate search over millions of documents. This stage retrieves the top 100 to 1,000 candidates.

Stage 2: A cross-encoder reranker scores each candidate against the query with full attention. This stage reorders the candidates to produce the final top 10 or top 20.

The reranker sees query-document pairs together, enabling much finer relevance distinctions than independent embeddings.
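To make the two stages concrete, here is a minimal, self-contained sketch. The scoring functions are toy stand-ins based on word overlap, not the Zen models, and the names `embed`, `cross_score`, and `two_stage_search` are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter

def embed(text):
    # Toy stand-in for a bi-encoder: a bag-of-words count vector.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def cross_score(query, doc):
    # Toy stand-in for a cross-encoder: it sees query and document together.
    # Here: the fraction of query terms that appear in the document.
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / len(q_terms) if q_terms else 0.0

def two_stage_search(query, docs, doc_vecs, k_retrieve=100, k_final=10):
    # Stage 1: compare the query vector with precomputed document vectors.
    q_vec = embed(query)
    order = sorted(range(len(docs)), key=lambda i: cosine(q_vec, doc_vecs[i]), reverse=True)
    candidates = order[:k_retrieve]
    # Stage 2: rescore only the shortlisted candidates with the pairwise scorer.
    reranked = sorted(candidates, key=lambda i: cross_score(query, docs[i]), reverse=True)
    return [docs[i] for i in reranked[:k_final]]

docs = [
    "greenhouse gases trap heat in the atmosphere",
    "the weather today is sunny and warm",
    "burning fossil fuels releases greenhouse gases",
]
doc_vecs = [embed(d) for d in docs]  # computed once, offline
results = two_stage_search("greenhouse gases heat", docs, doc_vecs, k_retrieve=2, k_final=1)
```

The point of the structure is that the expensive pairwise scorer runs on `k_retrieve` documents rather than on the whole corpus, while document vectors for stage 1 are computed ahead of time.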

The Zen Reranker

Our reranker builds on several key design decisions:

Architecture

  • Base: 330M parameter encoder-only transformer
  • Input: Concatenated query and document with separator tokens
  • Output: Single relevance score (0-1)
  • Context: 512 tokens (query + document combined)
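As an illustration of the input format, the sketch below packs a query and a document into a single sequence within the 512-token budget. Whitespace splitting stands in for a real subword tokenizer, and the `[CLS]` and `[SEP]` token names follow BERT-style conventions; both are assumptions, since the actual Zen tokenizer and special tokens are not specified here. `build_input` is a hypothetical helper.

```python
MAX_TOKENS = 512  # combined budget for query + document

def build_input(query, document, max_tokens=MAX_TOKENS):
    # Whitespace splitting stands in for a real subword tokenizer (assumption).
    q_tokens = query.split()
    d_tokens = document.split()
    # Reserve three slots for the special tokens: [CLS], [SEP], [SEP].
    budget = max_tokens - 3 - len(q_tokens)
    if budget < 0:
        raise ValueError("query alone exceeds the context window")
    # Truncate the document, never the query, so the query stays fully visible.
    d_tokens = d_tokens[:budget]
    return ["[CLS]"] + q_tokens + ["[SEP]"] + d_tokens + ["[SEP]"]

tokens = build_input("what causes climate change", "word " * 1000)
```

Because the budget is shared, long documents are cut to whatever remains after the query. Documents much longer than the window would need to be split into passages before reranking.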

Training Data

We curated training data from multiple sources:

  1. MS MARCO: Search query-passage pairs (positive and hard negatives)
  2. Natural Questions: Question-answer pairs from Wikipedia
  3. Synthetic pairs: LLM-generated query-document pairs with labels
  4. Human judgments: 50K expert-annotated pairs across domains

Total: 12M training pairs with 4-way relevance labels.

Training Objective

We use a listwise loss that considers the full ranking:

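import numpy as np

def listwise_loss(scores, labels):
    # Log-softmax over candidate scores, shifted by the max for numerical stability
    shifted = np.asarray(scores, dtype=float) - np.max(scores)
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    # Cross-entropy against the label distribution over candidates
    return -np.sum(np.asarray(labels, dtype=float) * log_probs)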

This outperforms pointwise binary classification by teaching the model to rank, not just classify.
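A small worked example may help show what the listwise objective rewards. The sketch below assumes the 4-way labels are graded 0 to 3 and are normalized into a target distribution. That conversion is an assumption for illustration, not a description of the actual training setup, and `label_distribution` and `listwise_ce` are hypothetical names.

```python
import math

def label_distribution(grades):
    # Assumption: graded labels (0-3) are normalized into a probability distribution.
    total = sum(grades)
    return [g / total for g in grades]

def listwise_ce(scores, target):
    # Cross-entropy between softmax(scores) and the target distribution,
    # computed with the log-sum-exp trick for numerical stability.
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return -sum(t * (s - log_z) for s, t in zip(scores, target))

grades = [3, 0, 2, 1]                # relevance grades for four candidates
target = label_distribution(grades)  # most of the mass on the grade-3 candidate

good = listwise_ce([2.0, -1.0, 1.0, 0.0], target)  # scores agree with the labels
bad = listwise_ce([-1.0, 2.0, 0.0, 1.0], target)   # scores invert the ranking
```

Both score lists contain the same values, only in a different order, so the gap between `good` and `bad` comes entirely from how well the ordering matches the labels.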

Benchmark Results

BEIR Reranking

Reranking BM25 top-100 results:

Reranker          NDCG@10   Time (ms/query)
No reranking      42.1      -
monoT5-base       49.3      180
MiniLM-reranker   47.8      45
Zen Reranker      52.7      62
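For readers unfamiliar with the metric, here is a minimal sketch of NDCG@k. It uses linear gain (the relevance label itself). Some implementations use an exponential gain of 2^rel - 1 instead, and which variant underlies the numbers above is not stated here. Benchmark tables conventionally report the value multiplied by 100.

```python
import math

def dcg_at_k(relevances, k=10):
    # Discounted cumulative gain: the gain at rank i (1-based) is divided by log2(i + 1).
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(relevances, k=10):
    # Normalize by the DCG of the ideal (descending) ordering of the same labels.
    # Simplification: a full evaluation builds the ideal ordering from all judged
    # documents for the query, not only the ones that were returned.
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Relevance labels of the returned documents, in ranked order.
before = ndcg_at_k([0, 1, 0, 3, 2])  # relevant documents buried lower in the list
after = ndcg_at_k([3, 2, 1, 0, 0])   # same documents, ideally ordered
```

Reranking does not change which documents are in the candidate list, only their order, which is exactly what this metric measures.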

Reranking Zen Embeddings

Combined with our 7680d embeddings:

Pipeline               NDCG@10   Latency
Zen-Embed only         57.3      8ms
Zen-Embed + Reranker   64.1      70ms

The two-stage pipeline improves NDCG@10 by 12% relative (57.3 to 64.1, or +6.8 points) while adding 62ms of latency per query, an overhead we consider acceptable.

Domain-Specific Performance

Domain                 BM25   +Zen Reranker   Improvement
Scientific (SCIDOCS)   15.8   21.4            +35%
Finance (FiQA)         29.6   38.2            +29%
Covid (TREC-COVID)     65.5   78.3            +20%
Quora duplicate        78.9   84.6            +7%

Specialized domains with complex language benefit most.
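The Improvement column is relative to the BM25 baseline, not an absolute point difference. The short sketch below reproduces it from the two score columns. `relative_improvement` is a hypothetical helper name.

```python
def relative_improvement(baseline, improved):
    # Percentage change relative to the baseline score.
    return (improved - baseline) / baseline * 100

# (BM25 NDCG@10, BM25 + Zen Reranker NDCG@10), taken from the table above.
domains = {
    "Scientific (SCIDOCS)": (15.8, 21.4),
    "Finance (FiQA)": (29.6, 38.2),
    "Covid (TREC-COVID)": (65.5, 78.3),
    "Quora duplicate": (78.9, 84.6),
}

gains = {name: round(relative_improvement(b, r)) for name, (b, r) in domains.items()}
# gains == {"Scientific (SCIDOCS)": 35, "Finance (FiQA)": 29,
#           "Covid (TREC-COVID)": 20, "Quora duplicate": 7}
```

Note that a low baseline inflates relative gains: SCIDOCS gains 5.6 points in absolute terms while TREC-COVID gains 12.8, yet SCIDOCS shows the larger percentage.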

Usage

Basic Usage

from zen.reranker import ZenReranker

reranker = ZenReranker.from_pretrained("zoo-labs/zen-reranker")

query = "What causes climate change?"
documents = [
    "Greenhouse gases trap heat in the atmosphere...",
    "The weather today is sunny and warm...",
    "CO2 emissions from burning fossil fuels..."
]

scores = reranker.score(query, documents)
# [0.92, 0.12, 0.87]
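Assuming `score` returns one value per document in input order (as the example output suggests), turning scores into a ranking is a plain sort. The sketch below hard-codes the example scores so it runs without the model.

```python
documents = [
    "Greenhouse gases trap heat in the atmosphere...",
    "The weather today is sunny and warm...",
    "CO2 emissions from burning fossil fuels...",
]
scores = [0.92, 0.12, 0.87]  # example values, hard-coded for illustration

# Pair each document with its score, then sort by score, highest first.
ranked = sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)
top_document, top_score = ranked[0]
```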

Integration with Retrievers

from zen.retrieval import ZenRetriever
from zen.reranker import ZenReranker

retriever = ZenRetriever("zoo-labs/zen-embed-xl")
reranker = ZenReranker.from_pretrained("zoo-labs/zen-reranker")

query = "What causes climate change?"

# Stage 1: Fast retrieval
candidates = retriever.retrieve(query, k=100)

# Stage 2: Precise reranking
reranked = reranker.rerank(query, candidates, k=10)

Batched Inference

For production workloads:

# Batch queries for efficiency
# Placeholders: one candidate list per query, in the same order as `queries`
queries = ["query 1", "query 2", ...]
candidate_lists = [[docs...], [docs...], ...]

results = reranker.batch_rerank(
    queries, 
    candidate_lists,
    batch_size=32,
    k=10
)
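To show why batching helps, the sketch below illustrates one way such a call could work internally: flatten all query-document pairs, score them in fixed-size batches that may span several queries, then regroup per query. This is an assumption about the general technique, not the actual `batch_rerank` implementation. `batch_rerank_sketch` and `toy_score_pairs` are hypothetical, and the scorer is a word-overlap stand-in for the model.

```python
def batched(items, batch_size):
    # Yield consecutive slices of at most batch_size items.
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def batch_rerank_sketch(queries, candidate_lists, score_pairs, batch_size=32, k=10):
    # Flatten to (query index, query, document) so batches can span queries.
    flat = [
        (qi, q, d)
        for qi, (q, docs) in enumerate(zip(queries, candidate_lists))
        for d in docs
    ]
    scores = []
    for batch in batched(flat, batch_size):
        scores.extend(score_pairs([(q, d) for _, q, d in batch]))
    # Regroup scores by query, then keep the top-k documents for each.
    grouped = [[] for _ in queries]
    for (qi, _, d), s in zip(flat, scores):
        grouped[qi].append((d, s))
    return [sorted(g, key=lambda p: p[1], reverse=True)[:k] for g in grouped]

def toy_score_pairs(pairs):
    # Hypothetical scorer: number of words shared by query and document.
    return [float(len(set(q.split()) & set(d.split()))) for q, d in pairs]

results = batch_rerank_sketch(
    ["red apple", "blue sky"],
    [["green apple", "red apple pie"], ["blue sky today", "grey cloud"]],
    toy_score_pairs,
    batch_size=3,
    k=1,
)
```

Filling each batch across query boundaries keeps the accelerator busy even when individual queries have few candidates.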

Deployment Considerations

Hardware Requirements

  • Minimum: 4GB GPU memory
  • Recommended: 8GB+ for batched inference
  • CPU: Viable for low-throughput (<10 QPS)
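A back-of-the-envelope estimate puts these numbers in context. The sketch below computes the memory taken by the weights alone for a 330M-parameter model at common precisions. Activations, framework overhead, and batch size all add to this, so it is a lower bound rather than a sizing guide.

```python
PARAMS = 330_000_000  # parameter count from the Architecture section

BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

def weight_memory_gb(params, dtype):
    # Weights only; excludes activations, buffers, and runtime overhead.
    return params * BYTES_PER_PARAM[dtype] / 1e9

footprint = {dtype: weight_memory_gb(PARAMS, dtype) for dtype in BYTES_PER_PARAM}
# fp32: 1.32 GB, fp16: 0.66 GB, int8: 0.33 GB
```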

Latency Optimization

  1. Batching: Process multiple query-doc pairs together
  2. Quantization: INT8 reduces latency by 40% with <1% quality loss
  3. Early termination: Stop scoring once the top-k is confidently determined
  4. Caching: Cache scores for repeated query-document pairs
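Of the techniques above, caching is the simplest to sketch. The wrapper below keeps an LRU cache keyed on (query, document) and only calls the underlying scorer for pairs it has not seen. `ScoreCache` and `toy_score` are hypothetical, and the scorer is a word-overlap stand-in for a real model call.

```python
from collections import OrderedDict

class ScoreCache:
    """LRU cache in front of a scorer with signature score_fn(query, docs) -> list of floats."""

    def __init__(self, score_fn, max_size=10_000):
        self.score_fn = score_fn
        self.max_size = max_size
        self._cache = OrderedDict()
        self.misses = 0

    def score(self, query, documents):
        # Score only the distinct documents that are not cached for this query.
        missing = [d for d in dict.fromkeys(documents) if (query, d) not in self._cache]
        fresh = dict(zip(missing, self.score_fn(query, missing))) if missing else {}
        self.misses += len(missing)
        results = []
        for d in documents:
            key = (query, d)
            if d in fresh:
                self._cache[key] = fresh[d]
            self._cache.move_to_end(key)  # mark as most recently used
            results.append(self._cache[key])
        # Evict the least recently used entries once over capacity.
        while len(self._cache) > self.max_size:
            self._cache.popitem(last=False)
        return results

def toy_score(query, docs):
    # Hypothetical scorer: number of words shared by query and document.
    return [float(len(set(query.split()) & set(d.split()))) for d in docs]

cache = ScoreCache(toy_score)
first = cache.score("red apple", ["red apple pie", "green pear"])
second = cache.score("red apple", ["green pear", "red apple pie", "apple tart"])
```

The second call only scores "apple tart", since the other two pairs are already cached. Caching pays off when the same queries recur against a stable document set, and cached scores should be invalidated when documents or the model change.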

Scaling

For high-throughput applications:

  • Deploy multiple replicas behind a load balancer
  • Use async inference with request queuing
  • Consider distilled smaller models for extreme latency requirements
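As a rough illustration of async inference with request queuing, the sketch below uses `asyncio` to collect concurrent requests into micro-batches before scoring them. All names (`batch_worker`, `score_async`, `toy_score_pairs`) are hypothetical, and the scorer is a synchronous word-overlap stand-in. A real deployment would run the model call off the event loop, for example in an executor.

```python
import asyncio

async def batch_worker(queue, score_pairs, max_batch=32, max_wait=0.01):
    # Pull requests off the queue and score them together in micro-batches.
    loop = asyncio.get_running_loop()
    while True:
        batch = [await queue.get()]  # block until at least one request arrives
        deadline = loop.time() + max_wait
        while len(batch) < max_batch:
            timeout = deadline - loop.time()
            if timeout <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), timeout))
            except asyncio.TimeoutError:
                break
        # Stand-in for the model call; it blocks the event loop in this sketch.
        scores = score_pairs([pair for pair, _ in batch])
        for (_, future), s in zip(batch, scores):
            future.set_result(s)

async def score_async(queue, query, doc):
    # Enqueue one (query, document) pair and wait for its score.
    future = asyncio.get_running_loop().create_future()
    await queue.put(((query, doc), future))
    return await future

def toy_score_pairs(pairs):
    # Hypothetical scorer: number of words shared by query and document.
    return [float(len(set(q.split()) & set(d.split()))) for q, d in pairs]

async def main():
    queue = asyncio.Queue()
    worker = asyncio.create_task(batch_worker(queue, toy_score_pairs))
    scores = await asyncio.gather(
        score_async(queue, "red apple", "red apple pie"),
        score_async(queue, "red apple", "green pear"),
    )
    worker.cancel()
    return scores

scores = asyncio.run(main())
```

The `max_wait` parameter trades latency for throughput: a longer wait fills larger batches but delays the first request in each batch.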

Model Release

The Zen Reranker is available under Apache 2.0:

  • Hugging Face: huggingface.co/zoo-labs/zen-reranker
  • ONNX: Optimized for deployment
  • TensorRT: NVIDIA-optimized variant

Conclusion

Two-stage retrieval with a quality reranker is the pragmatic choice for production search systems. The Zen Reranker provides state-of-the-art reranking in an efficient, easy-to-deploy package.

Fast first, then precise.


Zach Kelling is a co-founder of Zoo Labs Foundation.