Zen Reranker: Two-Stage Retrieval Done Right — Zen LM Blog

Embedding-based retrieval is fast but imprecise. Cross-encoder reranking is precise but slow. The combination unlocks the best of both. Today we release the Zen Reranker, purpose-built for two-stage retrieval.

Two-Stage Retrieval

Modern retrieval pipelines typically operate in two stages:


    Query -> [Embedding Retrieval] -> Top-K Candidates -> [Reranker] -> Final Results
             (fast, approximate)                          (slow, precise)

Stage 1 : Bi-encoder embeddings enable fast approximate search over millions of documents and return the top 100 to 1,000 candidates.

Stage 2 : A cross-encoder reranker scores each candidate against the query with full attention across both texts, then reorders the candidates to produce the final top 10 or top 20.

The reranker sees query-document pairs together, enabling much finer relevance distinctions than independent embeddings.
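The flow above can be sketched with toy stand-ins for both stages. Everything in this block is hypothetical: `cheap_score` imitates a fast first-stage retriever using plain token overlap, and `precise_score` stands in for a cross-encoder. Neither is the Zen API; the point is only the shape of the pipeline.

```python
from typing import List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase whitespace tokenization (a deliberate simplification)."""
    return text.lower().split()


def cheap_score(query: str, doc: str) -> float:
    """Stage-1 stand-in: number of distinct tokens shared by query and document."""
    return float(len(set(tokenize(query)) & set(tokenize(doc))))


def precise_score(query: str, doc: str) -> float:
    """Stage-2 stand-in: average of query-token coverage and match density in the document."""
    q = set(tokenize(query))
    d = tokenize(doc)
    if not q or not d:
        return 0.0
    coverage = len(q & set(d)) / len(q)
    density = sum(1 for t in d if t in q) / len(d)
    return 0.5 * coverage + 0.5 * density


def two_stage_search(
    query: str, corpus: List[str], first_k: int, final_k: int
) -> List[Tuple[str, float]]:
    # Stage 1: keep the first_k documents by the cheap score.
    candidates = sorted(corpus, key=lambda d: cheap_score(query, d), reverse=True)[:first_k]
    # Stage 2: rescore only those candidates with the expensive scorer.
    rescored = [(d, precise_score(query, d)) for d in candidates]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored[:final_k]


corpus = [
    "climate change is caused by greenhouse gases",
    "the weather today is sunny",
    "change your password when the climate of fraud changes",
    "stock prices rose",
]
results = two_stage_search("what causes climate change", corpus, first_k=3, final_k=2)
```

The expensive scorer runs on `first_k` documents rather than the whole corpus, which is where the latency savings of the two-stage design come from.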

The Zen Reranker

Our reranker builds on several key design decisions:

Architecture

Base : 330M parameter encoder-only transformer
Input : Concatenated query and document with separator tokens
Output : Single relevance score (0-1)
Context : 512 tokens (query + document combined)
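To make the input format concrete, here is a minimal sketch of how a query-document pair might be packed into the 512-token budget. The BERT-style `[CLS]`/`[SEP]` markers, the pre-tokenized inputs, and the policy of truncating only the document are all assumptions for illustration; the actual tokenizer and separator tokens are not specified here.

```python
from typing import List

MAX_TOKENS = 512  # combined query + document budget


def build_pair_input(
    query_tokens: List[str], doc_tokens: List[str], max_tokens: int = MAX_TOKENS
) -> List[str]:
    """Pack a pair as [CLS] query [SEP] document [SEP], truncating the document if needed."""
    special = 3  # one [CLS] and two [SEP] markers (assumed layout)
    budget = max_tokens - special - len(query_tokens)
    if budget < 0:
        raise ValueError("query alone exceeds the context budget")
    return ["[CLS]", *query_tokens, "[SEP]", *doc_tokens[:budget], "[SEP]"]


query_tokens = "what causes climate change".split()
doc_tokens = ["tok"] * 1000  # a document far longer than the budget
pair = build_pair_input(query_tokens, doc_tokens)
```

Because the budget is shared, long documents lose their tail under this policy; splitting documents into passages before reranking is a common way around that.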

Training Data

We curated training data from multiple sources:

MS MARCO : Search query-passage pairs (positive and hard negatives)
Natural Questions : Question-answer pairs from Wikipedia
Synthetic pairs : LLM-generated query-document pairs with labels
Human judgments : 50K expert-annotated pairs across domains

Total: 12M training pairs with 4-way relevance labels.

Training Objective

We use a listwise loss that considers the full ranking:


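    import numpy as np

    def listwise_loss(scores, labels):
        # Log-softmax over candidate scores, shifted by the max for numerical stability
        shifted = np.asarray(scores, dtype=float) - np.max(scores)
        log_probs = shifted - np.log(np.sum(np.exp(shifted)))
        # Cross-entropy against the label distribution (labels should sum to 1)
        return -np.sum(np.asarray(labels, dtype=float) * log_probs)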

This outperforms pointwise binary classification by teaching the model to rank, not just classify.
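A small worked example shows how this kind of objective rewards correct ordering. The sketch below assumes the graded labels are normalized to sum to 1 before serving as the target distribution; how the 4-way labels are actually converted is not stated in this post, so treat that step as an assumption.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over one query's candidate scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def listwise_loss(scores: List[float], labels: List[float]) -> float:
    """Cross-entropy between normalized labels and softmax(scores)."""
    total = sum(labels)
    if total == 0:
        raise ValueError("at least one candidate must have a positive label")
    target = [label / total for label in labels]  # assumed normalization
    probs = softmax(scores)
    return -sum(t * math.log(p) for t, p in zip(target, probs))


labels = [3, 0, 2, 1]                 # graded relevance for four candidates
good_scores = [2.0, -1.0, 1.0, 0.0]   # ordering agrees with the labels
bad_scores = [-1.0, 2.0, 0.0, 1.0]    # irrelevant candidate scored highest

loss_good = listwise_loss(good_scores, labels)
loss_bad = listwise_loss(bad_scores, labels)
```

Here `loss_good` comes out lower than `loss_bad`: the loss depends on how probability mass is distributed across the whole candidate list, not on each score in isolation.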

Benchmark Results

BEIR Reranking

Reranking BM25 top-100 results:

Reranker	NDCG@10	Time (ms/query)
No reranking	42.1	-
monoT5-base	49.3	180
MiniLM-reranker	47.8	45
Zen Reranker	52.7	62
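For readers less familiar with the metric in the table above, NDCG@10 rewards placing highly relevant documents near the top of the list. The sketch below uses the linear-gain variant and computes the ideal ranking from the returned list only; benchmark tooling may use a different gain convention and the full set of judged documents, so this is an illustration rather than the evaluation code behind these numbers.

```python
import math
from typing import List


def dcg_at_k(relevances: List[float], k: int) -> float:
    """Discounted cumulative gain: relevance divided by log2(position + 1)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: List[float], k: int = 10) -> float:
    """DCG normalized by the DCG of the best possible ordering of the same list."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    if ideal == 0:
        return 0.0
    return dcg_at_k(relevances, k) / ideal


# Relevance grades of the returned documents, in ranked order.
before_rerank = [0, 2, 0, 3, 1]
after_rerank = [3, 2, 1, 0, 0]

ndcg_before = ndcg_at_k(before_rerank)
ndcg_after = ndcg_at_k(after_rerank)
```

Reranking does not change which documents are in the candidate set, only their order, which is exactly what NDCG measures.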

Reranking Zen Embeddings

Combined with our 7680-dimensional embeddings:

Pipeline	NDCG@10	Latency
Zen-Embed only	57.3	8ms
Zen-Embed + Reranker	64.1	70ms

The two-stage pipeline achieves a 12% relative improvement in NDCG@10 (57.3 to 64.1) while latency rises from 8ms to 70ms, an acceptable overhead.

Domain-Specific Performance

Domain	BM25	+Zen Reranker	Improvement
Scientific (SCIDOCS)	15.8	21.4	+35%
Finance (FiQA)	29.6	38.2	+29%
Covid (TREC-COVID)	65.5	78.3	+20%
Quora duplicate	78.9	84.6	+7%

Specialized domains with complex language benefit most.
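The improvement figures in these tables are relative gains over the baseline, not absolute point differences. The arithmetic can be checked directly from the reported scores; the helper name below is just for illustration.

```python
def relative_improvement(before: float, after: float) -> float:
    """Relative gain in percent: (after - before) / before * 100."""
    return (after - before) / before * 100.0


# (baseline, reranked) scores copied from the tables above.
rows = {
    "SCIDOCS": (15.8, 21.4),
    "FiQA": (29.6, 38.2),
    "TREC-COVID": (65.5, 78.3),
    "Quora": (78.9, 84.6),
    "Zen-Embed pipeline": (57.3, 64.1),
}

gains = {name: round(relative_improvement(b, a)) for name, (b, a) in rows.items()}
```

Rounded to whole percentages, these reproduce the reported +35%, +29%, +20%, +7%, and +12%.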

Usage

Basic Usage


    from zen.reranker import ZenReranker
    
    reranker = ZenReranker.from_pretrained("zoo-labs/zen-reranker")
    
    query = "What causes climate change?"
    documents = [
        "Greenhouse gases trap heat in the atmosphere...",
        "The weather today is sunny and warm...",
        "CO2 emissions from burning fossil fuels..."
    ]
    
    scores = reranker.score(query, documents)
    # [0.92, 0.12, 0.87]

Integration with Retrievers


    from zen.retrieval import ZenRetriever
    from zen.reranker import ZenReranker
    
    retriever = ZenRetriever("zoo-labs/zen-embed-xl")
    reranker = ZenReranker.from_pretrained("zoo-labs/zen-reranker")
    
    query = "What causes climate change?"

    # Stage 1: Fast retrieval
    candidates = retriever.retrieve(query, k=100)
    
    # Stage 2: Precise reranking
    reranked = reranker.rerank(query, candidates, k=10)

Batched Inference

For production workloads:


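    # Batch queries for efficiency; candidate_lists[i] holds the candidates for queries[i]
    queries = ["query 1", "query 2"]  # extend with more queries as needed
    candidate_lists = [
        ["candidate doc 1a", "candidate doc 1b"],  # placeholder candidates for "query 1"
        ["candidate doc 2a", "candidate doc 2b"],  # placeholder candidates for "query 2"
    ]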
    
    results = reranker.batch_rerank(
        queries, 
        candidate_lists,
        batch_size=32,
        k=10
    )
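Conceptually, batched reranking flattens all query-document pairs, scores them in fixed-size chunks, and regroups the results per query. The sketch below shows that idea with a hypothetical `score_pairs` callable standing in for the model; it is not how `batch_rerank` is implemented internally, which this post does not describe.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]
Scorer = Callable[[List[Pair]], List[float]]


def batch_rerank_sketch(
    queries: List[str],
    candidate_lists: List[List[str]],
    score_pairs: Scorer,
    batch_size: int = 32,
    k: int = 10,
) -> List[List[Tuple[str, float]]]:
    # Flatten to (query index, query, document) so batches can span queries.
    flat = [
        (qi, q, d)
        for qi, (q, docs) in enumerate(zip(queries, candidate_lists))
        for d in docs
    ]
    scored: List[List[Tuple[str, float]]] = [[] for _ in queries]
    for start in range(0, len(flat), batch_size):
        chunk = flat[start:start + batch_size]
        scores = score_pairs([(q, d) for _, q, d in chunk])
        # Route each score back to the query it belongs to.
        for (qi, _, d), s in zip(chunk, scores):
            scored[qi].append((d, s))
    return [sorted(items, key=lambda p: p[1], reverse=True)[:k] for items in scored]


def overlap_scorer(pairs: List[Pair]) -> List[float]:
    """Toy scorer (assumption): count of shared whitespace tokens."""
    return [float(len(set(q.split()) & set(d.split()))) for q, d in pairs]


results = batch_rerank_sketch(
    ["red apple", "blue sky"],
    [["green apple", "red apple pie", "car"], ["blue sky today", "sky"]],
    overlap_scorer,
    batch_size=2,
    k=2,
)
```

Letting batches span query boundaries keeps every batch full, which matters when candidate lists have uneven lengths.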

Deployment Considerations

Hardware Requirements

Minimum : 4GB GPU memory
Recommended : 8GB+ for batched inference
CPU : Viable for low-throughput (<10 QPS)

Latency Optimization

Batching : Process multiple query-doc pairs together
Quantization : INT8 reduces latency by 40% with <1% quality loss
Early termination : Stop scoring the remaining candidates once the top-k is unlikely to change
Caching : Cache scores for repeated query-document pairs
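Of the techniques above, caching is the simplest to illustrate. The wrapper below is a minimal sketch: `CachedScorer` and its `score_pairs` argument are hypothetical names, the cache is an unbounded in-memory dict, and a production setup would likely want a size limit and invalidation when the model changes.

```python
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, str]


class CachedScorer:
    """Wraps a pair-scoring function and memoizes scores per (query, document)."""

    def __init__(self, score_pairs: Callable[[List[Pair]], List[float]]) -> None:
        self._score_pairs = score_pairs
        self._cache: Dict[Pair, float] = {}
        self.model_calls = 0  # number of pairs actually sent to the model

    def score(self, query: str, documents: List[str]) -> List[float]:
        # Deduplicate while preserving order, then keep only uncached documents.
        missing = [d for d in dict.fromkeys(documents) if (query, d) not in self._cache]
        if missing:
            scores = self._score_pairs([(query, d) for d in missing])
            self.model_calls += len(missing)
            for d, s in zip(missing, scores):
                self._cache[(query, d)] = s
        return [self._cache[(query, d)] for d in documents]


# Toy model (assumption): score equals document length.
scorer = CachedScorer(lambda pairs: [float(len(d)) for _, d in pairs])
first = scorer.score("q", ["aa", "b"])
second = scorer.score("q", ["b", "ccc"])  # "b" is served from the cache
```

After the second call only three pairs have reached the model, since the repeated pair was answered from the cache.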

Scaling

For high-throughput applications:

Deploy multiple replicas behind a load balancer
Use async inference with request queuing
Consider distilled smaller models for extreme latency requirements
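Async inference with request queuing usually means collecting concurrent requests into small batches before calling the model. The following is a rough sketch using Python's standard `asyncio`; the function names, `max_batch`, and `max_wait_s` are illustrative assumptions, and the model call is a synchronous stand-in rather than a real reranker.

```python
import asyncio
from typing import Callable, List, Tuple

Pair = Tuple[str, str]


async def batching_worker(
    queue: asyncio.Queue,
    score_pairs: Callable[[List[Pair]], List[float]],
    max_batch: int = 8,
    max_wait_s: float = 0.01,
) -> None:
    """Collect queued requests into batches, score each batch once, resolve futures."""
    loop = asyncio.get_running_loop()
    while True:
        batch = [await queue.get()]  # block until at least one request arrives
        deadline = loop.time() + max_wait_s
        while len(batch) < max_batch:
            timeout = deadline - loop.time()
            if timeout <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), timeout))
            except asyncio.TimeoutError:
                break
        scores = score_pairs([pair for pair, _ in batch])
        for (_, fut), s in zip(batch, scores):
            if not fut.done():  # the caller may have been cancelled meanwhile
                fut.set_result(s)


async def score_async(queue: asyncio.Queue, query: str, doc: str) -> float:
    """Enqueue one pair and wait for the worker to fill in its score."""
    fut = asyncio.get_running_loop().create_future()
    await queue.put(((query, doc), fut))
    return await fut


async def demo() -> Tuple[List[float], List[int]]:
    queue: asyncio.Queue = asyncio.Queue()
    batch_sizes: List[int] = []

    def fake_model(pairs: List[Pair]) -> List[float]:
        batch_sizes.append(len(pairs))
        return [float(len(d)) for _, d in pairs]  # toy score: document length

    worker = asyncio.create_task(batching_worker(queue, fake_model))
    scores = await asyncio.gather(*(score_async(queue, "q", "x" * n) for n in range(1, 6)))
    worker.cancel()
    return list(scores), batch_sizes


scores, batch_sizes = asyncio.run(demo())
```

The `max_wait_s` knob trades a small amount of added latency for fuller batches; a real deployment would also run the model call off the event loop so it does not block other requests.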

Model Release

The Zen Reranker is available under Apache 2.0:

Hugging Face : huggingface.co/zoo-labs/zen-reranker
ONNX : Optimized for deployment
TensorRT : NVIDIA-optimized variant

Conclusion

Two-stage retrieval with a quality reranker is the pragmatic choice for production search systems. The Zen Reranker provides state-of-the-art reranking in an efficient, easy-to-deploy package.

Fast first, then precise.

Zach Kelling is a co-founder of Zoo Labs Foundation.