MSE Is Not Equal to Recall - The TurboQuant Paradox

TurboQuant (ICLR 2026) introduces a compelling idea: use Lloyd-Max optimal scalar quantization after random rotation to compress vectors. The centroids are provably optimal for minimizing mean squared error on the post-rotation coordinate distribution. No training data required. Zero indexing time.

There's just one problem. Lower MSE doesn't mean better search.

The Paradox in One Table

We benchmarked TurboQuant against RaBitQ on 768-dimensional clustered synthetic data (50,000 vectors, 100 queries, L2 distance, Recall@10 vs brute-force ground truth):

Method             Recall@10   MSE/vector   Compression
RaBitQ 4-bit       0.387       43,406       7.6x
TurboQuant 4-bit   0.245       24,529       7.8x

TurboQuant's reconstruction error is 44% lower — and its recall is 37% worse.

This isn't a marginal effect. At the standard 4-bit operating point (~8x compression), RaBitQ retrieves the correct nearest neighbors 1.58x more often than TurboQuant, despite being objectively worse at reconstructing the original vectors.

Two Problems That Look Like One

The confusion arises because "compress a vector" sounds like a single problem. It isn't. There are two distinct objectives hiding under the same umbrella:

Reconstruction: find the compressed representation x̃ that minimizes E[‖x - x̃‖²]. TurboQuant's Lloyd-Max codebook is provably optimal for this on the post-rotation Gaussian coordinate distribution. This is the problem that MSE measures.

Distance ranking: estimate d(q, x) accurately enough to preserve the true neighbor ordering. This depends on rank correlation between estimated and true distances — not on how faithfully you can reconstruct x.

RaBitQ solves the ranking problem. TurboQuant solves the reconstruction problem. They just happen to both call themselves "vector quantization."

How the Distance Estimators Differ

RaBitQ stores per-vector calibration factors (dp_multiplier, norm) that produce distance estimates with near-perfect rank correlation to true distances. On our benchmark, RaBitQ achieves Spearman ρ ≈ 0.99 — meaning its distance estimates almost never swap the ordering of two candidates.

TurboQuant uses Asymmetric Distance Computation (ADC): rotate the query, build a per-coordinate distance table against the 16 Lloyd-Max centroids, then sum table lookups across all 768 coordinates. Even with a per-vector correction factor (which the implementation does include), estimation variance accumulates across dimensions. The result: Spearman ρ ≈ 0.87.
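The ADC loop described above can be sketched in a few lines of NumPy. This is an illustration of the technique, not the benchmarked implementation; the function and variable names are ours, and a 4-level (2-bit) codebook stands in for the 16 Lloyd-Max centroids:

```python
import numpy as np

def adc_distance(q_rot, codes, centroids):
    """Asymmetric Distance Computation (sketch).

    q_rot     : rotated query, shape (d,)
    codes     : per-coordinate centroid indices for one vector, shape (d,)
    centroids : the k scalar quantization levels, shape (k,)

    Builds a per-coordinate table of squared differences against every
    centroid, then sums one table lookup per coordinate.
    """
    # table[i, j] = (q_rot[i] - centroids[j])^2, shape (d, k)
    table = (q_rot[:, None] - centroids[None, :]) ** 2
    # one lookup per coordinate, summed into an L2 distance estimate
    return table[np.arange(len(codes)), codes].sum()

# toy example: 4 coordinates, 4 levels (2-bit for brevity; the post uses 16)
rng = np.random.default_rng(0)
centroids = np.array([-1.5, -0.5, 0.5, 1.5])
q_rot = rng.standard_normal(4)
codes = np.array([0, 2, 1, 3])
est = adc_distance(q_rot, codes, centroids)
```

In practice the table is built once per query and reused across all database vectors, which is what makes ADC cheap; the per-coordinate errors it sums are where the variance accumulates.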

That 0.12-point gap in rank correlation is the entire recall difference.

The Hybrid Experiment: Proving It's the Codebook

A natural question: if TurboQuant has a better codebook (lower MSE) and RaBitQ has a better estimator (higher ρ), can we combine them?

We built a hybrid quantizer — Lloyd-Max centroids with RaBitQ-style calibrated distance estimation:

dist = ‖q‖² + norm_x² - 2 · norm_x · dp_multiplier · Σᵢ q_rot[i] · centroids[codes[i]]
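A NumPy sketch of this calibrated estimator, expanding ‖q − x‖² = ‖q‖² + ‖x‖² − 2⟨q, x⟩ and correcting the table-lookup inner product with the per-vector factor (names are illustrative, not the implementation's):

```python
import numpy as np

def calibrated_l2(q_rot, codes, centroids, norm_x, dp_multiplier):
    """RaBitQ-style calibrated L2 estimate over scalar-quantized codes.

    The inner product is estimated from centroid lookups, then rescaled by
    the stored per-vector norm and correction factor (dp_multiplier).
    """
    ip_est = np.dot(q_rot, centroids[codes])  # Σ q_rot[i] · centroids[codes[i]]
    return np.dot(q_rot, q_rot) + norm_x**2 - 2.0 * norm_x * dp_multiplier * ip_est

# toy usage with a hypothetical 2-bit codebook
rng = np.random.default_rng(0)
centroids = np.array([-1.5, -0.5, 0.5, 1.5])
q_rot = rng.standard_normal(4)
codes = np.array([2, 0, 3, 1])
d_est = calibrated_l2(q_rot, codes, centroids, norm_x=1.0, dp_multiplier=1.1)
```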

The result:

Method                                  Recall@10   Spearman ρ
RaBitQ 4-bit                            0.387       ~0.99
Hybrid (Lloyd-Max + RaBitQ estimator)   0.245       0.873
TurboQuant MSE 4-bit                    0.245       0.873

The hybrid is mathematically identical to TurboQuant. The existing TurboQuant implementation already uses a calibrated estimator (its ip_factor field is dp_multiplier by another name). Distance values between hybrid and TurboQuant differ by at most 1.56×10⁻² — floating-point noise.

This rules out the estimator formula as the culprit. The gap comes from the codebook itself.

Why MSE-Optimal Codebooks Hurt Search

The explanation lies in cross-coordinate error cancellation. When you estimate an inner product by summing 768 per-coordinate approximations, the accuracy of the sum depends not just on the per-coordinate error magnitudes (which MSE captures) but on how those errors correlate with each other across coordinates (which MSE does not capture).

Lloyd-Max distributes centroid density to minimize expected squared error per coordinate. But this density allocation does not minimize the variance of the sum of estimation errors. The cross-coordinate cancellation properties of the codebook are a global property — one that per-coordinate MSE is blind to.

RaBitQ's grid codebook with per-vector rescaling achieves tighter concentration of the inner product estimator, even though each individual coordinate is reconstructed less faithfully. The errors cancel better in aggregate.
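A toy simulation makes the blind spot concrete. These are synthetic error processes, not the two codebooks themselves: both have the same per-coordinate MSE, but one has correlated errors across coordinates, and the variance of the summed error (what an inner-product estimate accumulates) diverges wildly:

```python
import numpy as np

rng = np.random.default_rng(42)
d, n = 768, 5_000  # dimensions, trials

# Two error processes with the SAME per-coordinate MSE ...
base = rng.standard_normal((n, d))
uncorrelated = base                                        # independent errors
shared = rng.standard_normal((n, 1))                       # a common component
correlated = np.sqrt(0.5) * base + np.sqrt(0.5) * shared   # pairwise corr ≈ 0.5

mse_u = (uncorrelated**2).mean()   # ≈ 1
mse_c = (correlated**2).mean()     # ≈ 1, indistinguishable by MSE

# ... but wildly different variance of the SUMMED error: cancellation
# helps the independent errors and fails for the correlated ones.
var_sum_u = uncorrelated.sum(axis=1).var()   # ≈ d
var_sum_c = correlated.sum(axis=1).var()     # ≈ d + 0.5·d·(d-1), hundreds of times larger
```

Per-coordinate MSE sees the two processes as identical; the summed-error variance, which is what actually perturbs a 768-term inner product, does not.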

MSE-optimality and distance-estimation-optimality are not merely different objectives — they are adversarial at multi-bit rates.

QJL Makes It Worse

TurboQuant's second stage (QJL) adds a 1-bit sign quantization of the projected residual to produce an unbiased inner product estimator. For search, this is catastrophic:

Config                                        Recall@10
TQ-MSE 4-bit (4 bits for codebook)            0.245
TQ-Full 4-bit (3 bits codebook + 1 bit QJL)   0.022

Giving one bit to QJL steals it from the scalar quantizer, degrading per-coordinate quantization far more than the unbiased estimator can compensate. This pattern holds at every bit budget tested.

QJL was designed for KV cache attention computation, where unbiased inner product estimation directly improves output quality. For L2 nearest neighbor search, it's the wrong tool.

Where TurboQuant Actually Wins

This isn't a "TurboQuant is bad" story. It's a "TurboQuant is aimed at the wrong target for search" story. In domains where reconstruction fidelity is the objective, TurboQuant delivers:

KV cache compression for LLMs. On a Qwen3-32B model (Q4_K_M weights, RTX 4090), TurboQuant's turbo3 KV cache format extends the context ceiling from 16K tokens (FP16) to 80K tokens — a 5x increase — with only +1.9% perplexity overhead (8.00 → 8.15 on Wikitext-2). With Q3_K_M model weights, it pushes to 131K context at 43.7 tokens/second decode speed, matching FP16 throughput.

Token embedding compression for MaxSim reranking. On MultiHop-RAG, compressing 6.8M token embeddings from float32 to 4-bit TurboQuant (3.78x compression, 16.5 GB → 4.4 GB) costs only 1.0 point of Hits@4 (72.2% → 71.2%). The quantized MaxSim still beats the best non-MaxSim method by 1.8 points. The aggregation across query tokens absorbs per-comparison quantization noise.
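A minimal MaxSim sketch shows why the aggregation is forgiving. This is an illustration with synthetic embeddings and additive noise standing in for quantization error, not the MultiHop-RAG pipeline:

```python
import numpy as np

def maxsim(Q, D):
    """Late-interaction MaxSim score: for each query token, take the best
    match over document tokens, then sum over query tokens.

    Q : query token embeddings, shape (nq, dim)
    D : document token embeddings, shape (nd, dim)
    """
    return (Q @ D.T).max(axis=1).sum()

rng = np.random.default_rng(0)
Q = rng.standard_normal((32, 128))              # 32 query tokens
D = rng.standard_normal((300, 128))             # 300 document tokens
noise = 0.05 * rng.standard_normal(D.shape)     # stand-in for quantization error

exact = maxsim(Q, D)
quantized = maxsim(Q, D + noise)
# summing a max over 300 candidates for each of 32 query tokens keeps the
# relative score error small even though every comparison is perturbed
```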

Extreme compression. At 1-bit (~30x compression), TurboQuant beats RaBitQ (0.055 vs 0.036 Recall@10, 1.53x higher). The Lloyd-Max centroid (±0.0288) is a better reconstruction point than RaBitQ's sign centroid (±1/√768 = ±0.0361) when only two levels are available.
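The ±0.0288 figure is consistent with the closed-form 1-bit Lloyd-Max quantizer for a Gaussian: for a unit-norm vector after random rotation, each coordinate is roughly N(0, 1/d), and the optimal two-level codebook places centroids at ±σ·√(2/π), the conditional mean E[X | X > 0]. A quick check (our derivation, assuming this is how the figure was obtained):

```python
import numpy as np

d = 768
sigma = 1.0 / np.sqrt(d)                       # per-coordinate std of a unit vector
lloyd_max_1bit = sigma * np.sqrt(2.0 / np.pi)  # E[X | X > 0] for N(0, σ²)
rabitq_sign = sigma                            # RaBitQ's ±1/√d sign centroid

print(round(lloyd_max_1bit, 4))  # 0.0288
print(round(rabitq_sign, 4))     # 0.0361
```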

Implications for Practitioners

If you're building ANN search: use RaBitQ. It ships in FAISS 1.13.2, supports IVF composition (0.881 Recall@10 with IVF256 at 4-bit), and has FastScan SIMD paths. Don't be seduced by TurboQuant's lower MSE numbers — they measure the wrong thing.

If you're compressing KV caches: use TurboQuant. The reconstruction fidelity is exactly what attention computation needs, and the 5x context extension at <2% quality cost is a genuine breakthrough for memory-constrained inference.

If you're doing late-interaction retrieval (ColBERT, MaxSim): TurboQuant is viable for token storage compression. The aggregation across query tokens smooths out per-comparison errors. You trade 3.78x storage savings for about 1 point of retrieval quality.

Don't use MSE to evaluate quantization for search. This is the meta-lesson. Rank correlation (Spearman ρ) or recall@k against ground truth are the right metrics. A method with higher MSE can — and empirically does — produce better search results.
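Both recommended metrics are a few lines of NumPy. A dependency-free sketch (Spearman ρ computed as the Pearson correlation of rank vectors, ignoring ties, which is fine for continuous distance estimates):

```python
import numpy as np

def spearman_rho(a, b):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    ra -= ra.mean()
    rb -= rb.mean()
    return float((ra @ rb) / np.sqrt((ra @ ra) * (rb @ rb)))

def recall_at_k(est_dists, true_dists, k):
    """Fraction of the true k nearest neighbors recovered by the estimator."""
    est_top = set(np.argsort(est_dists)[:k])
    true_top = set(np.argsort(true_dists)[:k])
    return len(est_top & true_top) / k

# toy usage: a lightly perturbed estimator keeps rank correlation high
rng = np.random.default_rng(1)
true_d = rng.random(1000)
noisy_d = true_d + 0.01 * rng.standard_normal(1000)
rho = spearman_rho(noisy_d, true_d)
r10 = recall_at_k(noisy_d, true_d, 10)
```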

Methodology Notes

Our TurboQuant implementation is Python/NumPy following the algorithm specification from three community repos. RaBitQ uses FAISS 1.13.2's native C++ implementation with AVX2/AVX512 SIMD. Search speed comparisons are invalid due to this implementation asymmetry — but recall and distortion metrics are directly comparable (same data, same evaluation). The synthetic data (768d, 64 Gaussian clusters) may not capture all properties of real embedding distributions; results on production embeddings could differ.

All code and data are available at turboquant_experiments. The experimental report with full tables is in results/turboquant-rabitq-experimental-report.md.


This post summarizes independent experimental work benchmarking TurboQuant (ICLR 2026) against RaBitQ (SIGMOD 2024/2025) for vector search. TurboQuant's core contribution — Lloyd-Max quantization after random rotation — is a genuine advance for reconstruction-sensitive applications. The finding that this degrades search ranking quality relative to RaBitQ's estimator-optimized approach corroborates concerns raised by the RaBitQ authors (Gao and Long, NTU) about TurboQuant's search claims.


This article was scaffolded with backblog.
