What is a Residual?
PolarQuant compresses a vector x and reconstructs it as x̂. The reconstruction is good, but not perfect. The difference between the original and the reconstruction is called the residual: r = x − x̂.
It's not a single number like MSE. It's a full vector — the leftover information that quantization failed to capture at every coordinate.
The notebook found that after 2-bit PolarQuant, the residual carries 56.5% of the original norm. More than half the information survives in the residual. Naturally, the question is — can we compress and store the residual too?
Notice — as you increase bits, the residual gets smaller (less leftover). With 1-bit, the residual is enormous. With 4-bit, it's tiny. This directly shows you how much information each bit-width captures.
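A minimal sketch of this effect. It uses a plain uniform quantizer as a stand-in for PolarQuant (so the exact percentages differ from the notebook's 56.5%), but the trend is the same: more bits leave a smaller residual.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(128)     # a 128-dimensional vector, as in the article
x /= np.linalg.norm(x)           # unit norm, so ‖r‖ is also the fraction left over

def uniform_quantize(v, bits):
    """Round each coordinate to the nearest of 2**bits evenly spaced levels."""
    levels = 2 ** bits
    lo, hi = v.min(), v.max()
    step = (hi - lo) / (levels - 1)
    return lo + np.round((v - lo) / step) * step

for bits in (1, 2, 4):
    r = x - uniform_quantize(x, bits)         # the residual at this bit-width
    print(bits, "bits -> residual norm", round(float(np.linalg.norm(r)), 3))
```

The 1-bit residual norm is large, the 4-bit one small, mirroring the visualization above.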
QJL: Compressing the Residual with 1 Bit
The residual is a 128-dimensional float vector. To compress it with just 1 bit per coordinate, you can only store two values: +1 or −1.
So QJL stores only the sign of each residual coordinate — plus one global norm value:
```
# What gets stored:
residual   = [+0.31, -0.08, +0.19, -0.02, ...]
signs      = [   +1,    -1,    +1,    -1, ...]   ← 1 bit each
‖residual‖ = 0.38                                ← 1 float (32 bits)

# Reconstruction:
r_hat = scale_factor × ‖r‖ × Sᵀ @ signs
      = 0.0098 × 0.38 × Sᵀ @ signs
      ≈ [+0.09, -0.09, +0.09, -0.09, ...]        ← same magnitude everywhere!
```
The result: large residual coordinates are under-corrected, small ones are over-corrected. This random mismatch is exactly the variance QJL introduces.
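A small sketch of this mismatch, using the residual coordinates from the snippet above. It uses the simplest sign-plus-norm reconstruction (the real QJL also routes through a random sketch matrix S, but the magnitude behavior is the same): every reconstructed coordinate gets an identical magnitude, so large coordinates are under-corrected and small ones over-corrected.

```python
import numpy as np

r = np.array([0.31, -0.08, 0.19, -0.02])   # first residual coordinates from above
signs = np.sign(r)                          # 1 bit per coordinate
norm = np.linalg.norm(r)                    # the single stored float

# Sign + norm reconstruction: one shared magnitude for every coordinate.
r_hat = (norm / np.sqrt(len(r))) * signs

print("r_hat:", r_hat.round(3))             # identical magnitude everywhere
print("error:", (r - r_hat).round(3))       # big coords undershot, small ones overshot
```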
Bias vs Variance — The Archer Analogy
QJL reduces bias — the average error across many vectors. But it increases variance — the spread of errors for individual coordinates. Understanding why this matters requires understanding these two concepts.
The Archer Analogy
Imagine two archers shooting at a bullseye. The bullseye is the true attention score.
Archer A (PolarQuant only): All arrows land slightly to the left — consistent error. But they're clustered together. Softmax sees similar errors everywhere — stays stable.
Archer B (PolarQuant + QJL): Arrows are centered on average — but scattered randomly. One arrow might fly far to the right. Softmax exponentially amplifies that one outlier.
For attention: low variance beats low bias. A consistent small error is far less damaging than a random large one — because softmax is exponential, not linear.
Why Softmax Amplifies Variance
Softmax converts attention scores s₁, …, sₙ into probabilities. The formula is exponential:

softmax(sᵢ) = exp(sᵢ) / Σⱼ exp(sⱼ)
This means a small increase in one score causes a disproportionately large increase in its probability weight — at the expense of all others.
Drag the slider below to add noise to the top attention score and watch what happens to the weights:
Notice how even a small noise of +0.2 on the top score dramatically changes the weight distribution. This is the danger of high variance — softmax turns small random errors into large attention shifts.
The turbo4 Resurrection
This is one of the most interesting stories in the TurboQuant project. The original turbo4 was broken by 7 implementation bugs (perplexity = 679, completely unusable). After fixing the bugs:
| Version | Strategy | Centroids | PPL | Status |
|---|---|---|---|---|
| Original turbo4 (buggy) | PQ3 + QJL1 | 8 | 679 | Broken |
| turbo4 (bugs fixed) | PQ3 + QJL1 | 8 | 6.19 | Working |
| turbo4 (QJL disabled) | PQ3 only | 8 | 6.18 | QJL was hurting! |
| New turbo4 | PQ4 only | 16 | 6.125 | Best ✓ |
The fix journey revealed something fundamental: spending all 4 bits on PolarQuant centroids (16 centroids) beats spending 3 bits on PolarQuant + 1 bit on QJL residual (8 centroids + signs).
Why More Centroids Wins
Going from 8 → 16 centroids directly halves the slot width: twice as many centroids cover the same value range, so each slot is half as wide, and the worst-case rounding error per coordinate drops by half.
And critically — this smaller error is consistent across every coordinate. No random noise. No variance explosion through softmax.
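The arithmetic behind the halving, sketched for a fixed illustrative range of [−1, 1] with uniform slots (the real PolarQuant places Lloyd-Max centroids on N(0, 1/d), but the scaling argument is the same): slot width is range / n_centroids, and worst-case error is half a slot.

```python
# Uniform slots over an illustrative [-1, 1] range.
for n in (8, 16):
    width = 2.0 / n          # slot width = range / number of centroids
    max_err = width / 2      # worst case: a value lands at a slot boundary
    print(f"{n:2d} centroids -> slot width {width}, max error {max_err}")
```

Doubling the centroid count halves both numbers, deterministically, for every coordinate.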
The Lesson
Three independent research groups (scos-lab, Arclabs/YATQ, AmesianX) all confirmed the same finding:
For attention workloads — more centroids always beats error correction. Spend every bit on PolarQuant. Never split bits between PolarQuant and QJL.
| Approach | Bits | Centroids | QJL? | Attention Quality |
|---|---|---|---|---|
| turbo3 | 3 | 4 (PQ2) | Yes | Good enough |
| Old turbo4 | 4 | 8 (PQ3) | Yes | Worse (variance) |
| New turbo4 | 4 | 16 (PQ4) | No | Best ✓ |
QJL's mathematical guarantee — unbiased inner products — doesn't translate to better attention quality. Theory said it should help. Practice said it hurts. The exponential nature of softmax is the reason.
The Complete Chain — All 4 Modules
Module 1: Naive quantization fails → outliers stretch range → slots wasted
Module 2: Rotate first (WHT + random signs) → outliers spread → range collapses → coords follow N(0, 1/d)
Module 3: Exploit N(0, 1/d) → Lloyd-Max optimal centroids → PolarQuant = 7× compression, 0.94 cosine
Module 4: More centroids > error correction → QJL adds variance → softmax amplifies it → new turbo4 = pure 4-bit PolarQuant → beats q4_0