Module 4 · TurboQuant

The QJL Lesson

Why a clever residual correction made things worse — and what that taught us about softmax

§ 01

What is a Residual?

PolarQuant compresses a vector x and reconstructs it as x̂. The reconstruction is good — but not perfect. The difference between the original and the reconstruction is called the residual:

residual = x − x̂

It's not a single number like MSE. It's a full vector — the leftover information that quantization failed to capture at every coordinate.

Think of PolarQuant as a photographer who takes a photo with a slightly blurry lens. The photo (x̂) is close but not perfect. The residual is the blur — the exact difference between the original scene and the photo.

The notebook found that after 2-bit PolarQuant, the residual carries 56.5% of the original norm. More than half the information survives in the residual. Naturally, the question is — can we compress and store the residual too?
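To make the definition concrete, here is a minimal NumPy sketch. The uniform 4-level quantizer is a hypothetical stand-in for PolarQuant's real 2-bit codebook; the point is only the subtraction:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128
x = rng.standard_normal(d) / np.sqrt(d)    # coords roughly N(0, 1/d), as after rotation

# Stand-in 2-bit quantizer: 4 uniform levels over the observed range.
levels = np.linspace(x.min(), x.max(), 4)
x_hat = levels[np.abs(x[:, None] - levels[None, :]).argmin(axis=1)]

residual = x - x_hat                        # the leftover vector, one entry per coordinate
ratio = np.linalg.norm(residual) / np.linalg.norm(x)
print(f"residual norm / original norm: {ratio:.3f}")
```

The 56.5% figure from the notebook is specific to 2-bit PolarQuant; this toy quantizer gives a different ratio, but the shape of the computation is identical.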

[Interactive: Residual = Original − Reconstructed — drag the bits slider to watch Residual/Original Norm, MSE, and Info Captured update.]

Notice — as you increase bits, the residual gets smaller (less leftover). With 1-bit, the residual is enormous. With 4-bit, it's tiny. This directly shows you how much information each bit-width captures.

§ 02

QJL: Compressing the Residual with 1 Bit

The residual is a 128-dimensional float vector. To compress it at just 1 bit per coordinate, each coordinate can take only two values: +1 or −1.

So QJL stores only the sign of each residual coordinate — plus one global norm value:

# What gets stored:
residual  = [+0.31, -0.08, +0.19, -0.02, ...]
signs     = [  +1,   -1,   +1,   -1, ...]  ← 1 bit each
‖residual‖ = 0.38                              ← 1 float (32 bits)

# Reconstruction:
r_hat = scale_factor × ‖r‖ × Sᵀ @ signs
      = 0.0098 × 0.38 × Sᵀ @ signs
      ≈ [+0.09, -0.09, +0.09, -0.09, ...]  ← same magnitude everywhere!
The critical flaw: all 128 coordinates get the same reconstructed magnitude (~0.09), regardless of whether the original residual coordinate was 0.31 or 0.02. Magnitudes are lost.

The result: large residual coordinates are under-corrected, small ones are over-corrected. This random mismatch is exactly the variance QJL introduces.
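A runnable sketch of why the magnitudes vanish. This simplifies the real QJL (the random projection S is omitted and the scale is just ‖r‖/√d), but the uniform-magnitude reconstruction is the same:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 128
r = rng.standard_normal(d) * 0.05           # a toy residual vector

# Stored: one sign bit per coordinate + one float for the norm.
signs = np.sign(r)
norm = np.linalg.norm(r)

# Reconstructed: every coordinate gets the identical magnitude norm / sqrt(d).
r_hat = (norm / np.sqrt(d)) * signs

print("distinct reconstructed magnitudes:", np.unique(np.abs(r_hat)).size)
cos = (r @ r_hat) / (np.linalg.norm(r) * np.linalg.norm(r_hat))
print(f"cosine(r, r_hat) = {cos:.3f}")
```

For Gaussian-ish residuals the cosine lands near √(2/π) ≈ 0.8: the signs point the right way, but every magnitude is wrong.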

[Interactive: QJL sign quantization — what gets stored (128 sign bits + one 32-bit norm float = 160 bits) vs what gets lost (the magnitudes), with cosine similarity.]

§ 03

Bias vs Variance — The Archer Analogy

QJL reduces bias — the average error across many vectors. But it increases variance — the spread of errors on individual coordinates. To see why that trade-off matters, we need both concepts.

The Archer Analogy

Imagine two archers shooting at a bullseye. The bullseye is the true attention score.

[Interactive: Archer analogy — bias vs variance, with a "For Attention" view.]

Archer A (PolarQuant only): All arrows land slightly to the left — consistent error. But they're clustered together. Softmax sees similar errors everywhere — stays stable.

Archer B (PolarQuant + QJL): Arrows are centered on average — but scattered randomly. One arrow might fly far to the right. Softmax exponentially amplifies that one outlier.

⚡ Key Insight

For attention: low variance beats low bias. A consistent small error is far less damaging than a random large one — because softmax is exponential, not linear.
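The archers can be simulated. In the sketch below (toy scores and noise levels of my choosing), Archer A's consistent error is modeled as the same offset on every score, which softmax cancels exactly because it is shift-invariant, while Archer B's zero-mean per-score noise scatters the weights:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

rng = np.random.default_rng(2)
scores = np.array([3.0, 2.0, 1.0, 0.0])
true_w = softmax(scores)

# Archer A: consistent error — the same offset on every score.
# Softmax is shift-invariant, so the weights barely move.
bias_err = np.abs(softmax(scores - 0.1) - true_w).sum()

# Archer B: zero-mean random error per score — unbiased but high variance.
var_errs = [np.abs(softmax(scores + rng.normal(0, 0.5, 4)) - true_w).sum()
            for _ in range(1000)]

print(f"consistent-bias weight error:    {bias_err:.6f}")
print(f"mean high-variance weight error: {np.mean(var_errs):.3f}")
```

Real PolarQuant error is not literally a uniform offset, but error that is similar across scores largely cancels inside softmax; independent per-score noise does not. That is the trade-off the archers illustrate.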

§ 04

Why Softmax Amplifies Variance

Softmax converts attention scores into probabilities. The formula is exponential:

softmax(sᵢ) = exp(sᵢ) / Σ exp(sⱼ)

This means a small increase in one score causes a disproportionately large increase in its probability weight — at the expense of all others.

Drag the slider below to add noise to the top attention score and watch what happens to the weights:

[Interactive: Softmax amplification — add noise to the top score and watch Top Token Weight, Weight Shift, and Attention Status react.]

Notice how even a small noise of +0.2 on the top score dramatically changes the weight distribution. This is the danger of high variance — softmax turns small random errors into large attention shifts.
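The slider's behavior is easy to reproduce. A small sketch with made-up scores; the +0.2 case mirrors the observation above:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

scores = np.array([2.0, 1.5, 1.0, 0.5])

for noise in (0.0, 0.2, 0.5, 1.0):
    bumped = scores.copy()
    bumped[0] += noise                      # noise lands on the top score only
    w = softmax(bumped)
    print(f"noise +{noise:.1f} -> top token weight {w[0]:.3f}")
```

The top weight grows much faster than the noise itself, and every other token's weight shrinks to pay for it.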

QJL's random per-coordinate noise → some scores randomly too high → softmax exponentially amplifies them → wrong output. Even though average error (bias) is lower.
§ 05

The turbo4 Resurrection

This is one of the most interesting stories in the TurboQuant project. The original turbo4 was broken by 7 implementation bugs (perplexity = 679, completely unusable). After fixing the bugs:

Version                 | Strategy  | Centroids | PPL   | Status
------------------------|-----------|-----------|-------|------------------
Original turbo4 (buggy) | PQ3 + QJL | 8         | 679   | Broken
turbo4 (bugs fixed)     | PQ3 + QJL | 8         | 6.19  | Working
turbo4 (QJL disabled)   | PQ3 only  | 8         | 6.18  | QJL was hurting!
New turbo4              | PQ4 only  | 16        | 6.125 | Best ✓

The fix journey revealed something fundamental: spending all 4 bits on PolarQuant centroids (16 centroids) beats spending 3 bits on PolarQuant + 1 bit on QJL residual (8 centroids + signs).

Why More Centroids Wins

Going from 8 → 16 centroids directly halves the slot width:

slot width = range / (n_levels − 1)

 8 centroids: range / 7  → larger gaps
16 centroids: range / 15 → smaller gaps → smaller error

And critically — this smaller error is consistent across every coordinate. No random noise. No variance explosion through softmax.
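A quick numeric check of this claim, using uniform levels as a stand-in for the Lloyd-Max codebook:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.standard_normal(10_000)
lo, hi = x.min(), x.max()

results = {}
for n_levels in (8, 16):
    width = (hi - lo) / (n_levels - 1)      # slot width = range / (n_levels − 1)
    levels = np.linspace(lo, hi, n_levels)
    x_hat = levels[np.abs(x[:, None] - levels[None, :]).argmin(axis=1)]
    results[n_levels] = np.mean((x - x_hat) ** 2)
    print(f"{n_levels:2d} centroids: slot width {width:.3f}, MSE {results[n_levels]:.5f}")
```

Halving the slot width cuts the MSE by roughly 4× (uniform-quantizer error scales like width²/12) — and, unlike QJL's correction, the improvement is deterministic per coordinate.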

[Interactive: Centroids vs attention quality — vary the centroid count to see Slot Width and Attention Quality.]

§ 06

The Lesson

Three independent research groups (scos-lab, Arclabs/YATQ, AmesianX) all confirmed the same finding:

📐 The Rule

For attention workloads — more centroids always beats error correction. Spend every bit on PolarQuant. Never split bits between PolarQuant and QJL.

Approach   | Bits | Centroids | QJL? | Attention Quality
-----------|------|-----------|------|------------------
turbo3     | 3    | 4 (PQ2)   | Yes  | Good enough
Old turbo4 | 4    | 8 (PQ3)   | Yes  | Worse (variance)
New turbo4 | 4    | 16 (PQ4)  | No   | Best ✓

QJL's mathematical guarantee — unbiased inner products — doesn't translate to better attention quality. Theory said it should help. Practice said it hurts. The exponential nature of softmax is the reason.

The Complete Chain — All 4 Modules

Module 1: Naive quantization fails
          → outliers stretch range → slots wasted

Module 2: Rotate first (WHT + random signs)
          → outliers spread → range collapses
          → coords follow N(0, 1/d)

Module 3: Exploit N(0,1/d)
          → Lloyd-Max optimal centroids
          → PolarQuant = 7× compression, 0.94 cosine

Module 4: More centroids > error correction
          → QJL adds variance → softmax amplifies it
          → new turbo4 = pure 4-bit PolarQuant → beats q4_0