Module 4 · TurboQuant

The QJL Lesson

Why a clever residual correction made things worse — and what that taught us about softmax

§ 01

What is a Residual?

PolarQuant compresses a vector x and reconstructs it as x̂. The reconstruction is good — but not perfect. The difference between the original and the reconstruction is called the residual:

residual = x − x̂

It's not a single number like MSE. It's a full vector — the leftover information that quantization failed to capture at every coordinate.

Think of PolarQuant as a photographer who takes a photo with a slightly blurry lens. The photo (x̂) is close but not perfect. The residual is the blur — the exact difference between the original scene and the photo.

The notebook found that after 2-bit PolarQuant, the residual carries 56.5% of the original norm. More than half the information survives in the residual. Naturally, the question is — can we compress and store the residual too?
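To make the definition concrete, here is a minimal NumPy sketch. The uniform 4-level quantizer is a hypothetical stand-in for PolarQuant's real 2-bit codebook; the point is only the subtraction:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128
x = rng.standard_normal(d) / np.sqrt(d)    # coords roughly N(0, 1/d), as after rotation

# Stand-in 2-bit quantizer: 4 uniform levels over the observed range.
levels = np.linspace(x.min(), x.max(), 4)
x_hat = levels[np.abs(x[:, None] - levels[None, :]).argmin(axis=1)]

residual = x - x_hat                        # the leftover vector, one entry per coordinate
ratio = np.linalg.norm(residual) / np.linalg.norm(x)
print(f"residual norm / original norm: {ratio:.3f}")
```

The 56.5% figure from the notebook is specific to 2-bit PolarQuant; this toy quantizer gives a different ratio, but the shape of the computation is identical.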

[Interactive: Residual = Original − Reconstructed — drag the bits slider to watch Residual/Original Norm, MSE, and Info Captured update.]

Notice — as you increase bits, the residual gets smaller (less leftover). With 1-bit, the residual is enormous. With 4-bit, it's tiny. This directly shows you how much information each bit-width captures.

§ 02

QJL: Compressing the Residual with 1 Bit

The residual is a 128-dimensional float vector. To compress it at just 1 bit per coordinate, each coordinate can take only two values: +1 or −1.

So QJL stores only the sign of each residual coordinate — plus one global norm value:

# What gets stored:
residual  = [+0.31, -0.08, +0.19, -0.02, ...]
signs     = [  +1,   -1,   +1,   -1, ...]  ← 1 bit each
‖residual‖ = 0.38                              ← 1 float (32 bits)

# Reconstruction:
r_hat = scale_factor × ‖r‖ × Sᵀ @ signs
      = 0.0098 × 0.38 × Sᵀ @ signs
      ≈ [+0.09, -0.09, +0.09, -0.09, ...]  ← same magnitude everywhere!
The critical flaw: all 128 coordinates get the same reconstructed magnitude (~0.09), regardless of whether the original residual coordinate was 0.31 or 0.02. Magnitudes are lost.

The result: large residual coordinates are under-corrected, small ones are over-corrected. This random mismatch is exactly the variance QJL introduces.
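A runnable sketch of why the magnitudes vanish. This simplifies the real QJL (the random projection S is omitted and the scale is just ‖r‖/√d), but the uniform-magnitude reconstruction is the same:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 128
r = rng.standard_normal(d) * 0.05           # a toy residual vector

# Stored: one sign bit per coordinate + one float for the norm.
signs = np.sign(r)
norm = np.linalg.norm(r)

# Reconstructed: every coordinate gets the identical magnitude norm / sqrt(d).
r_hat = (norm / np.sqrt(d)) * signs

print("distinct reconstructed magnitudes:", np.unique(np.abs(r_hat)).size)
cos = (r @ r_hat) / (np.linalg.norm(r) * np.linalg.norm(r_hat))
print(f"cosine(r, r_hat) = {cos:.3f}")
```

For Gaussian-ish residuals the cosine lands near √(2/π) ≈ 0.8: the signs point the right way, but every magnitude is wrong.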

[Interactive: QJL sign quantization — what gets stored (128 sign bits + one 32-bit norm float = 160 bits) vs what gets lost (the magnitudes), with cosine similarity.]

§ 03

Bias vs Variance — The Archer Analogy

QJL reduces bias — the average error across many vectors. But it increases variance — the spread of errors on individual coordinates. To see why that trade-off matters, we need both concepts.

The Archer Analogy

Imagine two archers shooting at a bullseye. The bullseye is the true attention score.

[Interactive: Archer analogy — bias vs variance, with a "For Attention" view.]

Archer A (PolarQuant only): All arrows land slightly to the left — consistent error. But they're clustered together. Softmax sees similar errors everywhere — stays stable.

Archer B (PolarQuant + QJL): Arrows are centered on average — but scattered randomly. One arrow might fly far to the right. Softmax exponentially amplifies that one outlier.

⚡ Key Insight

For attention: low variance beats low bias. A consistent small error is far less damaging than a random large one — because softmax is exponential, not linear.
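The archers can be simulated. In the sketch below (toy scores and noise levels of my choosing), Archer A's consistent error is modeled as the same offset on every score, which softmax cancels exactly because it is shift-invariant, while Archer B's zero-mean per-score noise scatters the weights:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

rng = np.random.default_rng(2)
scores = np.array([3.0, 2.0, 1.0, 0.0])
true_w = softmax(scores)

# Archer A: consistent error — the same offset on every score.
# Softmax is shift-invariant, so the weights barely move.
bias_err = np.abs(softmax(scores - 0.1) - true_w).sum()

# Archer B: zero-mean random error per score — unbiased but high variance.
var_errs = [np.abs(softmax(scores + rng.normal(0, 0.5, 4)) - true_w).sum()
            for _ in range(1000)]

print(f"consistent-bias weight error:    {bias_err:.6f}")
print(f"mean high-variance weight error: {np.mean(var_errs):.3f}")
```

Real PolarQuant error is not literally a uniform offset, but error that is similar across scores largely cancels inside softmax; independent per-score noise does not. That is the trade-off the archers illustrate.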

§ 04

Why Softmax Amplifies Variance

Softmax converts attention scores into probabilities. The formula is exponential:

softmax(sᵢ) = exp(sᵢ) / Σ exp(sⱼ)

This means a small increase in one score causes a disproportionately large increase in its probability weight — at the expense of all others.

Drag the slider below to add noise to the top attention score and watch what happens to the weights:

[Interactive: Softmax amplification — add noise to the top score and watch Top Token Weight, Weight Shift, and Attention Status react.]

Notice how even a small noise of +0.2 on the top score dramatically changes the weight distribution. This is the danger of high variance — softmax turns small random errors into large attention shifts.
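The slider's behavior is easy to reproduce. A small sketch with made-up scores; the +0.2 case mirrors the observation above:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

scores = np.array([2.0, 1.5, 1.0, 0.5])

for noise in (0.0, 0.2, 0.5, 1.0):
    bumped = scores.copy()
    bumped[0] += noise                      # noise lands on the top score only
    w = softmax(bumped)
    print(f"noise +{noise:.1f} -> top token weight {w[0]:.3f}")
```

The top weight grows much faster than the noise itself, and every other token's weight shrinks to pay for it.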

QJL's random per-coordinate noise → some scores randomly too high → softmax exponentially amplifies them → wrong output. Even though average error (bias) is lower.
§ 05

The turbo4 Resurrection

This is one of the most interesting stories in the TurboQuant project. The original turbo4 was broken by 7 implementation bugs (perplexity = 679, completely unusable). After fixing the bugs:

Version                 | Strategy  | Centroids | PPL   | Status
------------------------|-----------|-----------|-------|------------------
Original turbo4 (buggy) | PQ3 + QJL | 8         | 679   | Broken
turbo4 (bugs fixed)     | PQ3 + QJL | 8         | 6.19  | Working
turbo4 (QJL disabled)   | PQ3 only  | 8         | 6.18  | QJL was hurting!
New turbo4              | PQ4 only  | 16        | 6.125 | Best ✓

The fix journey revealed something fundamental: spending all 4 bits on PolarQuant centroids (16 centroids) beats spending 3 bits on PolarQuant + 1 bit on QJL residual (8 centroids + signs).

Why More Centroids Wins

Going from 8 → 16 centroids directly halves the slot width:

slot width = range / (n_levels − 1)

 8 centroids: range / 7  → larger gaps
16 centroids: range / 15 → smaller gaps → smaller error

And critically — this smaller error is consistent across every coordinate. No random noise. No variance explosion through softmax.
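A quick numeric check of this claim, using uniform levels as a stand-in for the Lloyd-Max codebook:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.standard_normal(10_000)
lo, hi = x.min(), x.max()

results = {}
for n_levels in (8, 16):
    width = (hi - lo) / (n_levels - 1)      # slot width = range / (n_levels − 1)
    levels = np.linspace(lo, hi, n_levels)
    x_hat = levels[np.abs(x[:, None] - levels[None, :]).argmin(axis=1)]
    results[n_levels] = np.mean((x - x_hat) ** 2)
    print(f"{n_levels:2d} centroids: slot width {width:.3f}, MSE {results[n_levels]:.5f}")
```

Halving the slot width cuts the MSE by roughly 4× (uniform-quantizer error scales like width²/12) — and, unlike QJL's correction, the improvement is deterministic per coordinate.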

[Interactive: Centroids vs attention quality — vary the centroid count to see Slot Width and Attention Quality.]

§ 06

The Lesson

Three independent research groups (scos-lab, Arclabs/YATQ, AmesianX) all confirmed the same finding:

📐 The Rule

For attention workloads — more centroids always beats error correction. Spend every bit on PolarQuant. Never split bits between PolarQuant and QJL.

Approach   | Bits | Centroids | QJL? | Attention Quality
-----------|------|-----------|------|------------------
turbo3     | 3    | 4 (PQ2)   | Yes  | Good enough
Old turbo4 | 4    | 8 (PQ3)   | Yes  | Worse (variance)
New turbo4 | 4    | 16 (PQ4)  | No   | Best ✓

QJL's mathematical guarantee — unbiased inner products — doesn't translate to better attention quality. Theory said it should help. Practice said it hurts. The exponential nature of softmax is the reason.

The Complete Chain — All 4 Modules

Module 1: Naive quantization fails
          → outliers stretch range → slots wasted

Module 2: Rotate first (WHT + random signs)
          → outliers spread → range collapses
          → coords follow N(0, 1/d)

Module 3: Exploit N(0,1/d)
          → Lloyd-Max optimal centroids
          → PolarQuant = 7× compression, 0.94 cosine

Module 4: More centroids > error correction
          → QJL adds variance → softmax amplifies it
          → new turbo4 = pure 4-bit PolarQuant → beats q4_0