The Compression Problem
Modern language models accumulate enormous tables of numbers: the key and value vectors of the KV cache. Each vector has 128, 256, or even 4096 dimensions. Storing them at full precision (32-bit floats) is expensive. So we quantize — we round each number to the nearest slot on a discrete scale.
The simplest approach: divide the range of values into equally-spaced levels. With 4 bits you get 2⁴ = 16 levels. With 2 bits, just 4 levels.
The problem? Real KV-cache vectors are not well-behaved. A handful of coordinates are enormous. Most are near zero. This is the outlier problem.
The Outlier Problem — Interactive
Below is a simulated KV-cache vector with 64 coordinates. Five channels (in red) have values 20× larger than the rest. Use the slider to change the number of quantization bits and watch what happens.
Notice how the slot width scale = (max − min) / (n_levels − 1) is enormous because of the outliers. All the normal coordinates get collapsed into just a couple of slots. It's as if a few very tall students forced the whole class to share a stretched ruler.
Worst-case reconstruction error = 0.5 × scale = 0.5 × (max − min) / (n_levels − 1). Large range → large error. The outliers are not just annoying — they mathematically guarantee poor reconstruction for everyone else.
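The formulas above are easy to check directly. Here is a minimal NumPy sketch that simulates a 64-coordinate vector with five 20× outlier channels, mirroring the demo above, and quantizes it naively at 4 bits:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0, 1, 64)
x[:5] *= 20                       # five outlier channels, roughly 20x larger

bits = 4
n_levels = 2 ** bits              # 16 equally spaced slots
xmin = x.min()
scale = (x.max() - xmin) / (n_levels - 1)

q = np.round((x - xmin) / scale)  # slot index for each coordinate
x_hat = q * scale + xmin          # reconstruction

# Outliers stretch the range, so the 59 "normal" coordinates
# land in only a handful of distinct slots.
print("slot width:", scale)
print("distinct slots used by normal coords:", len(np.unique(q[5:])))
print("worst-case error 0.5 * scale:", 0.5 * scale)
```

Running it shows the normal coordinates crammed into a few slots: almost all of the 16 levels are spent covering empty space between the outliers.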
The Fix: Rotate First
What if we applied a transformation that spreads the outlier energy evenly across all coordinates — before quantizing?
A rotation matrix R does exactly this. It has three magical properties:
| Property | Math | What it means |
|---|---|---|
| Preserves length | ‖Rx‖ = ‖x‖ | No information destroyed |
| Preserves dot products | (Rx)·(Ry) = x·y | Attention scores unchanged |
| Fully reversible | x = Rᵀ(Rx) | Perfect reconstruction possible |
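All three properties can be verified numerically. A quick sketch, using QR decomposition of a Gaussian matrix to draw a random orthogonal matrix (an orthogonal matrix is a rotation, up to a possible reflection):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 128
# Draw a random orthogonal matrix: QR decomposition of a Gaussian matrix.
R, _ = np.linalg.qr(rng.normal(size=(d, d)))

x = rng.normal(size=d)
y = rng.normal(size=d)

print(np.allclose(np.linalg.norm(R @ x), np.linalg.norm(x)))  # length preserved
print(np.allclose((R @ x) @ (R @ y), x @ y))                  # dot products preserved
print(np.allclose(R.T @ (R @ x), x))                          # fully reversible
```

All three checks print `True`.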
Click Rotate to see what happens to our outlier vector after a random rotation:
After rotation, all coordinates have similar magnitudes. Now the quantization range is small, slot width is tiny, and every slot does useful work. Reconstruction error collapses.
The Exact Distribution After Rotation
Here's what makes this truly powerful. For a unit-norm vector rotated by a random rotation matrix, we don't just know that the values spread out — we know what distribution they follow. For large d, each coordinate follows, to an excellent approximation, a bell curve centered at zero with spread σ = 1/√d. For d = 128, that's σ ≈ 0.088.
This is extraordinary. No matter what vector you start with, after rotation, every coordinate follows the same predictable bell curve. We can verify this experimentally:
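As a non-interactive stand-in for that experiment, the sketch below rotates a worst-case one-hot vector by many independent random orthogonal matrices and compares the pooled spread of the coordinates against the predicted σ = 1/√d:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 128
x = np.zeros(d)
x[0] = 1.0                        # worst case: all energy in one coordinate

samples = []
for _ in range(200):
    # Fresh random orthogonal matrix via QR of a Gaussian matrix.
    R, _ = np.linalg.qr(rng.normal(size=(d, d)))
    samples.append(R @ x)
samples = np.concatenate(samples)

print("empirical std:", samples.std())
print("predicted 1/sqrt(d):", 1 / np.sqrt(d))
```

The empirical standard deviation lands within a fraction of a percent of 1/√128 ≈ 0.088, even though the input was maximally spiky.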
Naive quantization places slots uniformly — equal gaps everywhere. But if we know the distribution is N(0, 1/d), we can place more slots near zero (where most data lives) and fewer at the edges (where almost nothing is). Same number of bits. Much better precision. This is PolarQuant.
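The gain from distribution-aware slot placement can be sketched numerically. Below, the "aware" levels are the centroids of equal-probability bins — a simple stand-in for the truly optimal (Lloyd-Max) levels, not the exact PolarQuant construction — and `mse` is an illustrative helper:

```python
import numpy as np

rng = np.random.default_rng(3)
d, bits = 128, 3
n_levels = 2 ** bits                            # 8 slots
data = rng.normal(0, 1 / np.sqrt(d), 100_000)   # coordinates ~ N(0, 1/d)

# Uniform slots: equal gaps across the observed range.
uniform = np.linspace(data.min(), data.max(), n_levels)

# Distribution-aware slots: centroid of each equal-probability bin
# (a simple stand-in for the truly optimal Lloyd-Max levels).
edges = np.quantile(data, np.linspace(0, 1, n_levels + 1))
aware = np.array([data[(data >= lo) & (data <= hi)].mean()
                  for lo, hi in zip(edges[:-1], edges[1:])])

def mse(levels, x):
    # Snap every value to its nearest level and measure squared error.
    idx = np.abs(x[:, None] - levels[None, :]).argmin(axis=1)
    return ((x - levels[idx]) ** 2).mean()

print("uniform slots MSE:", mse(uniform, data))
print("aware slots MSE:  ", mse(aware, data))
```

Same 3 bits, same 8 levels — but the distribution-aware placement wastes far fewer slots on the nearly empty tails.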
The Walsh-Hadamard Transform: Fast Rotation
Dense random rotation requires O(d²) operations — a full matrix-vector multiply. For d = 128, that's 128² = 16,384 multiply-adds per vector. When you're processing millions of vectors in a language model, that's too slow. The Walsh-Hadamard Transform (WHT) achieves the same spreading effect using only additions and subtractions, in O(d log d): for d = 128, just 128 × log₂(128) = 896 operations.
The Butterfly Pattern
WHT works in stages. At each stage, pairs of elements are added and subtracted. The pattern looks like butterfly wings — hence the name.
Each butterfly operation takes two numbers a and b and produces:
a ──┬──→ a + b
✕
b ──┴──→ a - b
No multiplication. Just + and −. Step through the animation below to watch information propagate from one coordinate to all coordinates:
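The staged butterflies translate directly to code. Here is a simple O(d log d) sketch, normalized by 1/√d so the transform preserves length like a rotation; production kernels vectorize the inner loops, but the structure is the same:

```python
import numpy as np

def wht(x):
    """Fast Walsh-Hadamard transform via butterfly stages,
    normalized so the output has the same norm as the input."""
    y = x.astype(float).copy()
    h = 1
    while h < len(y):                           # log2(d) stages
        for i in range(0, len(y), 2 * h):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b   # one butterfly: only + and -
        h *= 2
    return y / np.sqrt(len(y))                  # normalize: preserves norm

# All the energy in one coordinate spreads evenly to every coordinate:
x = np.zeros(8)
x[0] = 8.0
print(wht(x))   # every output coordinate has the same magnitude
```

Note there is not a single multiplication inside the butterfly loop — only additions and subtractions, plus one final scaling.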
The Hadamard Matrix
If you write down the results of all possible butterfly patterns as a grid, you get the Hadamard matrix H — a matrix containing only +1 and −1. It's built recursively:
H₁ = [ +1 ]

H₂ = [ H₁  H₁ ]  =  [ +1 +1 ]
     [ H₁ -H₁ ]     [ +1 -1 ]

H₄ = [ H₂  H₂ ]  ...and so on
     [ H₂ -H₂ ]
Each size is two copies of the previous — one normal, one sign-flipped. This recursive doubling is why WHT achieves its O(d log d) efficiency.
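The recursion is a few lines of code; a sketch using NumPy block matrices:

```python
import numpy as np

def hadamard(n):
    """Build H_n recursively: two copies of H_{n/2}, one sign-flipped.
    n must be a power of two."""
    if n == 1:
        return np.array([[1]])
    h = hadamard(n // 2)
    return np.block([[h, h], [h, -h]])

H = hadamard(4)
print(H)
# Rows are mutually orthogonal: H @ H.T is a scaled identity.
print(np.array_equal(H @ H.T, 4 * np.eye(4)))
```

The orthogonality check is what makes H (after scaling by 1/√n) behave like a rotation.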
The Missing Ingredient: Random Signs
WHT alone has a problem: the Hadamard matrix H is fixed, so every vector gets the exact same transformation. This breaks the randomness guarantee we need for N(0, 1/d): H applied to a constant vector, for example, concentrates all the energy into a single spike, the opposite of spreading.
Think of a teacher photocopying students' marks in coarse shades of grey: one student's [10, 2, 1] and Priya's [9, 3, 1] both get photocopied to [dark, light, light]. The copies are identical. When the teacher tries to reconstruct the original marks, she can't: two different inputs collapsed into one output, and information was permanently lost.
Random signs prevent this collision. Before and after WHT, we multiply each coordinate by a random +1 or −1:
# Step 1: flip each coordinate by a fixed random sign
x = x * signs1   # signs1 is a random ±1 vector, e.g. [+1, -1, +1, +1, -1, ...]
# Step 2: WHT butterfly spread
y = WHT(x)
# Step 3: flip again with an independent random sign vector
y = y * signs2   # signs2 is drawn independently of signs1
Now two similar vectors, after being scrambled by different random signs, produce completely different outputs from WHT. Collisions like the one above are broken up, and the N(0, 1/d) guarantee holds for every input.
The Complete Pipeline
Now we can assemble the full picture. This is how TurboQuant/PolarQuant processes every KV-cache vector:
Input
Raw KV-cache vector with outlier channels. Max/Mean ratio ≈ 22×. Naive quantization fails here.
Step 1 — Random Sign Flip
Multiply each coordinate by random ±1. Breaks predictability, ensures no two vectors collide.
Step 2 — Walsh-Hadamard Transform
Butterfly spread across log₂(d) stages. Information propagates from every coordinate to every other. Only additions and subtractions. O(d log d).
Step 3 — Random Sign Flip Again
Second randomization pass. Together with Step 1, this guarantees the N(0, 1/d) distribution.
Step 4 — Optimal Quantization
Since we know coordinates follow N(0, 1/d), we place more slots near zero, fewer at the edges. Maximum information per bit.
Output
Compressed vector. To reconstruct: dequantize → inverse sign flip → inverse WHT → inverse sign flip.
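The whole pipeline fits in a short sketch. For clarity this version uses a dense normalized Hadamard multiply instead of the fast butterfly kernel, and a plain uniform quantizer in Step 4 rather than the distribution-aware slots, so the numbers are illustrative rather than the real method's:

```python
import numpy as np

rng = np.random.default_rng(4)
d, bits = 64, 4
n_levels = 2 ** bits

def hadamard(n):
    # Recursive Hadamard matrix (see the construction above).
    if n == 1:
        return np.array([[1.0]])
    h = hadamard(n // 2)
    return np.block([[h, h], [h, -h]])

def quantize(v):
    # Plain uniform quantizer: round each value to the nearest slot.
    lo, scale = v.min(), (v.max() - v.min()) / (n_levels - 1)
    return np.round((v - lo) / scale) * scale + lo

H = hadamard(d) / np.sqrt(d)           # normalized: H @ H.T = I
s1 = rng.choice([-1.0, 1.0], d)        # Step 1 signs (fixed, random)
s2 = rng.choice([-1.0, 1.0], d)        # Step 3 signs

x = rng.normal(0, 1, d)
x[:5] *= 20                            # raw vector with outlier channels

# Encode: sign flip -> WHT -> sign flip -> quantize
y_hat = quantize(s2 * (H @ (s1 * x)))
# Decode: dequantize -> inverse sign flip -> inverse WHT -> inverse sign flip
x_hat = s1 * (H.T @ (s2 * y_hat))

naive_mse = ((x - quantize(x)) ** 2).mean()
rotated_mse = ((x - x_hat) ** 2).mean()
print("naive MSE:  ", naive_mse)
print("rotated MSE:", rotated_mse)     # rotation shrinks the error
```

Because the rotation collapses the quantization range before rounding, the rotate-then-quantize path reconstructs the vector with noticeably lower error than quantizing the raw outlier-laden vector at the same bit budget.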
Why This Works — The One-Line Proof
Outliers stretch the range → large slot width → large reconstruction error. Rotation spreads outliers across all dimensions → range collapses → tiny slot width → tiny error. And since we know the post-rotation distribution is always N(0, 1/d), we can precompute optimal slot positions. WHT makes this fast enough for real-time inference.
| Method | Cost | Handles Outliers | Optimal Slots |
|---|---|---|---|
| Naive Quantization | O(d) | ✗ | ✗ |
| Dense Rotation + Quant | O(d²) | ✓ | ✓ |
| WHT + PolarQuant | O(d log d) | ✓ | ✓ |