A Visual Essay on Quantization

Why Rotating Vectors Makes Compression Beautiful

From outliers to the Walsh-Hadamard Transform — a ground-up intuition for TurboQuant and PolarQuant

scroll to explore
§ 01

The Compression Problem

Modern language models store enormous tables of numbers called KV-cache vectors. Each vector has 128, 256, or even 4096 dimensions. Storing them at full precision (32-bit floats) is expensive. So we quantize — we round each number to the nearest slot on a discrete scale.

The simplest approach: divide the range of values into equally-spaced levels. With 4 bits you get 2⁴ = 16 levels. With 2 bits, just 4 levels.

Think of it like a ruler with only a few markings. The fewer markings, the less precisely you can measure — but the more space you save.
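A minimal sketch of that uniform quantizer in NumPy (the name `quantize_uniform` is mine, not from any library):

```python
import numpy as np

def quantize_uniform(x, bits):
    """Snap each value to the nearest of 2**bits equally spaced levels."""
    n_levels = 2 ** bits
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / (n_levels - 1)    # width of one slot
    codes = np.round((x - lo) / scale)    # integer code per value
    return lo + codes * scale             # reconstructed values

x = np.array([0.10, -0.30, 0.20, 0.05])
print(quantize_uniform(x, bits=2))   # at most 4 distinct values survive
```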

The problem? Real KV-cache vectors are not well-behaved. A handful of coordinates are enormous. Most are near zero. This is the outlier problem.

§ 02

The Outlier Problem — Interactive

Below is a simulated KV-cache vector with 64 coordinates. Five channels (in red) have values 20× larger than the rest. Use the slider to change the number of quantization bits and watch what happens.

Original Vector — Before Quantization
Quantized Vector — After Naive Quantization
Slider: bits (2 bits = 4 levels). Metrics shown: MSE · Range · Slot Width · Wasted Slots

Notice how the slot width scale = (max − min) / (n_levels − 1) is enormous because of the outliers. All the normal coordinates get collapsed into just a couple of slots. The ruler has been stretched to fit a few extreme values, so everyone else shares the same coarse markings.

⚡ Key Insight

Worst-case reconstruction error = 0.5 × scale = 0.5 × (max − min) / (n_levels − 1). Large range → large error. The outliers are not just annoying — they mathematically guarantee poor reconstruction for everyone else.
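That chain — large range → large slot width → large error — can be checked with a small simulation in the spirit of the demo above (illustrative values, not the widget's exact data):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0, 0.05, 64)
x[:5] = 1.0                              # five outlier channels, ~20x the rest

n_levels = 2 ** 4                        # 4-bit quantization
scale = (x.max() - x.min()) / (n_levels - 1)
xq = x.min() + np.round((x - x.min()) / scale) * scale

print("slot width:", scale)                        # stretched by the outliers
print("worst-case error bound:", 0.5 * scale)
print("MSE on the 59 normal channels:", np.mean((x[5:] - xq[5:]) ** 2))
```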

§ 03

The Fix: Rotate First

What if we applied a transformation that spreads the outlier energy evenly across all coordinates — before quantizing?

A rotation matrix R does exactly this. It has three magical properties:

Property                  Math                 What it means
Preserves length          ‖Rx‖ = ‖x‖           No information destroyed
Preserves dot products    (Rx)·(Ry) = x·y      Attention scores unchanged
Fully reversible          x = Rᵀ(Rx)           Perfect reconstruction possible
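All three properties are easy to verify numerically. Below, a random rotation is sampled by QR-factorizing a Gaussian matrix — one standard recipe, not necessarily the one TurboQuant uses:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
R, _ = np.linalg.qr(rng.normal(size=(d, d)))   # random orthogonal matrix

x = rng.normal(0, 0.05, d); x[:5] = 1.0        # outlier-heavy vector
y = rng.normal(size=d)

print(np.isclose(np.linalg.norm(R @ x), np.linalg.norm(x)))   # True: length preserved
print(np.isclose((R @ x) @ (R @ y), x @ y))                   # True: dot product preserved
print(np.allclose(R.T @ (R @ x), x))                          # True: fully reversible

print("max/mean ratio before:", np.abs(x).max() / np.abs(x).mean())
print("max/mean ratio after: ", np.abs(R @ x).max() / np.abs(R @ x).mean())
```

The last two lines preview the next section: the ratio drops sharply after rotation because the outlier energy has been spread out.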

Click Rotate to see what happens to our outlier vector after a random rotation:

Coordinate Values — Before vs After Rotation
Click Rotate to transform
Metrics: Max/Mean before = 22.1× · Max/Mean after · Norm preserved
The outlier energy didn't disappear — it was redistributed. Like pouring water from a tall thin glass into a wide flat bowl. The total amount of water (norm) is the same, but no single spot is overwhelmingly deep.

After rotation, all coordinates have similar magnitudes. Now the quantization range is small, slot width is tiny, and every slot does useful work. Reconstruction error collapses.

§ 04

The Exact Distribution After Rotation

Here's what makes this truly powerful. For a unit-norm vector rotated by a random rotation matrix, we don't just know the values spread out — we know exactly what distribution they follow.

Each coordinate of Rx ≈ N(0, 1/d) for large d

A bell curve centered at zero, with spread σ = 1/√d. For d = 128, that's σ ≈ 0.088.

This is extraordinary. No matter what vector you start with, after rotation, every coordinate follows the same predictable bell curve. We can verify this experimentally:

Distribution of coordinates after rotation (10,000 random vectors)
Press to simulate 10,000 rotations
Theoretical σ = 0.0884 · Empirical σ (from simulation)
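The simulation can also be reproduced offline; a sketch (fewer trials than the demo, for speed):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128
coords = []
for _ in range(500):
    R, _ = np.linalg.qr(rng.normal(size=(d, d)))   # fresh random rotation
    x = np.zeros(d); x[0] = 1.0                    # worst-case spiky unit vector
    coords.append(R @ x)
coords = np.concatenate(coords)

print("theoretical sigma:", 1 / np.sqrt(d))   # 1/sqrt(128) = 0.0884
print("empirical sigma:  ", coords.std())     # lands very close to it
```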
💡 Why This Matters for Quantization

Naive quantization places slots uniformly — equal gaps everywhere. But if we know the distribution is N(0, 1/d), we can place more slots near zero (where most data lives) and fewer at the edges (where almost nothing is). Same number of bits. Much better precision. This is PolarQuant.
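A toy version of distribution-aware slot placement: put the 2^bits levels at equal-probability quantiles of the bell curve. This is only an illustration of the idea, not PolarQuant's exact codebook; the helper name is mine.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_levels(bits, sigma, n_samples=200_000):
    """Levels at equal-probability quantiles of N(0, sigma^2),
    estimated from samples. More slots land where the data lives."""
    n = 2 ** bits
    probs = (np.arange(n) + 0.5) / n
    return np.quantile(rng.normal(0, sigma, n_samples), probs)

levels = gaussian_levels(bits=3, sigma=1 / np.sqrt(128))
print(np.round(levels, 4))            # levels crowd together near zero
print(np.round(np.diff(levels), 4))   # gaps widen toward the tails
```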

§ 05

The Walsh-Hadamard Transform: Fast Rotation

Dense random rotation requires O(d²) operations — a full matrix-vector multiply. For d = 128:

128 × 128 = 16,384 operations per vector

When you're processing millions of vectors in a language model, that's too slow. The Walsh-Hadamard Transform (WHT) achieves the same spreading effect using only additions and subtractions, in O(d log d):

128 × log₂(128) = 128 × 7 = 896 operations — 18× fewer

The Butterfly Pattern

WHT works in stages. At each stage, pairs of elements are added and subtracted. The pattern looks like butterfly wings — hence the name.

Each butterfly operation takes two numbers a and b and produces:

a ──┬──→  a + b
    ✕
b ──┴──→  a - b

No multiplication. Just + and −. Step through the animation below to watch information propagate from one coordinate to all coordinates:

WHT Butterfly — 8 Elements, 3 Stages
Stage 0 of 3
Initial: all information sits at coordinate 0.
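The whole transform fits in a few lines. A reference implementation of the unnormalized, in-place butterfly (the name `fwht` is mine):

```python
import numpy as np

def fwht(x):
    """Walsh-Hadamard transform via log2(d) butterfly stages.
    Additions and subtractions only; len(x) must be a power of two."""
    x = x.astype(float).copy()
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):   # each block of 2h elements
            for j in range(i, i + h):       # pair (j, j+h) = one butterfly
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x

x = np.zeros(8); x[0] = 1.0   # all information at coordinate 0 ...
print(fwht(x))                # ... spreads to every coordinate: all ones
```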

The Hadamard Matrix

If you write down the results of all possible butterfly patterns as a grid, you get the Hadamard matrix H — a matrix containing only +1 and −1. It's built recursively:

H₁ = [1]

H₂ = [ H₁   H₁ ] = [ +1  +1 ]
     [ H₁  -H₁ ]   [ +1  -1 ]

H₄ = [ H₂   H₂ ]  ...and so on
     [ H₂  -H₂ ]

Each size is two copies of the previous — one normal, one sign-flipped. This recursive doubling is why WHT achieves its O(d log d) efficiency.

Hadamard Matrix H₈ — only +1 and −1
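The recursive doubling translates directly to code; a sketch:

```python
import numpy as np

def hadamard(n):
    """Build H_n by recursive doubling (n must be a power of two)."""
    H = np.array([[1]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])   # two copies, one sign-flipped
    return H

H8 = hadamard(8)
print(H8)                                     # entries are only +1 and -1
print(np.allclose(H8 @ H8.T, 8 * np.eye(8)))  # True: rows are orthogonal
```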
§ 06

The Missing Ingredient: Random Signs

WHT alone has a problem: the Hadamard matrix H is fixed. Every vector gets the exact same transformation. This breaks the randomness guarantee we need for N(0, 1/d).

The Xerox Machine Analogy: Imagine a cheap photocopier that can only print 4 shades of grey. Ravi's answer sheet [10, 2, 1] and Priya's [9, 3, 1] both get photocopied to [dark, light, light]. The copies are identical. When the teacher tries to reconstruct the original marks — she can't. Two different inputs collapsed into one output. Information was permanently lost.

Random signs prevent this collision. Before and after WHT, we multiply each coordinate by a random +1 or −1:

# signs1, signs2: random ±1 vectors, sampled once and reused for every vector

# Step 1: random sign flip
x = x * signs1        # e.g. [+1, -1, +1, +1, -1, ...]

# Step 2: WHT butterfly spread
y = WHT(x)

# Step 3: random sign flip again
y = y * signs2

Now even an adversarial vector — one that happens to align with H's fixed pattern — is scrambled into a different direction before the transform. With high probability the energy spreads across all coordinates, quantized outputs stop colliding, and the N(0, 1/d) guarantee holds.
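A small sketch of why the fixed matrix alone is not enough: a vector that matches one of H's rows gets re-concentrated into a single spike by plain H, while sign scrambling (with high probability) keeps the energy spread. The helper name is mine.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8

H = np.array([[1.0]])                      # H_8 by recursive doubling
while H.shape[0] < d:
    H = np.block([[H, H], [H, -H]])

signs1 = rng.choice([-1.0, 1.0], size=d)   # sampled once, reused for all vectors
signs2 = rng.choice([-1.0, 1.0], size=d)

def randomized_wht(x):
    return signs2 * (H @ (signs1 * x))     # flip -> spread -> flip

x = H[3].copy()                # adversarial input: aligned with a Hadamard row
print(H @ x)                   # plain WHT re-concentrates it: one spike of 8.0
print(randomized_wht(x))       # scrambled first, so it typically stays spread
```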

Collision Demo — Same WHT vs Sign-Scrambled WHT
Two similar vectors, A and B
§ 07

The Complete Pipeline

Now we can assemble the full picture. This is how TurboQuant/PolarQuant processes every KV-cache vector:

📥

Input

Raw KV-cache vector with outlier channels. Max/Mean ratio ≈ 22×. Naive quantization fails here.

🎲

Step 1 — Random Sign Flip

Multiply each coordinate by random ±1. Breaks predictability, ensures no two vectors collide.

🦋

Step 2 — Walsh-Hadamard Transform

Butterfly spread across log₂(d) stages. Information propagates from every coordinate to every other. Only additions and subtractions. O(d log d).

🎲

Step 3 — Random Sign Flip Again

Second randomization pass. Together with Step 1, this guarantees the N(0, 1/d) distribution.

📐

Step 4 — Optimal Quantization

Since we know coordinates follow N(0, 1/d), we place more slots near zero, fewer at the edges. Maximum information per bit.

📦

Output

Compressed vector. To reconstruct: dequantize → inverse sign flip → inverse WHT → inverse sign flip.
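The pipeline above can be sketched as one round trip — here with a dense, normalized Hadamard matrix standing in for the fast butterfly, and plain uniform slots standing in for the optimal codebook:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128

H = np.array([[1.0]])                    # normalized Hadamard = a true rotation
while H.shape[0] < d:
    H = np.block([[H, H], [H, -H]])
H /= np.sqrt(d)

s1 = rng.choice([-1.0, 1.0], size=d)     # Steps 1 and 3: random signs
s2 = rng.choice([-1.0, 1.0], size=d)

x = rng.normal(0, 0.05, d); x[:5] = 1.0  # outlier-heavy input

# Forward: sign flip -> WHT -> sign flip -> 4-bit uniform quantization
y = s2 * (H @ (s1 * x))
scale = (y.max() - y.min()) / (2 ** 4 - 1)
yq = y.min() + np.round((y - y.min()) / scale) * scale

# Reconstruct: dequantize -> inverse sign flip -> inverse WHT -> inverse sign flip
x_hat = s1 * (H @ (s2 * yq))             # H is symmetric, so H^T = H

print("reconstruction MSE:", np.mean((x - x_hat) ** 2))
```

Because the transform is a rotation, the error introduced by quantizing y carries over to x unchanged in norm: the MSE is bounded by (scale / 2)², exactly the worst-case bound from § 02.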

Why This Works — The One-Line Proof

🎯 The Core Chain

Outliers stretch the range → large slot width → large reconstruction error. Rotation spreads outliers across all dimensions → range collapses → tiny slot width → tiny error. And since we know the post-rotation distribution is always N(0, 1/d), we can precompute optimal slot positions. WHT makes this fast enough for real-time inference.

Method                    Cost          Handles Outliers    Optimal Slots
Naive Quantization        O(d)          ✗                   ✗
Dense Rotation + Quant    O(d²)         ✓                   ✗
WHT + PolarQuant          O(d log d)    ✓                   ✓