Modules 5 & 6 · TurboQuant+

The Art of Selective Precision

Not every cache, layer, or token deserves the same treatment. Here's how TurboQuant learns to choose.

Module 5 · § 01

K and V are Not Equal

The attention formula has two caches. They look symmetric — both are matrices of shape (seq_len × d). But they play fundamentally different roles in producing the output:

Output = softmax( Q · Kᵀ / √d ) · V

K determines which tokens get attention — its values flow through softmax before contributing to output. V determines what information flows — its values are simply weighted and summed after the softmax weights are fixed.

This asymmetry has a dramatic consequence for quantization errors. Add the same amount of noise to K and to V — and see which causes more damage:

Interactive demo — K error vs V error: add equal noise (σ adjustable, default 0.05) to each cache and compare output MSE. K errors cause 8–9× more damage than V errors of equal magnitude, consistently across every noise level.
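The experiment above can be sketched in a few lines. This is a toy version, not the page's actual harness: shapes, scales, and seeds are illustrative assumptions. Queries are given a larger scale (std 3) to mimic the outlier-heavy activations of real transformers, which amplify K-side noise through the Q·K dot product; the exact damage ratio depends on those distributions.

```python
import numpy as np

def attention(Q, K, V):
    d = Q.shape[-1]
    z = Q @ K.T / np.sqrt(d)
    w = np.exp(z - z.max(axis=-1, keepdims=True))   # numerically stable softmax
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
seq, d, sigma = 256, 64, 0.05
Q = 3.0 * rng.standard_normal((seq, d))   # larger query scale, as in real models
K = rng.standard_normal((seq, d))
V = rng.standard_normal((seq, d))

base = attention(Q, K, V)
# Same noise magnitude, applied to K in one run and to V in the other.
mse_k = np.mean((attention(Q, K + sigma * rng.standard_normal(K.shape), V) - base) ** 2)
mse_v = np.mean((attention(Q, K, V + sigma * rng.standard_normal(V.shape)) - base) ** 2)
print(f"K-noise MSE {mse_k:.2e} vs V-noise MSE {mse_v:.2e} (ratio {mse_k / mse_v:.1f}x)")
```

K-side noise consistently produces the larger output error; the 8–9× figure quoted above comes from the page's own demo settings.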
Module 5 · § 02

Why Softmax Amplifies K Errors

The difference comes down to one equation. Softmax is exponential:

softmax(zᵢ + ε) ≈ softmax(zᵢ) × e^ε

A small perturbation ε in logit space doesn't add a small amount to the attention weight — it multiplies it by e^ε (the approximation holds while that token's weight remains small). For ε = 0.5, that's a 65% weight increase. For ε = 1.0, it's 172%.

V errors bypass softmax entirely. An error ε in a V value contributes at most wᵢ × ε to the output — where wᵢ is an attention weight always ≤ 1. The error gets scaled down, not amplified.

Interactive demo — softmax amplification: drag a slider to perturb the top attention score by ε and watch the whole weight distribution shift (readouts: top token weight, weight shift, attention status).
📐 The math

K quantization error → Δscore in dot product → softmax(score + Δscore) → exponential weight shift → wrong output. V quantization error → weight × (V + ε) → error bounded by wᵢ × ε → always small.
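The e^ε factor is easy to check numerically — no model needed, just a softmax over some arbitrary logits and a perturbation on one of them:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.linspace(-2, 2, 50)      # 50 arbitrary logits
w = softmax(z)
for eps in (0.5, 1.0):
    z2 = z.copy()
    z2[10] += eps               # perturb one mid-ranked logit
    ratio = softmax(z2)[10] / w[10]
    print(f"eps={eps}: weight multiplied by {ratio:.2f} (e^eps = {np.exp(eps):.2f})")
```

The measured multiplier tracks e^ε almost exactly, confirming the 65% and 172% figures above.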

Module 5 · § 03

Same Bits, Opposite Results

The most dramatic demonstration: fix a bit budget averaging ~5.8 bits per value, and split it between the two caches in opposite ways:

Config          K bits   V bits   Avg bpv   Compression   PPL impact
q8_0 / turbo3   8.5      3.125    5.8       2.8×          +1.1–2.1%
turbo3 / q8_0   3.125    8.5      5.8       2.8×          catastrophic

Identical average bits. 500× quality difference — purely from deciding which cache gets compressed.

Interactive demo — K/V config comparison: click each config to see its attention output MSE and a verdict.

The Safe Decision Tree

Model weights?
│
├── Q8_0 or higher
│   ├── Best:        -ctk turbo4 -ctv turbo4
│   └── More:        -ctk turbo3 -ctv turbo3
│
└── Q4_K_M
    ├── Qwen2.5      → MUST use asymmetric: -ctk q8_0 -ctv turbo3
    ├── Unknown      → Safe default:  -ctk q8_0 -ctv turbo4
    └── Llama/Mistral 24B+ → Try symmetric, fallback to asymmetric
The safe universal default: keep K at q8_0, compress V as aggressively as your quality budget allows.
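The tree above collapses into a few lines of code. This is a sketch only — the family matching and returned flag strings are simplified assumptions for illustration, not the project's actual selection logic:

```python
def kv_flags(weight_quant: str, family: str = "unknown") -> str:
    """Pick -ctk/-ctv flags from the decision tree (illustrative sketch)."""
    if weight_quant in ("q8_0", "f16", "bf16"):   # Q8_0 or higher weights
        return "-ctk turbo4 -ctv turbo4"          # best: fully symmetric
    # Q4_K_M weights: K needs protection.
    if family.startswith("qwen2.5"):
        return "-ctk q8_0 -ctv turbo3"            # must be asymmetric
    if family == "unknown":
        return "-ctk q8_0 -ctv turbo4"            # safe default
    return "-ctk turbo4 -ctv turbo4"              # try symmetric, fall back if PPL degrades

print(kv_flags("q4_k_m"))                         # unknown family → safe default
```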
Module 6 — Advanced Optimizations
Module 6 · § 01

Boundary V: Not All Layers Are Equal

You know V can be compressed aggressively. But not every layer tolerates the same level of compression. The first 2 and last 2 layers are uniquely exposed:

Layers       What they do                       Error behaviour
First 2      Raw tokens → first hidden state    Errors propagate through ALL remaining layers
Middle N−4   Feature transformation             Errors pass through fewer layers — the network self-corrects
Last 2       Hidden state → logit predictions   Errors feed directly into token prediction — no recovery

Boundary V protects exactly those 4 layers with high precision V, compresses everything else aggressively:

layers 0, 1, N−2, N−1 → V at q8_0 (8.5 bits)
all middle layers     → V at turbo2 (2.5 bits)
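The per-layer rule is two lines of code, and the average cost follows directly (bit costs as quoted: q8_0 ≈ 8.5 bpv, turbo2 = 2.5 bpv; the function name is illustrative):

```python
def v_bits(layer: int, n_layers: int) -> float:
    """Boundary V: first 2 and last 2 layers at q8_0, the rest at turbo2."""
    boundary = layer < 2 or layer >= n_layers - 2
    return 8.5 if boundary else 2.5

# Average V bpv for the three model depths discussed below.
for n in (28, 40, 64):
    avg = sum(v_bits(l, n) for l in range(n)) / n
    print(f"{n} layers: avg V bpv {avg:.3f} (+{avg - 2.5:.3f} over pure turbo2)")
```

The overhead shrinks as the model gets deeper — the same 4 protected layers are a smaller fraction of the total, which is why the 64-layer model below gets the protection almost for free.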

Interactive demo — Boundary V per-layer precision: adjust the model depth (default 32 layers) to see how Boundary V's average bpv, overhead vs turbo2, protected layer count, and quality recovery change.
Model                         turbo2 PPL   Boundary V PPL   turbo3 PPL   Recovered
phi-4 (40 layers)             4.835        4.784            4.742        55%
Qwen2.5-7B (28 layers)        6.911        6.835            6.707        37%
Qwen3.5-35B MoE (64 layers)   5.257        5.148            5.137        91%

The 64-layer model recovers 91% of the quality gap because 4 boundary layers represent a smaller fraction of a deeper model — same protection, lower proportional cost.

Module 6 · § 02

Sparse V: Skip What Doesn't Matter

After computing softmax(QKᵀ/√d), you get attention weights over all tokens. At long context, most of those weights are negligibly small — the model is paying essentially zero attention to most past tokens.

If a token has weight 0.000001, its V vector contributes 0.000001 × V to the output — essentially nothing. Yet standard attention still dequantizes every V vector, loading all those indices and norm values from memory.

Sparse V skips dequantization for tokens below a threshold. The weights are computed from K before V is touched — so we already know which tokens to skip.
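A minimal sketch of that ordering — weights first, V only for the survivors. The `dequant` callback stands in for the real per-token dequantization kernel; everything else here is illustrative:

```python
import numpy as np

THRESH = 1e-6

def sparse_v_output(w, quantized_V, dequant):
    """Weighted sum over V, skipping rows whose attention weight is negligible."""
    keep = w >= THRESH                          # known before V is touched
    out = np.zeros(quantized_V.shape[-1])
    for i in np.flatnonzero(keep):
        out += w[i] * dequant(quantized_V[i])   # only these rows are loaded
    return out, int(keep.sum())

# Demo with identity "dequantization" on random data.
rng = np.random.default_rng(1)
z = 6.0 * rng.standard_normal(4096)             # peaky logits → sparse weights
w = np.exp(z - z.max()); w /= w.sum()
V = rng.standard_normal((4096, 64))
out, n_loaded = sparse_v_output(w, V, lambda row: row)
dense = w @ V
print(f"loaded {n_loaded}/4096 rows, max |error| = {np.abs(out - dense).max():.1e}")
```

Every skipped row contributes less than 1e-6 × its values to the output, so the result is numerically indistinguishable from dense attention while most rows are never read.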

Interactive demo — attention weight distribution: vary the context length (default 1024 tokens) to see what fraction of tokens is skippable (w < 1e-6), how much weight the top 5 tokens hold, and the ≈ 0 quality impact.
Skipping 0.1% sounds tiny — but at 32K context that's 32 fewer vector loads per head per token. Memory bandwidth is the bottleneck at long context, not compute. Small reductions in loads give disproportionate speedups.
⚡ Real-world results

+22.8% decode throughput at 32K context on MoE models (M5 Max). Zero perplexity impact (validated at 32K, 50 chunks, wikitext-103). Sparse V actually improved NIAH retrieval (9/9 vs 7/9) — dequantizing near-zero positions may introduce tiny artifacts. Skipping them removes that noise.

Module 6 · § 03

Block Size 128: Free Compression

PolarQuant stores each vector as: indices (2/3/4 bits each) + one norm (2 bytes). The original implementation organised storage in 32-element blocks — each block getting its own norm.

For a 128-dimensional vector:

Block size 32:  128 dims / 32 = 4 blocks → 4 norms × 2 bytes = 8 bytes
                But all 4 blocks store the SAME norm — a vector's norm is a single number, so 3 of the 4 copies are pure redundancy.
Block size 128: 128 dims / 128 = 1 block → 1 norm × 2 bytes = 2 bytes
                Save 6 bytes per vector. Zero quality change.
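The arithmetic above, as a function (3-bit indices and 2-byte norms, per the scheme described; the helper name is illustrative):

```python
def bytes_per_vector(dims=128, index_bits=3, block=32, norm_bytes=2):
    """Storage cost of one PolarQuant-style vector: packed indices + one norm per block."""
    n_blocks = dims // block
    return dims * index_bits / 8 + n_blocks * norm_bytes

b32 = bytes_per_vector(block=32)     # 48 B of indices + 8 B of norms
b128 = bytes_per_vector(block=128)   # 48 B of indices + 2 B of norms
print(f"block 32: {b32:.0f} B, block 128: {b128:.0f} B, saved: {b32 - b128:.0f} B/vector")
```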
Block size vs bits-per-value: 3.375 bpv at block size 32 → 3.125 bpv at block size 128 — a +12% compression gain with zero quality impact.
Modules 5 & 6 · Summary

The Complete Optimization Stack

All six optimizations are orthogonal — they don't interfere with each other, so enabling all of them delivers all six benefits simultaneously.

01
🔄

PolarQuant Core

WHT rotation → N(0,1/d) → Lloyd-Max optimal codebook → norm handling

02
🎯

Algorithm Choice

turbo4 for K (no QJL). MSE-only PolarQuant for V. More centroids beats error correction.

03
⚖️

Asymmetric K/V

K at q8_0 — protected from softmax amplification. V compressed aggressively (turbo2–turbo4).

04
🛡️

Boundary V

First 2 + last 2 layers at q8_0-V. Middle layers at turbo2-V. Recovers 37–91% quality gap.

05

Sparse V

Skip dequant for attention weights below 1e-6. +22.8% decode speed. Zero quality loss.

06
📦

Block Size 128

One norm per vector instead of four. +12% compression for free. Zero quality impact.

Interactive demo — cumulative improvement: enable the optimizations one by one and watch compression, decode speed, and quality move from the 1.0× baseline.
Final result                   Value
KV cache compression           3.8–6.4×
PPL impact                     +0.23% to +6.5%
Decode speed at 32K context    +22.8%
Memory saved at 128K / 70B     ~5 GB