K and V are Not Equal
Attention keeps two caches, K and V. They look symmetric — both are matrices of shape (seq_len × d) — but they play fundamentally different roles in producing the output:
K determines which tokens get attention — its values flow through softmax before contributing to output. V determines what information flows — its values are simply weighted and summed after the softmax weights are fixed.
This asymmetry has a dramatic consequence for quantization errors. Add the same amount of noise to K and to V — and see which causes more damage:
Why Softmax Amplifies K Errors
The difference comes down to one equation. Softmax is exponential: wᵢ = exp(sᵢ) / Σⱼ exp(sⱼ).
A small perturbation ε in logit space doesn't add a small amount to the attention weight — it multiplies the unnormalised weight by e^ε. For ε = 0.5, that's a 65% weight increase. For ε = 1.0, it's 172%.
V errors bypass softmax entirely. An error ε in a V value contributes at most wᵢ × ε to the output — where wᵢ is an attention weight always ≤ 1. The error gets scaled down, not amplified.
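A minimal numerical sketch of this asymmetry, using toy logits and scalar values rather than real model tensors:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.5, 0.0])   # toy attention logits
V = np.array([1.0, -1.0, 0.5, 2.0])       # toy scalar values per token
w = softmax(scores)
out = w @ V
eps = 0.5

# K-side error lands in logit space: the top weight grows by roughly
# e^eps (+65%) before renormalisation pulls the distribution back.
w_k = softmax(scores + np.array([eps, 0.0, 0.0, 0.0]))
k_shift = w_k[0] / w[0]          # approaches e^eps as the weight shrinks

# V-side error bypasses softmax: the output moves by exactly w[0] * eps,
# which is always smaller than eps itself.
out_v = w @ (V + np.array([eps, 0.0, 0.0, 0.0]))
v_err = abs(out_v - out)
```

The K-side shift is multiplicative and compounds with the logit magnitude; the V-side error is capped by an attention weight that never exceeds 1.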
K quantization error → Δscore in the dot product → softmax(score + Δscore) → exponential weight shift → wrong output.
V quantization error → wᵢ × (V + ε) → error bounded by wᵢ × ε → always small.
Same Bits, Opposite Results
The most dramatic demonstration: fix a bit budget averaging ~5.8 bits per value and split it between the two caches in opposite ways:
| Config | K bits | V bits | Avg bpv | Compression | PPL impact |
|---|---|---|---|---|---|
| q8_0 / turbo3 | 8.5 | 3.125 | 5.8 | 2.8× | +1.1–2.1% |
| turbo3 / q8_0 | 3.125 | 8.5 | 5.8 | 2.8× | catastrophic (~500× worse) |
Identical average bits. 500× quality difference — purely from deciding which cache gets compressed.
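The bit counts in the table follow from the storage layout: each value costs its index bits plus an amortised share of a 2-byte per-block scale or norm (assuming q8_0's 32-element blocks and turbo3's 128-element blocks). A sketch of the arithmetic:

```python
def bits_per_value(index_bits, block_size, norm_bytes=2):
    # payload bits plus an amortised 2-byte scale/norm per block
    return index_bits + norm_bytes * 8 / block_size

q8_0   = bits_per_value(8, 32)     # 8.5: 8-bit values, fp16 scale per 32
turbo3 = bits_per_value(3, 128)    # 3.125: 3-bit indices, one norm per 128
avg = (q8_0 + turbo3) / 2          # 5.8125, the ~5.8 bpv budget above
```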
The Safe Decision Tree
```
Model weights?
├── Q8_0 or higher
│   ├── Best: -ctk turbo4 -ctv turbo4
│   └── More: -ctk turbo3 -ctv turbo3
└── Q4_K_M
    ├── Qwen2.5            → MUST asymmetric: -ctk q8_0 -ctv turbo3
    ├── Unknown            → Safe default:    -ctk q8_0 -ctv turbo4
    └── Llama/Mistral 24B+ → Try symmetric, fallback to asymmetric
```
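Applied to the safe default for an unknown model, an invocation might look like this (the binary name and `-fa` flag are illustrative; the `turbo*` cache types come from the fork described here, not upstream llama.cpp):

```shell
# Unknown Q4_K_M model → safe default: high-precision K, compressed V
./llama-cli -m model-Q4_K_M.gguf -c 32768 -fa \
    -ctk q8_0 -ctv turbo4
```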
Boundary V: Not All Layers Are Equal
You know V can be compressed aggressively. But not every layer tolerates the same level of compression. The first 2 and last 2 layers are uniquely exposed:
| Layers | What they do | Error behaviour |
|---|---|---|
| First 2 | Raw tokens → first hidden state | Errors propagate through ALL remaining layers |
| Middle N-4 | Feature transformation | Errors propagate through fewer layers — network self-corrects |
| Last 2 | Hidden state → logit predictions | Errors go directly into token prediction — no recovery |
Boundary V protects exactly those 4 layers with high precision V, compresses everything else aggressively:
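As a sketch, the per-layer choice reduces to a simple rule (the function and type strings are illustrative, not the actual implementation):

```python
def v_cache_type(layer, n_layers, n_boundary=2,
                 protected="q8_0", aggressive="turbo2"):
    """High-precision V for the first/last n_boundary layers,
    aggressive compression everywhere else."""
    if layer < n_boundary or layer >= n_layers - n_boundary:
        return protected
    return aggressive

# phi-4 (40 layers): 4 protected boundary layers, 36 compressed
plan = [v_cache_type(i, 40) for i in range(40)]
```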
| Model | turbo2 PPL | Boundary V PPL | turbo3 PPL | Recovered |
|---|---|---|---|---|
| phi-4 (40 layers) | 4.835 | 4.784 | 4.742 | 55% |
| Qwen2.5-7B (28 layers) | 6.911 | 6.835 | 6.707 | 37% |
| Qwen3.5-35B MoE (64 layers) | 5.257 | 5.148 | 5.137 | 91% |
The 64-layer model recovers 91% of the quality gap because 4 boundary layers represent a smaller fraction of a deeper model — same protection, lower proportional cost.
Sparse V: Skip What Doesn't Matter
After computing softmax(QKᵀ/√d), you get attention weights over all tokens. At long context, most of those weights are negligibly small — the model is paying essentially zero attention to most past tokens.
If a token has weight 0.000001, its V vector contributes 0.000001 × V to the output — essentially nothing. Yet standard attention still dequantizes every V vector, loading all those indices and norm values from memory.
Sparse V skips dequantization for tokens below a threshold. The weights are computed from K before V is touched — so we already know which tokens to skip.
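A minimal sketch of that skip logic (the `dequantize` callable and the payload layout are placeholders, not the real kernel):

```python
import numpy as np

def sparse_v_sum(weights, v_payloads, dequantize, thresh=1e-6):
    """Weighted sum over V rows, skipping dequantisation wherever the
    attention weight (already known from K) falls below thresh."""
    out = None
    skipped = 0
    for w, payload in zip(weights, v_payloads):
        if w <= thresh:
            skipped += 1          # never load or dequantize this row
            continue
        v = dequantize(payload)
        out = w * v if out is None else out + w * v
    return out, skipped
```

With weights like [0.7, 0.3, 1e-9], only two rows are ever dequantized; the third would contribute ~1e-9 of signal and is dropped outright.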
+22.8% decode throughput at 32K context on MoE models (M5 Max). Zero perplexity impact (validated at 32K, 50 chunks, wikitext-103). Sparse V actually improved NIAH retrieval (9/9 vs 7/9) — dequantizing near-zero positions may introduce tiny artifacts. Skipping them removes that noise.
Block Size 128: Free Compression
PolarQuant stores each vector as: indices (2/3/4 bits each) + one norm (2 bytes). The original implementation organised storage in 32-element blocks — each block getting its own norm.
For a 128-dimensional vector:
```
Block size 32:  128 dims / 32  = 4 blocks → 4 norms × 2 bytes = 8 bytes
                but all 4 norms store the SAME VALUE — 3 are pure redundancy!
Block size 128: 128 dims / 128 = 1 block  → 1 norm  × 2 bytes = 2 bytes
```
Save 6 bytes per vector. Zero quality change.
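For turbo3's 3-bit indices, the "+12% compression" figure quoted in the stack summary below falls out of the same arithmetic (helper name illustrative):

```python
def bits_per_value(index_bits, block_size, norm_bytes=2):
    # payload bits plus the amortised 2-byte norm per block
    return index_bits + norm_bytes * 8 / block_size

old = bits_per_value(3, 32)    # 3.5 bpv: a norm every 32 elements
new = bits_per_value(3, 128)   # 3.125 bpv: one norm per 128-dim vector
gain = old / new               # 1.12, i.e. +12% compression, same quality
```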
The Complete Optimization Stack
All six optimizations are orthogonal — none interferes with the others — so enabling all of them yields all of the benefits simultaneously.
PolarQuant Core
WHT rotation → N(0,1/d) → Lloyd-Max optimal codebook → norm handling
Algorithm Choice
turbo4 for K (no QJL). MSE-only PolarQuant for V. More centroids beats error correction.
Asymmetric K/V
K at q8_0 — protected from softmax amplification. V compressed aggressively (turbo2–turbo4).
Boundary V
First 2 + last 2 layers at q8_0-V. Middle layers at turbo2-V. Recovers 37–91% quality gap.
Sparse V
Skip dequant for attention weights below 1e-6. +22.8% decode speed. Zero quality loss.
Block Size 128
One norm per vector instead of four. +12% compression for free. Zero quality impact.
| Final result | Value |
|---|---|
| KV cache compression | 3.8–6.4× |
| PPL impact | +0.23% to +6.5% |
| Decode speed at 32K context | +22.8% |
| Memory saved at 128K / 70B | ~5 GB |