K and V are Not Equal
Attention keeps two caches, K and V. They look symmetric — both are matrices of shape (seq_len × d) — but they play fundamentally different roles in producing the output:
K determines which tokens get attention — its values flow through softmax before contributing to output. V determines what information flows — its values are simply weighted and summed after the softmax weights are fixed.
This asymmetry has a dramatic consequence for quantization errors. Add the same amount of noise to K and to V — and see which causes more damage:
Why Softmax Amplifies K Errors
The difference comes down to one equation. Softmax is exponential: wᵢ = exp(sᵢ) / Σⱼ exp(sⱼ).
A small perturbation ε in logit space doesn't add a small amount to the attention weight — it multiplies the unnormalised weight by e^ε. For ε = 0.5, that's a 65% weight increase. For ε = 1.0, it's 172%.
V errors bypass softmax entirely. An error ε in a V value contributes at most wᵢ × ε to the output — where wᵢ is an attention weight always ≤ 1. The error gets scaled down, not amplified.
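A minimal numerical sketch of this asymmetry, using toy logits and scalar values rather than real model tensors:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.5, 0.0])   # toy attention logits
V = np.array([1.0, -1.0, 0.5, 2.0])       # toy scalar values per token
w = softmax(scores)
out = w @ V
eps = 0.5

# K-side error lands in logit space: the top weight grows by roughly
# e^eps (+65%) before renormalisation pulls the distribution back.
w_k = softmax(scores + np.array([eps, 0.0, 0.0, 0.0]))
k_shift = w_k[0] / w[0]          # approaches e^eps as the weight shrinks

# V-side error bypasses softmax: the output moves by exactly w[0] * eps,
# which is always smaller than eps itself.
out_v = w @ (V + np.array([eps, 0.0, 0.0, 0.0]))
v_err = abs(out_v - out)
```

The K-side shift is multiplicative and compounds with the logit magnitude; the V-side error is capped by an attention weight that never exceeds 1.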
K quantization error → Δscore in the dot product → softmax(score + Δscore) → exponential weight shift → wrong output.
V quantization error → wᵢ × (V + ε) → error bounded by wᵢ × ε → always small.
Same Bits, Opposite Results
The most dramatic demonstration: fix a bit budget averaging ~5.8 bits per value and split it between the two caches in opposite ways:
| Config | K bits | V bits | Avg bpv | Compression | PPL impact |
|---|---|---|---|---|---|
| q8_0 / turbo3 | 8.5 | 3.125 | 5.8 | 2.8× | +1.1–2.1% |
| turbo3 / q8_0 | 3.125 | 8.5 | 5.8 | 2.8× | catastrophic (~500× worse) |
Identical average bits. 500× quality difference — purely from deciding which cache gets compressed.
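The bit counts in the table follow from the storage layout: each value costs its index bits plus an amortised share of a 2-byte per-block scale or norm (assuming q8_0's 32-element blocks and turbo3's 128-element blocks). A sketch of the arithmetic:

```python
def bits_per_value(index_bits, block_size, norm_bytes=2):
    # payload bits plus an amortised 2-byte scale/norm per block
    return index_bits + norm_bytes * 8 / block_size

q8_0   = bits_per_value(8, 32)     # 8.5: 8-bit values, fp16 scale per 32
turbo3 = bits_per_value(3, 128)    # 3.125: 3-bit indices, one norm per 128
avg = (q8_0 + turbo3) / 2          # 5.8125, the ~5.8 bpv budget above
```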
The Safe Decision Tree
```
Model weights?
├── Q8_0 or higher
│   ├── Best: -ctk turbo4 -ctv turbo4
│   └── More: -ctk turbo3 -ctv turbo3
└── Q4_K_M
    ├── Qwen2.5            → MUST asymmetric: -ctk q8_0 -ctv turbo3
    ├── Unknown            → Safe default:    -ctk q8_0 -ctv turbo4
    └── Llama/Mistral 24B+ → Try symmetric, fallback to asymmetric
```
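Applied to the safe default for an unknown model, an invocation might look like this (the binary name and `-fa` flag are illustrative; the `turbo*` cache types come from the fork described here, not upstream llama.cpp):

```shell
# Unknown Q4_K_M model → safe default: high-precision K, compressed V
./llama-cli -m model-Q4_K_M.gguf -c 32768 -fa \
    -ctk q8_0 -ctv turbo4
```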
Boundary V: Not All Layers Are Equal
You know V can be compressed aggressively. But not every layer tolerates the same level of compression. The first 2 and last 2 layers are uniquely exposed:
| Layers | What they do | Error behaviour |
|---|---|---|
| First 2 | Raw tokens → first hidden state | Errors propagate through ALL remaining layers |
| Middle N-4 | Feature transformation | Errors propagate through fewer layers — network self-corrects |
| Last 2 | Hidden state → logit predictions | Errors go directly into token prediction — no recovery |
Boundary V protects exactly those 4 layers with high precision V, compresses everything else aggressively:
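As a sketch, the per-layer choice reduces to a simple rule (the function and type strings are illustrative, not the actual implementation):

```python
def v_cache_type(layer, n_layers, n_boundary=2,
                 protected="q8_0", aggressive="turbo2"):
    """High-precision V for the first/last n_boundary layers,
    aggressive compression everywhere else."""
    if layer < n_boundary or layer >= n_layers - n_boundary:
        return protected
    return aggressive

# phi-4 (40 layers): 4 protected boundary layers, 36 compressed
plan = [v_cache_type(i, 40) for i in range(40)]
```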
| Model | turbo2 PPL | Boundary V PPL | turbo3 PPL | Recovered |
|---|---|---|---|---|
| phi-4 (40 layers) | 4.835 | 4.784 | 4.742 | 55% |
| Qwen2.5-7B (28 layers) | 6.911 | 6.835 | 6.707 | 37% |
| Qwen3.5-35B MoE (64 layers) | 5.257 | 5.148 | 5.137 | 91% |
The 64-layer model recovers 91% of the quality gap because 4 boundary layers represent a smaller fraction of a deeper model — same protection, lower proportional cost.
Sparse V: Skip What Doesn't Matter
After computing softmax(QKᵀ/√d), you get attention weights over all tokens. At long context, most of those weights are negligibly small — the model is paying essentially zero attention to most past tokens.
If a token has weight 0.000001, its V vector contributes 0.000001 × V to the output — essentially nothing. Yet standard attention still dequantizes every V vector, loading all those indices and norm values from memory.
Sparse V skips dequantization for tokens below a threshold. The weights are computed from K before V is touched — so we already know which tokens to skip.
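A minimal sketch of that skip logic (the `dequantize` callable and the payload layout are placeholders, not the real kernel):

```python
import numpy as np

def sparse_v_sum(weights, v_payloads, dequantize, thresh=1e-6):
    """Weighted sum over V rows, skipping dequantisation wherever the
    attention weight (already known from K) falls below thresh."""
    out = None
    skipped = 0
    for w, payload in zip(weights, v_payloads):
        if w <= thresh:
            skipped += 1          # never load or dequantize this row
            continue
        v = dequantize(payload)
        out = w * v if out is None else out + w * v
    return out, skipped
```

With weights like [0.7, 0.3, 1e-9], only two rows are ever dequantized; the third would contribute ~1e-9 of signal and is dropped outright.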
+22.8% decode throughput at 32K context on MoE models (M5 Max). Zero perplexity impact (validated at 32K, 50 chunks, wikitext-103). Sparse V actually improved NIAH retrieval (9/9 vs 7/9) — dequantizing near-zero positions may introduce tiny artifacts. Skipping them removes that noise.
Block Size 128: Free Compression
PolarQuant stores each vector as: indices (2/3/4 bits each) + one norm (2 bytes). The original implementation organised storage in 32-element blocks — each block getting its own norm.
For a 128-dimensional vector:
```
Block size 32:  128 dims / 32  = 4 blocks → 4 norms × 2 bytes = 8 bytes
                but all 4 norms store the SAME VALUE — 3 are pure redundancy!
Block size 128: 128 dims / 128 = 1 block  → 1 norm  × 2 bytes = 2 bytes
```
Save 6 bytes per vector. Zero quality change.
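For turbo3's 3-bit indices, the "+12% compression" figure quoted in the stack summary below falls out of the same arithmetic (helper name illustrative):

```python
def bits_per_value(index_bits, block_size, norm_bytes=2):
    # payload bits plus the amortised 2-byte norm per block
    return index_bits + norm_bytes * 8 / block_size

old = bits_per_value(3, 32)    # 3.5 bpv: a norm every 32 elements
new = bits_per_value(3, 128)   # 3.125 bpv: one norm per 128-dim vector
gain = old / new               # 1.12, i.e. +12% compression, same quality
```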
The Complete Optimization Stack
All six optimizations are orthogonal — none interferes with the others — so enabling all of them yields all of the benefits simultaneously.
PolarQuant Core
WHT rotation → N(0,1/d) → Lloyd-Max optimal codebook → norm handling
Algorithm Choice
turbo4 for K (no QJL). MSE-only PolarQuant for V. More centroids beats error correction.
Asymmetric K/V
K at q8_0 — protected from softmax amplification. V compressed aggressively (turbo2–turbo4).
Boundary V
First 2 + last 2 layers at q8_0-V. Middle layers at turbo2-V. Recovers 37–91% quality gap.
Sparse V
Skip dequant for attention weights below 1e-6. +22.8% decode speed. Zero quality loss.
Block Size 128
One norm per vector instead of four. +12% compression for free. Zero quality impact.
| Final result | Value |
|---|---|
| KV cache compression | 3.8–6.4× |
| PPL impact | +0.23% to +6.5% |
| Decode speed at 32K context | +22.8% |
| Memory saved at 128K / 70B | ~5 GB |