Back to news
braincontext engineeringMarch 18, 202618 min

KV cache, attention and the long context engineering problem

Pedro Sanches

Pedro Sanches

Senior Tech Lead · Accenture

KV cache, attention and the long context engineering problem

Long context engineering begins with a mathematical reality that practitioners frequently underestimate. Standard attention is quadratic: for a window of N tokens, computational cost grows as O(N²). A model with a 1 million token window running dense attention would have prohibitive memory requirements without fundamental optimizations. Understanding these optimizations is not implementation trivia. It is what allows reasoning about when and why long context systems fail, cost too much, or exhibit quality degradation in certain window regions.

The central operation of the transformer is self-attention. Given input X with n tokens, we generate three matrices via learned projections: Q = XW_Q, K = XW_K, V = XW_V. Attention is then Attention(Q, K, V) = softmax(QK^T / √d_k) V. The QK^T product produces an n×n matrix of attention scores. Each row represents the attention distribution of one token over all others. This matrix occupies O(N²) in memory, which for N=1M tokens would be 4TB just for attention scores in float32 before any operation.

FlashAttention solves the memory problem through memory hierarchy-aware tiling. Instead of materializing the complete attention matrix in HBM (High Bandwidth Memory), the algorithm divides Q, K, V into blocks that fit in on-chip SRAM, computes attention in blocks and accumulates the result without ever writing the complete N×N matrix to HBM. FlashAttention 2 improves sequence parallelism. FlashAttention 3 unifies operations for H100 Tensor Cores. The practical result: 2-4x speedup and 5-20x reduction in memory consumption for long sequences.

The KV cache is the mechanism that makes autoregressive inference efficient. In token-by-token generation, each new token needs to compute attention with all previous tokens. Without cache, each step would recompute K and V for the entire sequence. With KV cache, K and V vectors of already-processed tokens are stored in memory and reused. The marginal cost of each new token is O(N) instead of O(N²). For a 100K token window generating 1K new tokens, this represents 100x reduction in attention work.

But the KV cache creates its own engineering problem: memory management. Each token occupies 2 × n_heads × d_head × dtype bytes of cache (one K tensor and one V tensor). For GPT-3 with 96 heads of size 128 in FP16, each token occupies 96 × 128 × 2 × 2 = 48KB. A 128K token window occupies 6GB. A 1M token window occupies 48GB — the entire memory of an A100 80GB with no room for model weights. This explains why 1M token context length requires techniques like GQA (Grouped-Query Attention) and MQA (Multi-Query Attention), which reduce KV cache space by sharing heads across groups or completely.

The "Lost in the Middle" phenomenon (Liu et al., 2023) reveals that modern language models have systematic primacy and recency bias: information at the beginning and end of the window is processed more reliably. Information in the middle of the 1M token window suffers recall degradation. This has direct implications for context engineering system design. A well-designed RAG pipeline does not simply place the N most relevant documents in the window. It weights the position of each fragment. Information critical for reasoning should be near the beginning or near the final instruction. Background documents can be inserted in the middle without severe impact.

Context extension beyond training requires modifications to positional encodings. Models with RoPE (Rotary Position Embedding) can be extended via YaRN (Yet another RoPE extensioN), which dynamically scales rotation frequency to accommodate longer sequences. ALiBi (Attention with Linear Biases) does not encode position directly — instead, it adds a growing negative linear bias with distance between tokens, allowing cleaner extrapolation. The choice of positional encoding is an architectural decision with long-tail consequences: models with well-calibrated RoPE generalize well to sequences 4-8x larger than training with minimal fine-tuning; models with absolute positional encoding degrade rapidly beyond the training context.

Prefix caching is the technique that reduces inference cost for workloads with reusable prefix. When the system prompt, instructions, or context documents are identical across requests, the KV cache for those tokens can be computed once and reused. Anthropic implements this as Prompt Caching: context blocks marked with cache_control are stored for 5 minutes (default) or 1 hour (extended TTL), reducing cost by up to 90% for cached segments. The architectural implication is clear: reusable context (system prompts, ontologies, reference documents) should be placed at the beginning of the message and explicitly marked for caching. Variable context (queries, runtime data) should follow the fixed prefix.

Context compression allocates fewer tokens to represent the same semantics. RAPTOR (Recursive Abstractive Processing for Tree-Organized Retrieval) builds a hierarchical summary tree: text fragments are semantically clustered, each cluster is summarized, summaries are clustered again, and so on up to a root summary. The system can answer queries by navigating the correct level of the hierarchy — using the granular level for specific queries and the abstract level for queries about overall trends. This reduces cost and improves relevance simultaneously.

Long context engineering is therefore not a matter of simply increasing the window size. It is a budgeting discipline that considers: quadratic vs linear computational cost, strategic positioning of critical information, leveraging prefix caching to amortize fixed context cost, hierarchical compression for variable context, and quality degradation monitoring across different window regions. Systems that treat the context window as a simple memory buffer eventually pay in cost, latency or response quality.

#kv-cache#attention#contexto-longo#flashattention#rag#compressao
Pedro Sanches

Pedro Sanches

Senior Tech Lead · Accenture Brasil