On context windows and decomposition
Context windows will keep growing. We've gone from 4K to 128K to 200K tokens in a couple of years. Million-token windows are already in preview. The raw limit will matter less and less.
But bigger windows don't solve the underlying problem. The real constraint isn't storage. It's working memory. And the solution isn't more context. It's better decomposition.
Humans don't solve complex problems by loading everything into consciousness simultaneously. We decompose, work a piece, checkpoint, move on.
A chess grandmaster doesn't hold every possible game state in mind. They chunk patterns, focus on the relevant board position, reason about bounded possibilities. This isn't a limitation we overcome with more brainpower. It's how effective reasoning actually works.
Bounded scope. Clear focus. Verifiable progress.
You don't need to hold the entire codebase in mind. You need to identify the right tree, understand it fully, do your work, move on. Even with infinite context, you'd still want this structure. The constraint just makes the pattern mandatory rather than optional.
To be clear: bigger context windows are genuinely useful. More context means better recall, more nuance, fewer trips back to fetch information. Going from 4K to 128K tokens was a real unlock for many tasks.
But there's a theoretical ceiling, and it's worth understanding even as engineers find ways around it.
The theoretical trade-off. Transformers keep a "perfect" copy of the past by storing the entire history in memory (the KV cache). Every token can attend to every other token. That's powerful, but the attention computation scales quadratically with sequence length, and the cache itself grows without bound. Recurrent and state-space models (SSMs) scale linearly because they compress history into a fixed-size state, and that compression is inherently lossy. You can have perfect recall or infinite length, but not both.
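To make the trade-off concrete, here's a minimal sketch. Nothing below is any particular model's update rule; the simplified linear recurrence and the function names are illustrative stand-ins.

```python
import numpy as np

def full_attention_step(query, keys, values):
    """Attend over the entire stored history (the KV cache).
    Recall is exact, but the cache grows with every token and each
    new token touches all of it, so total cost grows quadratically."""
    scores = keys @ query                     # one score per past token
    weights = np.exp(scores - scores.max())   # softmax over the whole history
    weights /= weights.sum()
    return weights @ values                   # weighted mix of the entire past

def recurrent_step(state, x, A, B):
    """Fold the new token into a fixed-size state.
    Cost per step is constant no matter how long the history is,
    but old details survive only if they fit inside `state` -- the
    compression is lossy by construction."""
    return A @ state + B @ x                  # simplified linear update
```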
The practical reality. Engineers don't care about "infinite" in the mathematical sense. They care about "long enough to be useful." And they're finding ways to cheat the trade-off.
Hybrid architectures like Jamba interleave SSM layers (for efficient processing of the "gist") with attention layers (for sharp associative recall of specific details). Techniques like Infini-attention compress old context into memory blocks that attention can still query. The line between "state" and "attention" is blurring.
We're moving away from a single "context window" toward a memory hierarchy: attention as L1 cache (perfect recall, expensive, immediate context), SSM/compression as RAM (high capacity, lossy, long-term thread), and RAG/vector databases as disk (infinite capacity, slow retrieval, cold storage).
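One way to picture that hierarchy is as a lookup policy: answer from the cheapest tier that can, and escalate only when it can't. The sketch below is purely illustrative; the tiers and their `search`/`retrieve` methods are hypothetical stand-ins, not a real library's API.

```python
class TieredMemory:
    """Illustrative only: three memory tiers, queried cheapest-first."""

    def __init__(self, context_window, compressed_memory, vector_store):
        self.context = context_window        # "L1": exact recall, small, in the prompt
        self.compressed = compressed_memory  # "RAM": lossy summary of the long thread
        self.store = vector_store            # "disk": everything, retrieved on demand

    def recall(self, query):
        hit = self.context.search(query)     # hypothetical interface
        if hit is not None:
            return hit
        hit = self.compressed.search(query)
        if hit is not None:
            return hit
        return self.store.retrieve(query, top_k=3)
```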
But attention still degrades with scale. Even with these advances, models don't attend equally well to everything in a massive context. Research on "lost in the middle" effects shows measurably worse recall of information buried in the middle of a long context than of information at the beginning or end. In a small window, the signal stays clear.
More context creates harder retrieval problems. Having access to everything doesn't mean knowing what's relevant. A model with the entire codebase in context still has to decide which parts matter for the current task. That decision-making itself requires judgment and focus.
Paradoxically, constrained context can improve performance by forcing the question: "What specifically do I need for this one tree?"
The future isn't "bigger windows solve everything." It's smarter decomposition.
Retrieval over stuffing. Instead of loading everything, fetch what's relevant. RAG improves, embeddings get smarter, models learn to query for what they need. But effective retrieval requires knowing what to ask for. "Show me everything" isn't a query. "Show me the trees affected by this drainage pattern" is.
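A sketch of what that looks like in practice, assuming some embedding function and vector index (`embed` and `index.query` below are placeholders, not a specific library's API):

```python
def build_context(task_description, index, embed, budget_tokens=4_000):
    """Fetch only what this task needs, and stop at a token budget."""
    query_vector = embed(task_description)       # a specific question,
    hits = index.query(query_vector, top_k=20)   # not "show me everything"
    context, used = [], 0
    for chunk in hits:
        if used + chunk.token_count > budget_tokens:
            break
        context.append(chunk.text)
        used += chunk.token_count
    return "\n\n".join(context)
```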
Hierarchical abstraction. Understanding at different granularities. The agent doesn't need the full codebase. It needs the right view at the right level: sometimes the individual tree, sometimes the stand, sometimes the ridgeline. Good decomposition means knowing which level matters for the current task.
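As a sketch, the same codebase might expose three views, and the agent asks for the one the task calls for. The level names and the `codebase` methods below are invented for illustration:

```python
def view(codebase, level, target=None):
    """Return the right view at the right level -- no more, no less."""
    if level == "tree":        # one function, full source
        return codebase.function_source(target)
    if level == "stand":       # one module: signatures and docstrings only
        return codebase.module_summary(target)
    if level == "ridgeline":   # whole system: packages and their dependencies
        return codebase.dependency_graph()
    raise ValueError(f"unknown level: {level!r}")
```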
Persistent external state. Memory systems, scratchpads, external stores the model can read and write. Context becomes a working buffer for active reasoning, not a warehouse for everything that might be relevant. The forester's notebook, not a library of every forestry text ever written.
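A minimal version of that notebook, assuming nothing fancier than a JSON file on disk:

```python
import json
from pathlib import Path

class Scratchpad:
    """The forester's notebook: notes persist outside the context window."""

    def __init__(self, path="scratchpad.json"):
        self.path = Path(path)
        self.notes = json.loads(self.path.read_text()) if self.path.exists() else {}

    def write(self, key, value):
        self.notes[key] = value
        self.path.write_text(json.dumps(self.notes, indent=2))

    def read(self, key, default=None):
        return self.notes.get(key, default)
```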
Decomposition as a core capability. Models get better at breaking big problems into bounded subtasks. This isn't scaffolding we impose from outside. It becomes internalized as a reasoning strategy. The model learns to ask: "What's the smallest tree I can work on that moves this forward?"
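Put together, the loop looks something like this. The hooks (`plan_subtasks`, `run_subtask`, `verify`) are placeholders for whatever model calls and checks you actually wire in; the shape of the loop is the point.

```python
def solve(problem, plan_subtasks, run_subtask, verify, scratchpad):
    """Work one bounded subtask at a time, checkpointing as you go."""
    for subtask in plan_subtasks(problem):      # "what's the smallest tree?"
        result = run_subtask(subtask)           # bounded scope, clear focus
        if not verify(subtask, result):         # verifiable progress
            raise RuntimeError(f"subtask needs rework: {subtask}")
        scratchpad.write(str(subtask), result)  # checkpoint, then move on
    return scratchpad
```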
Start embracing the fact that many trees make a forest.
The question isn't "how much can I fit?" It's "what's the smallest unit of work that can succeed independently?" Get that right, and the bigger picture assembles itself. Get it wrong, and no amount of context will save you.
The skill to develop isn't "how do I give the agent more context?" It's "how do I decompose this problem into trees that don't need to see the whole forest?"
That's not a workaround for today's limitations. It's how complex work actually gets done, now and always.