The Shift
Infrastructure services should be built by AI agents, not just for programmatic consumers. AI drives the entire lifecycle: writing code, deploying changes, operating the service, diagnosing issues, and iterating toward better outcomes.
The Vision
An AI harness that observes the service, generates work for the agent, and hill-climbs toward goals. The service stays deterministic and testable. The harness makes it autonomous.
AI operates on the service, not in the service.
Three Principles
For AI to drive infrastructure development, three principles must hold:
Hermetic
Fully contained in observable and controllable interfaces. No hidden state. No out-of-band actions.
Ticket-Driven
Work flows through structured intake. Both humans and AI create tickets with explicit objectives.
Invariant-Based
AI defines success criteria and verifies them continuously through observable metrics.
Hermetic
The system is fully contained within its observable and controllable interfaces. All state is queryable via API. All actions go through APIs. All effects are captured in events and metrics. Nothing exists outside the boundary.
If the AI can't see it, it doesn't exist. No SSH, no consoles, no config files on hosts. No break-glass procedures that assume a human with institutional context.1
This matters because it makes the system easier to reason about, for humans and AI alike.
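A hermetic boundary can be sketched as a single gateway class: state is readable only through the API, actions happen only through the API, and every effect lands in an observable event log. The names here (`Service`, `act`, `get_state`) are illustrative, not a prescribed interface:

```python
from dataclasses import dataclass, field


@dataclass
class Service:
    """Hypothetical hermetic service: all state and actions cross one boundary."""
    _state: dict = field(default_factory=dict)
    events: list = field(default_factory=list)  # every effect is captured here

    def get_state(self, key: str):
        """All state is queryable via the API; there is no out-of-band access."""
        return self._state.get(key)

    def act(self, action: str, **params):
        """All actions go through the API and emit an observable event."""
        if action == "set":
            self._state[params["key"]] = params["value"]
        else:
            raise ValueError(f"unknown action: {action}")  # no hidden side channels
        self.events.append({"action": action, "params": params})


svc = Service()
svc.act("set", key="replicas", value=3)
```

If the AI wants to know anything about `svc`, it reads `get_state` or the event log; there is nothing else to know.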
Ticket-Driven
The ticket pump drives agentic activity and provides a ledger of requests, which is itself observable. Every ticket includes acceptance criteria: explicit conditions the agent uses to determine success or failure.
Tickets help with the context window problem. Each ticket contains everything the agent needs to execute: objective, acceptance criteria, constraints, relevant context. The agent doesn't need to understand the full system history to work a ticket - just the ticket itself. This is leaf recursion: break complex improvements into discrete tasks that can be verified independently. The system gets better one ticket at a time.
Alerts don't page humans. Alerts create tickets. The agent works the ticket.
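One way to make this concrete, with hypothetical field names: a ticket carries everything the agent needs to execute, and an alert is just another ticket producer feeding the same queue.

```python
from dataclasses import dataclass, field


@dataclass
class Ticket:
    """Hypothetical ticket shape: self-contained, so the agent needs no history."""
    objective: str
    acceptance_criteria: list      # explicit pass/fail conditions
    constraints: list = field(default_factory=list)
    context: dict = field(default_factory=dict)


def ticket_from_alert(alert: dict) -> Ticket:
    """Alerts don't page humans; they become tickets in the same queue."""
    return Ticket(
        objective=f"Resolve alert: {alert['name']}",
        acceptance_criteria=[f"{alert['metric']} back within threshold"],
        context={"firing_since": alert["since"]},
    )


t = ticket_from_alert({"name": "high_p99", "metric": "p99_latency_ms", "since": "09:12Z"})
```

The agent works `t` from its fields alone; nothing about the system's history is required.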
SLOs & Invariants
SLOs are user-facing commitments with error budgets. Invariants are internal operational conditions that are binary pass/fail. Both generate tickets when violated.
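The distinction can be sketched in a few lines (thresholds, metric names, and ticket text are invented for illustration): an SLO burns an error budget, an invariant is a strict binary check, and either produces a ticket when violated.

```python
def slo_violated(good_events: int, total_events: int, objective: float) -> bool:
    """SLO: a user-facing commitment with an error budget (e.g. 99.9% success)."""
    allowed_bad = total_events * (1 - objective)  # the error budget
    return (total_events - good_events) > allowed_bad


def invariant_violated(observed: dict) -> bool:
    """Invariant: an internal operational condition that is strictly pass/fail."""
    return observed["queue_depth"] > observed["queue_capacity"]


tickets = []
# 20 bad events against a budget of 10: the SLO is breached.
if slo_violated(good_events=9_980, total_events=10_000, objective=0.999):
    tickets.append("SLO breach: success rate below objective")
# 12 <= 100: the invariant holds, so no ticket is generated.
if invariant_violated({"queue_depth": 12, "queue_capacity": 100}):
    tickets.append("Invariant breach: queue over capacity")
```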
Three Surfaces
For AI to drive infrastructure development, three surfaces must be navigable:
Development
AI writes and modifies code. Requires legible codebase, standard structure, fast tests.
Operational
AI runs and tunes the system. Requires enumerated actions, structured feedback, bounded risk.
Knowledge
AI learns about the system. Requires canonical docs, decision records, semantic metadata.
The Development Surface requires codebases that AI can modify safely. Explicit conventions, no tribal knowledge, documentation that explains intent. An AGENTS.md file at the root that maps everything. Every component has a README explaining its purpose, interface, dependencies, and failure modes.
The Operational Surface requires structured observability: events and metrics the AI can query and interpret. Every action has an enumerated effect. Machine-readable runbooks define decision trees for common failure modes. Risk levels are explicit on all actions.
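A sketch of what "enumerated actions with explicit risk levels" might look like; the registry, action names, and risk tiers are assumptions, not a prescribed API:

```python
from enum import Enum


class Risk(Enum):
    LOW = "low"        # safe to auto-apply
    MEDIUM = "medium"  # auto-apply, but watch invariants closely
    HIGH = "high"      # requires approval via a ticket


# Every action the AI may take is enumerated, with its risk made explicit.
ACTIONS = {
    "restart_worker": Risk.LOW,
    "scale_replicas": Risk.MEDIUM,
    "change_schema": Risk.HIGH,
}


def allowed_without_approval(action: str) -> bool:
    """Anything not in the registry simply does not exist as an action."""
    if action not in ACTIONS:
        raise KeyError(f"action not enumerated: {action}")
    return ACTIONS[action] is not Risk.HIGH
```

Bounded risk falls out of the structure: an un-enumerated action is an error, not a capability.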
The Knowledge Surface requires canonical documentation that's tested for accuracy. Architecture Decision Records explain why things are the way they are. Schema definitions are the source of truth. Event and metric catalogs document all observability with semantic metadata.
The Loop
These principles create a closed feedback loop. Two flows drive everything:
1. Ticket Generation
SLO violations, invariant breaches, anomalies, and human requests all flow into the same queue.
2. Ticket Resolution
The agent works the ticket through a deterministic pipeline until acceptance criteria are met.
Work comes in through tickets (human-generated or self-generated). AI takes actions through the hermetic API surface. Effects are observable through events and metrics. Invariants are continuously evaluated against observations. Violations or opportunities generate new tickets. The loop continues.
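The loop above can be written down in a few lines of Python; every function name here is a placeholder for the source's concepts, wired up with a tiny synthetic violation that resolves on the first pass.

```python
def run_loop(observe, evaluate_invariants, work, queue, max_iterations=10):
    """One closed feedback loop: observe -> evaluate -> ticket -> act -> observe."""
    for _ in range(max_iterations):
        observations = observe()                         # events and metrics via APIs
        queue.extend(evaluate_invariants(observations))  # violations become tickets
        if not queue:
            break
        work(queue.pop(0))                               # act through the hermetic surface


# Tiny illustration: one synthetic invariant violation, fixed by working its ticket.
state = {"healthy": False}
resolved = []
run_loop(
    observe=lambda: state,
    evaluate_invariants=lambda obs: [] if obs["healthy"] else ["unhealthy"],
    work=lambda ticket: (resolved.append(ticket), state.update(healthy=True)),
    queue=[],
)
```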
What Changes
What Doesn't Change
This isn't a rewrite of everything. Most of the stack stays the same:
- Thrift/Protobuf as the interchange format
- RPCs as the communication mechanism
- Existing monitoring and observability systems
- System dependencies and service topology
Open Questions
Some design choices are genuinely unsettled. These are worth debating:
Service-local revision control. Should each service have its own git repo that agents commit to directly? This eliminates CI wait times and allows agents to branch for parallel exploration. But it diverges from centralized monorepo patterns and complicates code review.
Service-local canaries. Should canary deployments be bespoke per-service, developed by the agent that knows the service best? Or should there be a general canary system that all services use? Bespoke canaries can be smarter but harder to audit.
The Laboratory
Every ticket is a trajectory. Every trajectory is training data.
The harness captures everything: the initial state, the goal, each action taken, the observations after each action, the final outcome. This isn't just an audit trail; it's a dataset of infrastructure problem-solving.
A trajectory: Initial state → Action → Observation → Action → Observation → ... → Outcome (success/failure)
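As a record, a trajectory might look like this (the field names and the example metric are illustrative, not from the source):

```python
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    """One ticket's full history: the unit of infrastructure training data."""
    initial_state: dict
    goal: str
    steps: list = field(default_factory=list)  # (action, observation) pairs
    outcome: str = "open"                      # "success" or "failure" at the end

    def record(self, action: str, observation: dict):
        self.steps.append((action, observation))


traj = Trajectory(initial_state={"p99_ms": 480}, goal="p99_ms < 300")
traj.record("scale_replicas", {"p99_ms": 310})
traj.record("tune_cache", {"p99_ms": 240})
traj.outcome = "success"
```

The outcome is verifiable against the goal, and every intermediate observation is preserved.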
These trajectories have properties that make them valuable for learning:
Grounded in reality. The observations come from real systems: actual metrics, actual logs, actual test results. The actions have real effects. This isn't simulation.
Verifiable outcomes. Success criteria are defined upfront and measured objectively. Did latency go down? Did the test pass? Did the invariant hold? No ambiguity.
Naturally varied. Different tickets, different failure modes, different system states. The diversity comes from real operations, not synthetic generation.
Human-labeled at key points. When a ticket requires approval, a human reviews the plan. When something escalates, a human intervenes. These decision points are gold.
This is a laboratory for infrastructure-focused reinforcement learning.
Today, we run the harness to operate infrastructure. Tomorrow, we mine those trajectories to train better models. The same system that runs production is the system that generates training data for the next generation of infrastructure agents.
The feedback loop closes: better models → better operations → richer trajectories → better models.
Self-Reflection
On a regular schedule, the agent examines its own trajectories. Not individual tickets, but patterns across tickets. It looks for:
Hysteresis. Did we fix the same thing twice? Three times? Is the fix not sticking?
Oscillation. Are we toggling between states? Scaling up then down then up?
Inability to converge. Did we try multiple approaches and none worked? Are we stuck?
Repeated escalations. Do certain problem types always end up with humans?
This meta-analysis is when the agent calls for help: not because a single ticket failed, but because a pattern suggests something is beyond its current capability. The agent says: "I've tried to fix this three times and it keeps coming back. I don't understand why. Can a human look?"
Knowing when to ask for help is a skill. The trajectories teach it.
The End State
The service improves overnight.
The AI observed memory utilization trending upward. It analyzed usage patterns and identified inefficient allocation in one component. It wrote a fix, validated it against invariants, and deployed it through the standard pipeline. Cold-start latency improved by 12%.
A human reviews the weekly summary. Sees the change. Can inspect the reasoning, audit the decision, revert if needed. But the human didn't have to initiate it, specify it, or implement it.
Infrastructure that gets better on its own.
Start with humans out of the loop. Design for AI from day one. Retrofit is harder than greenfield.
Let's build one and prove it works.
1 Of course all systems should be human-interruptible and -inspectable. The point is to move to AI-first, not AI-later. ↩