The Shift
Infrastructure services should be built by AI agents, not just for programmatic consumers. AI drives the entire lifecycle: writing code, deploying changes, operating the service, diagnosing issues, and iterating toward better outcomes.
The Vision
An AI harness that observes the service, generates work for the agent, and hill-climbs toward goals. The service stays deterministic and testable. The harness makes it autonomous.
AI operates on the service, not in the service.
Three Principles
For AI to drive infrastructure development, three principles must hold:
Hermetic
Fully contained in observable and controllable interfaces. No hidden state. No out-of-band actions.
Ticket-Driven
Work flows through structured intake. Both humans and AI create tickets with explicit objectives.
Invariant-Based
AI defines success criteria and verifies them continuously through observable metrics.
Hermetic
The system is fully contained within its observable and controllable interfaces. All state is queryable via API. All actions go through APIs. All effects are captured in events and metrics. Nothing exists outside the boundary.
If the AI can't see it, it doesn't exist. No SSH, no consoles, no config files on hosts. No break-glass procedures that assume a human with institutional context.1
This matters because it makes the system easier to reason about, for humans and AI alike.
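A hermetic boundary can be sketched as a single gateway class: state is readable only through the API, actions happen only through the API, and every effect lands in an observable event log. The names here (`Service`, `act`, `get_state`) are illustrative, not a prescribed interface:

```python
from dataclasses import dataclass, field


@dataclass
class Service:
    """Hypothetical hermetic service: all state and actions cross one boundary."""
    _state: dict = field(default_factory=dict)
    events: list = field(default_factory=list)  # every effect is captured here

    def get_state(self, key: str):
        """All state is queryable via the API; there is no out-of-band access."""
        return self._state.get(key)

    def act(self, action: str, **params):
        """All actions go through the API and emit an observable event."""
        if action == "set":
            self._state[params["key"]] = params["value"]
        else:
            raise ValueError(f"unknown action: {action}")  # no hidden side channels
        self.events.append({"action": action, "params": params})


svc = Service()
svc.act("set", key="replicas", value=3)
```

If the AI wants to know anything about `svc`, it reads `get_state` or the event log; there is nothing else to know.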
Ticket-Driven
The ticket pump drives agentic activity and provides a ledger of requests, which is itself observable. Every ticket includes acceptance criteria: explicit conditions the agent uses to determine success or failure.
Tickets help with the context window problem. Each ticket contains everything the agent needs to execute: objective, acceptance criteria, constraints, relevant context. The agent doesn't need to understand the full system history to work a ticket - just the ticket itself. This is leaf recursion: break complex improvements into discrete tasks that can be verified independently. The system gets better one ticket at a time.
Alerts don't page humans. Alerts create tickets. The agent works the ticket.
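One way to make this concrete, with hypothetical field names: a ticket carries everything the agent needs to execute, and an alert is just another ticket producer feeding the same queue.

```python
from dataclasses import dataclass, field


@dataclass
class Ticket:
    """Hypothetical ticket shape: self-contained, so the agent needs no history."""
    objective: str
    acceptance_criteria: list      # explicit pass/fail conditions
    constraints: list = field(default_factory=list)
    context: dict = field(default_factory=dict)


def ticket_from_alert(alert: dict) -> Ticket:
    """Alerts don't page humans; they become tickets in the same queue."""
    return Ticket(
        objective=f"Resolve alert: {alert['name']}",
        acceptance_criteria=[f"{alert['metric']} back within threshold"],
        context={"firing_since": alert["since"]},
    )


t = ticket_from_alert({"name": "high_p99", "metric": "p99_latency_ms", "since": "09:12Z"})
```

The agent works `t` from its fields alone; nothing about the system's history is required.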
SLOs & Invariants
SLOs are user-facing commitments with error budgets. Invariants are internal operational conditions that are binary pass/fail. Both generate tickets when violated.
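The distinction can be sketched in a few lines (thresholds, metric names, and ticket text are invented for illustration): an SLO burns an error budget, an invariant is a strict binary check, and either produces a ticket when violated.

```python
def slo_violated(good_events: int, total_events: int, objective: float) -> bool:
    """SLO: a user-facing commitment with an error budget (e.g. 99.9% success)."""
    allowed_bad = total_events * (1 - objective)  # the error budget
    return (total_events - good_events) > allowed_bad


def invariant_violated(observed: dict) -> bool:
    """Invariant: an internal operational condition that is strictly pass/fail."""
    return observed["queue_depth"] > observed["queue_capacity"]


tickets = []
# 20 bad events against a budget of 10: the SLO is breached.
if slo_violated(good_events=9_980, total_events=10_000, objective=0.999):
    tickets.append("SLO breach: success rate below objective")
# 12 <= 100: the invariant holds, so no ticket is generated.
if invariant_violated({"queue_depth": 12, "queue_capacity": 100}):
    tickets.append("Invariant breach: queue over capacity")
```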
Three Surfaces
For AI to drive infrastructure development, three surfaces must be navigable:
Development
AI writes and modifies code. Requires legible codebase, standard structure, fast tests.
Operational
AI runs and tunes the system. Requires enumerated actions, structured feedback, bounded risk.
Knowledge
AI learns about the system. Requires canonical docs, decision records, semantic metadata.
The Development Surface requires codebases that AI can modify safely. Explicit conventions, no tribal knowledge, documentation that explains intent. An AGENTS.md file at the root that maps everything. Every component has a README explaining its purpose, interface, dependencies, and failure modes.
The Operational Surface requires structured observability: events and metrics the AI can query and interpret. Every action has an enumerated effect. Machine-readable runbooks define decision trees for common failure modes. Risk levels are explicit on all actions.
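A sketch of what "enumerated actions with explicit risk levels" might look like; the registry, action names, and risk tiers are assumptions, not a prescribed API:

```python
from enum import Enum


class Risk(Enum):
    LOW = "low"        # safe to auto-apply
    MEDIUM = "medium"  # auto-apply, but watch invariants closely
    HIGH = "high"      # requires approval via a ticket


# Every action the AI may take is enumerated, with its risk made explicit.
ACTIONS = {
    "restart_worker": Risk.LOW,
    "scale_replicas": Risk.MEDIUM,
    "change_schema": Risk.HIGH,
}


def allowed_without_approval(action: str) -> bool:
    """Anything not in the registry simply does not exist as an action."""
    if action not in ACTIONS:
        raise KeyError(f"action not enumerated: {action}")
    return ACTIONS[action] is not Risk.HIGH
```

Bounded risk falls out of the structure: an un-enumerated action is an error, not a capability.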
The Knowledge Surface requires canonical documentation that's tested for accuracy. Architecture Decision Records explain why things are the way they are. Schema definitions are the source of truth. Event and metric catalogs document all observability with semantic metadata.
The Loop
These principles create a closed feedback loop. Two flows drive everything:
1. Ticket Generation
SLO violations, invariant breaches, anomalies, and human requests all flow into the same queue.
2. Ticket Resolution
The agent works the ticket through a deterministic pipeline until acceptance criteria are met.
Work comes in through tickets (human-generated or self-generated). AI takes actions through the hermetic API surface. Effects are observable through events and metrics. Invariants are continuously evaluated against observations. Violations or opportunities generate new tickets. The loop continues.
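The loop above can be written down in a few lines of Python; every function name here is a placeholder for the source's concepts, wired up with a tiny synthetic violation that resolves on the first pass.

```python
def run_loop(observe, evaluate_invariants, work, queue, max_iterations=10):
    """One closed feedback loop: observe -> evaluate -> ticket -> act -> observe."""
    for _ in range(max_iterations):
        observations = observe()                         # events and metrics via APIs
        queue.extend(evaluate_invariants(observations))  # violations become tickets
        if not queue:
            break
        work(queue.pop(0))                               # act through the hermetic surface


# Tiny illustration: one synthetic invariant violation, fixed by working its ticket.
state = {"healthy": False}
resolved = []
run_loop(
    observe=lambda: state,
    evaluate_invariants=lambda obs: [] if obs["healthy"] else ["unhealthy"],
    work=lambda ticket: (resolved.append(ticket), state.update(healthy=True)),
    queue=[],
)
```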
What Changes
What Doesn't Change
This isn't a rewrite of everything. Most of the stack stays the same:
- Thrift/Protobuf as the interchange format
- RPCs as the communication mechanism
- Existing monitoring and observability systems
- System dependencies and service topology
Open Questions
Some design choices are genuinely unsettled. These are worth debating:
Service-local revision control. Should each service have its own git repo that agents commit to directly? This eliminates CI wait times and allows agents to branch for parallel exploration. But it diverges from centralized monorepo patterns and complicates code review.
Service-local canaries. Should canary deployments be bespoke per-service, developed by the agent that knows the service best? Or should there be a general canary system that all services use? Bespoke canaries can be smarter but harder to audit.
The Laboratory
Every ticket is a trajectory. Every trajectory is training data.
The harness captures everything: the initial state, the goal, each action taken, the observations after each action, the final outcome. This isn't just an audit trail; it's a dataset of infrastructure problem-solving.
A trajectory: Initial state → Action → Observation → Action → Observation → ... → Outcome (success/failure)
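As a record, a trajectory might look like this (the field names and the example metric are illustrative, not from the source):

```python
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    """One ticket's full history: the unit of infrastructure training data."""
    initial_state: dict
    goal: str
    steps: list = field(default_factory=list)  # (action, observation) pairs
    outcome: str = "open"                      # "success" or "failure" at the end

    def record(self, action: str, observation: dict):
        self.steps.append((action, observation))


traj = Trajectory(initial_state={"p99_ms": 480}, goal="p99_ms < 300")
traj.record("scale_replicas", {"p99_ms": 310})
traj.record("tune_cache", {"p99_ms": 240})
traj.outcome = "success"
```

The outcome is verifiable against the goal, and every intermediate observation is preserved.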
These trajectories have properties that make them valuable for learning:
Grounded in reality. The observations come from real systems: actual metrics, actual logs, actual test results. The actions have real effects. This isn't simulation.
Verifiable outcomes. Success criteria are defined upfront and measured objectively. Did latency go down? Did the test pass? Did the invariant hold? No ambiguity.
Naturally varied. Different tickets, different failure modes, different system states. The diversity comes from real operations, not synthetic generation.
Human-labeled at key points. When a ticket requires approval, a human reviews the plan. When something escalates, a human intervenes. These decision points are gold.
This is a laboratory for infrastructure-focused reinforcement learning.
Today, we run the harness to operate infrastructure. Tomorrow, we mine those trajectories to train better models. The same system that runs production is the system that generates training data for the next generation of infrastructure agents.
The feedback loop closes: better models → better operations → richer trajectories → better models.
Self-Reflection
On a regular schedule, the agent examines its own trajectories. Not individual tickets, but patterns across tickets. It looks for:
Hysteresis. Did we fix the same thing twice? Three times? Is the fix not sticking?
Oscillation. Are we toggling between states? Scaling up then down then up?
Inability to converge. Did we try multiple approaches and none worked? Are we stuck?
Repeated escalations. Do certain problem types always end up with humans?
This meta-analysis is when the agent calls for help: not because a single ticket failed, but because a pattern suggests something is beyond its current capability. The agent says: "I've tried to fix this three times and it keeps coming back. I don't understand why. Can a human look?"
Knowing when to ask for help is a skill. The trajectories teach it.
The End State
The service improves overnight.
The AI observed memory utilization trending upward. It analyzed usage patterns and identified inefficient allocation in one component. It wrote a fix, validated it against invariants, and deployed it through the standard pipeline. Cold-start latency improved by 12%.
A human reviews the weekly summary. Sees the change. Can inspect the reasoning, audit the decision, revert if needed. But the human didn't have to initiate it, specify it, or implement it.
Infrastructure that gets better on its own.
Start with humans out of the loop. Design for AI from day one. Retrofit is harder than greenfield.
Let's build one and prove it works.
1 Of course all systems should be human-interruptible and -inspectable. The point is to move to AI-first, not AI-later. ↩