AI-Native Infrastructure

Building systems that AI can develop, operate, and improve

Rethinking infra services from the ground up.

JR Tipton · 2026

Chapter 1

The Shift

Infrastructure services should be built by AI agents, not just for programmatic consumers. AI drives the entire lifecycle: writing code, deploying changes, operating the service, diagnosing issues, iterating toward something better.

Chapter 2

The Vision

An AI harness that observes the service, generates work for the agent, and hill-climbs toward goals. The service stays deterministic and testable. The harness makes it autonomous.

[Diagram: SERVICE (rate limiter) emits metrics → OBSERVABILITY → CHECK SLOs & INVARIANTS → TICKET QUEUE → AGENT MAKES CHANGES → changes flow back to the SERVICE.]

AI operates on the service, not in the service.

Chapter 3

Three Principles

For AI to drive infrastructure development, three principles must hold:

Hermetic

Fully contained in observable and controllable interfaces. No hidden state. No out-of-band actions.

Ticket-Driven

Work flows through structured intake. Both humans and AI create tickets with explicit objectives.

Invariant-Based

AI defines success criteria and verifies them continuously through observable metrics.

Hermetic

The system is fully contained within its observable and controllable interfaces. All state is queryable via API. All actions go through APIs. All effects are captured in events and metrics. Nothing exists outside the boundary.

If the AI can't see it, it doesn't exist. No SSH, no consoles, no config files on hosts. No break-glass procedures that assume a human with institutional context.1

This matters because it makes the system easier to reason about, for humans and AI alike.
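
As a rough sketch, the hermetic boundary might be expressed as a single typed interface. The names here (query_state, execute_action, stream_events, read_metrics) are hypothetical, but the constraint is the point: every read and every write crosses this surface, and nothing else exists.

# Hypothetical sketch of a hermetic service boundary: every read, write,
# and effect crosses one typed API surface. Names are illustrative only.
from dataclasses import dataclass
from typing import Iterator, Protocol


@dataclass
class Event:
    timestamp: float
    kind: str          # e.g. "deploy", "config_change", "alert"
    payload: dict


class HermeticService(Protocol):
    def query_state(self, path: str) -> dict:
        """Return any piece of system state by path; nothing is hidden."""
        ...

    def execute_action(self, action: str, args: dict) -> Event:
        """Perform an enumerated action; the returned Event records its effect."""
        ...

    def stream_events(self) -> Iterator[Event]:
        """Yield every effect the service has produced, in order."""
        ...

    def read_metrics(self, name: str, window_s: int) -> list[float]:
        """Return raw samples for a metric over a time window."""
        ...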

Ticket-Driven

The ticket pump drives agentic activity and provides a ledger of requests, which itself is observable. Every ticket includes acceptance criteria: explicit conditions the agent uses to determine success or failure.

ticket:
  id: "INF-4521"
  type: improvement
  objective:
    description: "Reduce cold start latency"
  acceptance_criteria:
    - metric: cold_start_p99
      condition: "< 500ms"
      duration: "1h sustained"
    - metric: throughput
      condition: "no regression"
  constraints:
    - no changes during peak hours

Tickets help with the context window problem. Each ticket contains everything the agent needs to execute: objective, acceptance criteria, constraints, relevant context. The agent doesn't need to understand the full system history to work a ticket, just the ticket itself. This is leaf recursion: break complex improvements into discrete tasks that can be verified independently. The system gets better one ticket at a time.

Alerts don't page humans. Alerts create tickets. The agent works the ticket.
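
As a hedged illustration of how acceptance criteria stay mechanical, the check below evaluates a condition like the ticket example's cold_start_p99 threshold against observed samples. The parsing helper and the sample data are assumptions for the sketch, not a defined format.

# Illustrative sketch: evaluate a ticket's acceptance criteria against
# observed metric samples. Field names mirror the ticket example above;
# the threshold parser is a hypothetical stand-in.
from dataclasses import dataclass


@dataclass
class Criterion:
    metric: str
    condition: str       # e.g. "< 500ms"
    duration: str = ""   # e.g. "1h sustained"


def parse_threshold(condition: str) -> tuple[str, float]:
    # "< 500ms" -> ("<", 500.0); units are stripped for the sketch
    op, raw = condition.split(maxsplit=1)
    value = float("".join(ch for ch in raw if ch.isdigit() or ch == "."))
    return op, value


def criterion_met(samples: list[float], criterion: Criterion) -> bool:
    """True if every sample in the window satisfies the condition."""
    op, threshold = parse_threshold(criterion.condition)
    if op == "<":
        return all(s < threshold for s in samples)
    if op == ">":
        return all(s > threshold for s in samples)
    raise ValueError(f"unsupported condition: {criterion.condition}")


# Example: p99 cold start samples (ms) over the sustained window
samples = [430.0, 455.2, 480.9, 470.1]
print(criterion_met(samples, Criterion("cold_start_p99", "< 500ms", "1h sustained")))  # True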

SLOs & Invariants

SLOs are user-facing commitments with error budgets. Invariants are internal operational conditions that are binary pass/fail. Both generate tickets when violated.

# SLOs - user-facing commitments
slos:
  - name: "availability"
    target: 99.9%
    window: 30d
    burn_rate_alert: 14.4x  # 1h budget in 5m

# Invariants - internal conditions
invariants:
  - name: "capacity_headroom"
    condition: available > 20%
    on_violation: auto_remediate
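
A minimal sketch of the violation-to-ticket path follows, assuming a hypothetical metric reading and an in-memory queue; the ticket shape mirrors the YAML example above.

# Illustrative only: check an invariant and emit a ticket on violation.
# The metric source and queue are hypothetical stand-ins.
import uuid


def check_capacity_headroom(available_fraction: float, ticket_queue: list) -> None:
    """Invariant: available capacity must stay above 20%."""
    if available_fraction > 0.20:
        return  # invariant holds, nothing to do
    ticket_queue.append({
        "id": f"INF-{uuid.uuid4().hex[:8]}",
        "type": "remediation",
        "objective": {"description": "Restore capacity headroom above 20%"},
        "acceptance_criteria": [
            {"metric": "capacity_headroom", "condition": "> 20%", "duration": "30m sustained"},
        ],
        "context": {"observed_headroom": available_fraction},
    })


queue: list = []
check_capacity_headroom(0.12, queue)   # violation -> one ticket in the queue
print(len(queue))                      # 1
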
Chapter 4

Three Surfaces

For AI to drive infrastructure development, three surfaces must be navigable:

Development

AI writes and modifies code. Requires legible codebase, standard structure, fast tests.

Operational

AI runs and tunes the system. Requires enumerated actions, structured feedback, bounded risk.

Knowledge

AI learns about the system. Requires canonical docs, decision records, semantic metadata.

The Development Surface requires codebases that AI can modify safely. Explicit conventions, no tribal knowledge, documentation that explains intent. An AGENTS.md file at the root maps everything. Every component has a README explaining its purpose, interface, dependencies, and failure modes.

The Operational Surface requires structured observability: events and metrics the AI can query and interpret. Every action has an enumerated effect. Machine-readable runbooks define decision trees for common failure modes. Risk levels are explicit on all actions.
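
One plausible shape for a machine-readable runbook is a small decision tree the agent walks node by node. The format below is a hypothetical sketch, not an established schema; node names, observations, and actions are all illustrative.

# Hypothetical sketch of a runbook as a decision tree. Each node names an
# observation to make and maps its outcomes to either another node or an
# enumerated action with an explicit risk level.
RUNBOOK_HIGH_LATENCY = {
    "start": {
        "observe": "error_rate",
        "branches": {
            "elevated": "check_dependency",
            "normal": "check_saturation",
        },
    },
    "check_dependency": {
        "observe": "downstream_latency",
        "branches": {
            "elevated": {"action": "open_circuit_breaker", "risk": "medium"},
            "normal": {"action": "escalate", "risk": "low"},
        },
    },
    "check_saturation": {
        "observe": "cpu_utilization",
        "branches": {
            "high": {"action": "scale_out", "risk": "low"},
            "normal": {"action": "escalate", "risk": "low"},
        },
    },
}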

The Knowledge Surface requires canonical documentation that's tested for accuracy. Architecture Decision Records explain why things are the way they are. Schema definitions are the source of truth. Event and metric catalogs document all observability with semantic metadata.
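
A sketch of what a catalog entry with semantic metadata might carry; every field name here is illustrative. The idea is that units, meaning, and interpretation hints travel with the metric rather than living in someone's head.

# Illustrative metric-catalog entry. Field names are hypothetical.
COLD_START_P99 = {
    "name": "cold_start_p99",
    "unit": "milliseconds",
    "kind": "latency_percentile",
    "description": "99th percentile time from invocation to first byte served",
    "good_direction": "down",          # lower is better
    "related_slo": "availability",     # which commitment this feeds
    "emitting_component": "dispatcher",
    "typical_range_ms": [200, 600],
}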

Chapter 5

The Loop

These principles create a closed feedback loop. Two flows drive everything:

1. Ticket Generation

SLO violations, invariant breaches, anomalies, and human requests all flow into the same queue.

[Diagram: SLO violation (burn rate exceeded), invariant breach (capacity < 20%), anomaly (latency spike), and human feature requests all flow into the TICKET QUEUE in the same format (objective + acceptance criteria + context); the AGENT picks the next ticket.]

2. Ticket Resolution

The agent works the ticket through a deterministic pipeline until acceptance criteria are met.

[Diagram: AI AGENT (plan + code) → TEST → CANARY → DEPLOY → VERIFY (acceptance met?). Planning and coding are AI-powered; test, canary, deploy, and verify are deterministic.]

Work comes in through tickets (human-generated or self-generated). AI takes actions through the hermetic API surface. Effects are observable through events and metrics. Invariants are continuously evaluated against observations. Violations or opportunities generate new tickets. The loop continues.
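
Pulled together, the loop reads roughly like the sketch below. Every object and method here is a hypothetical stand-in for a stage described above (ticket intake, AI planning, the deterministic pipeline, invariant evaluation), not a real API.

# Hypothetical sketch of the closed loop: tickets in, hermetic actions out,
# observations back in, new tickets generated from violations.
def run_loop(ticket_queue, service, agent):
    while ticket_queue:
        ticket = ticket_queue.pop(0)                      # 1. pick the next ticket
        plan = agent.plan(ticket)                         # 2. AI plans and writes code
        change = agent.implement(plan)
        for stage in (service.test, service.canary, service.deploy):
            result = stage(change)                        # 3. deterministic pipeline stages
            if not result.ok:
                ticket_queue.append(agent.follow_up(ticket, result))
                break
        else:
            if not service.verify(ticket["acceptance_criteria"]):    # 4. acceptance met?
                ticket_queue.append(agent.follow_up(ticket, "criteria not met"))
        ticket_queue.extend(service.evaluate_invariants())           # 5. violations -> new tickets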

Chapter 6

What Changes

Traditional → AI-Native
Dashboards for humans → Health APIs for agents
Runbooks as docs → Runbooks as decision trees
Logs for grep → Structured events for queries
Tribal knowledge → AGENTS.md + canonical docs

What Doesn't Change

This isn't a rewrite of everything. Most of the stack stays the same:

Open Questions

Some design choices are genuinely unsettled. These are worth debating:

Service-local revision control. Should each service have its own git repo that agents commit to directly? This eliminates CI wait times and allows agents to branch for parallel exploration. But it diverges from centralized monorepo patterns and complicates code review.

Service-local canaries. Should canary deployments be bespoke per-service, developed by the agent that knows the service best? Or should there be a general canary system that all services use? Bespoke canaries can be smarter but harder to audit.

Chapter 7

The Laboratory

Every ticket is a trajectory. Every trajectory is training data.

The harness captures everything: the initial state, the goal, each action taken, the observations after each action, the final outcome. This isn't just an audit trail; it's a dataset of infrastructure problem-solving.

A trajectory: Initial state → Action → Observation → Action → Observation → ... → Outcome (success/failure)
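
A minimal sketch of how a trajectory record might be laid out, with illustrative field names; the structure mirrors the description above, initial state, a sequence of action/observation steps, and a verifiable outcome.

# Illustrative trajectory record. Field names are hypothetical.
from dataclasses import dataclass, field


@dataclass
class Step:
    action: dict        # what the agent did, via the hermetic API
    observation: dict   # metrics/events captured after the action


@dataclass
class Trajectory:
    ticket_id: str
    initial_state: dict
    goal: dict                       # objective + acceptance criteria
    steps: list[Step] = field(default_factory=list)
    outcome: str = "pending"         # "success" | "failure" | "escalated"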

These trajectories have properties that make them valuable for learning:

Grounded in reality. The observations come from real systems: actual metrics, actual logs, actual test results. The actions have real effects. This isn't simulation.

Verifiable outcomes. Success criteria are defined upfront and measured objectively. Did latency go down? Did the test pass? Did the invariant hold? No ambiguity.

Naturally varied. Different tickets, different failure modes, different system states. The diversity comes from real operations, not synthetic generation.

Human-labeled at key points. When a ticket requires approval, a human reviews the plan. When something escalates, a human intervenes. These decision points are gold.

This is a laboratory for infrastructure-focused reinforcement learning.

Today, we run the harness to operate infrastructure. Tomorrow, we mine those trajectories to train better models. The same system that runs production is the system that generates training data for the next generation of infrastructure agents.

The feedback loop closes: better models → better operations → richer trajectories → better models.

Self-Reflection

On a regular schedule, the agent examines its own trajectories. Not individual tickets, but patterns across tickets. It looks for:

Hysteresis. Did we fix the same thing twice? Three times? Is the fix not sticking?

Oscillation. Are we toggling between states? Scaling up then down then up?

Inability to converge. Did we try multiple approaches and none worked? Are we stuck?

Repeated escalations. Do certain problem types always end up with humans?

This meta-analysis is when the agent calls for help: not because a single ticket failed, but because a pattern suggests something is beyond its current capability. The agent says: "I've tried to fix this three times and it keeps coming back. I don't understand why. Can a human look?"

Knowing when to ask for help is a skill. The trajectories teach it.
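
As a hedged example, the hysteresis check could be as simple as counting repeat fixes of the same component for the same symptom across successful trajectories; the field names follow the trajectory sketch above and the threshold is arbitrary.

# Illustrative hysteresis check: flag components that were "fixed" for the
# same symptom several times. Trajectory fields follow the sketch above.
from collections import Counter


def find_recurring_fixes(trajectories, threshold: int = 3) -> list[tuple]:
    counts = Counter(
        (t.initial_state.get("component"), t.goal.get("symptom"))
        for t in trajectories
        if t.outcome == "success"
    )
    # Anything "fixed" `threshold` or more times is not actually fixed.
    return [key for key, n in counts.items() if n >= threshold]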

Chapter 8

The End State

The service improves overnight.

The AI observed memory utilization trending upward. It analyzed usage patterns, identified inefficient allocation in one component. It wrote a fix, validated it against invariants, deployed it through the standard pipeline. Cold start latency improved 12%.

A human reviews the weekly summary. Sees the change. Can inspect the reasoning, audit the decision, revert if needed. But didn't have to initiate it, specify it, or implement it.

Infrastructure that gets better on its own.

Start with humans out of the loop. Design for AI from day one. Retrofit is harder than greenfield.

Let's build one and prove it works.


1 Of course all systems should be human-interruptible and -inspectable. The point is to move to AI-first, not AI-later.