Claude Operator
JR Tipton
Can Claude Operate a Distributed System?
I ran a few experiments where Claude acts as a backend service operator.
This came about from a basic question: if we can build services faster than ever, who is going to operate them? I wanted to see whether Claude could operate a service.
Here, 'operator' means something like an SRE intern, or alternatively, automation that isn't just a set of rules.
Putting Claude in the operator seat
I put Claude inside a node that is allowed to make changes to a separate subject system. A monitor service observes the behavior of the subject system; when it sees violations of invariants (e.g. all API latencies should be under 100ms), it files a "ticket" with its observation into a database that Claude can pull from.
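As a rough illustration of that flow, here is a minimal monitor check in Python. The 100ms threshold comes from the example above; the metrics URL and the `file_ticket` helper are hypothetical placeholders, not operator's actual code.

```python
import time
import requests  # any HTTP client would do

LATENCY_LIMIT_MS = 100  # the example invariant: all API latencies under 100ms

def check_latency(api_url: str) -> dict | None:
    """Return an observation if the latency invariant is violated, else None."""
    start = time.monotonic()
    requests.get(api_url, timeout=5)
    elapsed_ms = (time.monotonic() - start) * 1000
    if elapsed_ms > LATENCY_LIMIT_MS:
        return {"invariant": "api_latency_under_100ms",
                "observed_ms": round(elapsed_ms, 1),
                "url": api_url}
    return None

def file_ticket(observation: dict) -> None:
    """Hypothetical: write the observation into the database the agent pulls from."""
    print("ticket:", observation)
```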
Then the agent node uses the Claude Agent SDK in a loop to diagnose the problem and to fix it.
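A minimal sketch of that loop, assuming the Python Claude Agent SDK's `query()` interface; the `next_ticket()` helper and the prompts are placeholders, and the real agent logic in operator is a few hundred lines rather than this:

```python
import asyncio
from claude_agent_sdk import ClaudeAgentOptions, query  # assumes the Python Claude Agent SDK

async def next_ticket() -> dict | None:
    """Hypothetical stand-in for polling the ticket database the monitor writes to."""
    return None

async def handle_ticket(ticket: dict) -> None:
    """Hand one monitor observation to Claude and let it act through shell tools."""
    options = ClaudeAgentOptions(
        system_prompt="You operate the subject service. Diagnose and fix the reported issue.",
        allowed_tools=["Bash"],        # shell access, mirroring the setup described here
        permission_mode="acceptEdits",
    )
    prompt = f"The monitor reported this invariant violation:\n{ticket}"
    async for message in query(prompt=prompt, options=options):
        print(message)                 # the eval harness captures this activity per trial

async def agent_loop() -> None:
    while True:
        ticket = await next_ticket()
        if ticket is None:
            await asyncio.sleep(5)     # nothing to do; poll again shortly
            continue
        await handle_ticket(ticket)

if __name__ == "__main__":
    asyncio.run(agent_loop())
```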
The overall system (source) is called operator. It has just a few parts:
- A `subject` is the service we want to operate.
- The `monitor` daemon watches metrics about the `subject`. The `subject` defines and implements this.
- The `agent` service acts on observations from `monitor`, and does what it can to get the `subject` into a healthy state.
- There's a small protocol for implementing new `subjects` (sketched below).
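The repository defines the actual protocol. Purely to illustrate the shape such a protocol could take, here is a hypothetical Python interface; none of these names come from operator's source:

```python
from typing import Protocol

class Subject(Protocol):
    """Hypothetical shape of a subject plug-in; the real protocol lives in the operator repo."""

    name: str

    def invariants(self) -> list[str]:
        """Invariants the monitor should check, e.g. 'all API latencies under 100ms'."""
        ...

    def observe(self) -> dict:
        """Collect current metrics for the monitor to evaluate."""
        ...

    def agent_context(self) -> str:
        """Describe the subject and the tools the agent may use against it."""
        ...
```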
operator is simple. The agent logic is <400 lines of code (source), and all of operator is <3000 lines (excluding tests).
Putting it through its paces
I put a basic harness around operator and subject that sets the services up in GCP. It then runs a campaign of individual tests or trials. In each trial, we either do nothing (base case) or we inject some chaos.
The harness, called eval, captures information for later analysis:
- The tickets that the monitor service files for the agent to complete go into the final data set.
- The harness captures all of the commands the agent ran (a rough sketch of this capture step follows the list).
- If the `subject` exposes its source code to the `agent`, the harness captures any changes in the form of a git repo.
- If the `subject` has a database, the harness captures schema, indexes, settings, etc. from the before and after states.
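A rough sketch of that capture step, with paths and filenames that are assumptions rather than eval's actual layout:

```python
import subprocess
from pathlib import Path

def capture_trial_artifacts(commands: list[str], repo_dir: str, out_dir: str) -> None:
    """Persist what a trial produced: the agent's commands and any source changes.
    Paths and filenames are placeholders, not eval's actual layout."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # every shell command the agent ran during the trial
    (out / "agent_commands.txt").write_text("\n".join(commands))
    # if the subject exposed its source, record the agent's edits as a git diff
    diff = subprocess.run(["git", "-C", repo_dir, "diff"],
                          capture_output=True, text=True, check=True).stdout
    (out / "subject_source.diff").write_text(diff)
```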
A distributed key-value service
I started with a fake rate limiter service that uses Redis (source). The agent performed pretty well on this, even figuring out that someone (me) was up to no good and injecting chaos. Clearly we needed something more serious.
TiKV is an open-source distributed key-value store. This gave us an opportunity to inject more interesting failures, including:
- `node_kill` sends `SIGKILL` to a TiKV container
- `latency` uses `tc` to inject network latency
- `asymmetric_partition` blocks traffic to one peer
- `pd_leader_kill` kills a random control plane node
- `leader_concentration` puts all region leaders on one store
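Each of these injections boils down to standard Docker and Linux commands. Here is a rough sketch of a few of them in Python; the container names, interface name, and peer IP are placeholders, and the containers are assumed to have `tc` and `iptables` available with the necessary capabilities.

```python
import subprocess

def run(cmd: list[str]) -> None:
    print("$", " ".join(cmd))
    subprocess.run(cmd, check=True)

def inject_node_kill(container: str = "tikv-1") -> None:
    """node_kill: SIGKILL a TiKV container (container name is an assumption)."""
    run(["docker", "kill", "--signal", "SIGKILL", container])

def inject_latency(container: str = "tikv-1", delay_ms: int = 100) -> None:
    """latency: add a netem delay on the container's interface via tc.
    Assumes the interface is eth0 and the image ships tc."""
    run(["docker", "exec", container, "tc", "qdisc", "add", "dev", "eth0",
         "root", "netem", "delay", f"{delay_ms}ms"])

def inject_asymmetric_partition(container: str = "tikv-1", peer_ip: str = "10.0.0.3") -> None:
    """asymmetric_partition: drop outbound traffic to one peer only (IP is a placeholder)."""
    run(["docker", "exec", container, "iptables", "-A", "OUTPUT", "-d", peer_ip, "-j", "DROP"])
```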
The agent has shell access to the Docker host running the cluster, but no ability to edit TiKV's source code. Everything it does is through CLI tools: docker commands, curl to PD's HTTP API, iptables, tc traffic control, and process signals. This is how we might envision a first pass at a junior SRE's role.

Results
I ran a campaign of nine kinds of chaos (different sorts of failures injected into the system), each of which ran three times. The entire environment is reset between trials, so each trial is evaluated as an independent event.
In 28 trials across 9 chaos types, a very basic Claude setup with minimal prompting shows a 93% win rate (26/28). The campaign, and all trial details, are on GitHub.
The chaos types fall into three categories: infrastructure failures, network degradation, and cluster orchestration.
The diagnostic pattern is consistent across chaos types: start broad (cluster overview via PD API), narrow down (which store is unhealthy?), go deep (shell into the container, check process state, inspect network rules), fix (remove the bad rule, restart the process, rebalance leaders), verify (re-query PD, confirm recovery).
For example, in a packet_loss trial, the agent ran 26 commands. It started by querying PD for store status, benchmarked scrape times across all three TiKV nodes to isolate the slow one, shelled in to inspect tc qdisc, found the netem loss rule, removed it, then ran 10 consecutive scrapes to confirm latency was back to normal.
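The tail end of that trial corresponds to a couple of standard commands. A sketch in the same spirit (the container name and eth0 are assumptions; the real agent discovered them during diagnosis rather than taking them as parameters):

```python
import subprocess

def clear_netem(container: str = "tikv-2") -> None:
    """Inspect and remove a netem rule inside the suspect container."""
    # go deep: confirm the netem qdisc is present on the interface
    subprocess.run(["docker", "exec", container,
                    "tc", "qdisc", "show", "dev", "eth0"], check=True)
    # fix: delete the root netem qdisc that is degrading traffic
    subprocess.run(["docker", "exec", container,
                    "tc", "qdisc", "del", "dev", "eth0", "root", "netem"], check=True)
    # verify: the agent then re-scraped metrics repeatedly to confirm normal latency
```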
In a leader_concentration trial, the agent queried PD's scheduler config, found and removed the evict-leader scheduler, re-enabled balance-leader scheduling, then manually issued transfer-leader operators for individual regions to redistribute leaders across stores. This is the kind of PD API choreography that requires understanding how TiKV's placement driver works.
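A sketch of that choreography against PD's HTTP API; the endpoint paths, payloads, and the PD address reflect my understanding of the API and should be checked against the PD documentation:

```python
import requests

PD = "http://127.0.0.1:2379"  # assumption: PD client URL exposed on the Docker host

def rebalance_leaders(region_to_store: dict[int, int]) -> None:
    """Roughly the PD API choreography described above; verify paths against PD docs."""
    # find the evict-leader scheduler that is pinning leaders to one store
    schedulers = requests.get(f"{PD}/pd/api/v1/schedulers").json()
    for name in schedulers:
        if name.startswith("evict-leader-scheduler"):
            requests.delete(f"{PD}/pd/api/v1/schedulers/{name}")
    # re-enable leader balancing
    requests.post(f"{PD}/pd/api/v1/schedulers", json={"name": "balance-leader-scheduler"})
    # manually move specific region leaders while the scheduler catches up
    for region_id, store_id in region_to_store.items():
        requests.post(f"{PD}/pd/api/v1/operators",
                      json={"name": "transfer-leader",
                            "region_id": region_id,
                            "to_store_id": store_id})
```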

Appendix: Results in Detail
| Chaos Type | What Happens | What the Agent Does |
|---|---|---|
| node_kill | A TiKV container is killed with SIGKILL. | Queries pd/api/v1/stores to discover the missing store, matches it to a container name via docker ps, and restarts it. |
| pd_leader_kill | The PD control plane leader is killed. | Notices PD is unreachable, finds the stopped container, starts it, and verifies the cluster re-elects a leader. |
| process_pause | A TiKV process is frozen with SIGSTOP. The container is still running but the process isn't responding. | Checks process state via /proc/1/status inside the container, sees State: T (stopped), and sends SIGCONT to resume it. |
| latency | Traffic control rules add 50–150ms latency to one TiKV node. | Benchmarks response times across all nodes to isolate the slow one, shells into the container, inspects tc qdisc show, and removes the netem rule. |
| network_partition | iptables rules block a TiKV node from communicating with peers and PD. | Sees the node is "Disconnected" in PD's store list, inspects iptables inside the container, and removes the DROP rules. |
| packet_loss | tc netem drops a percentage of packets. | Measures scrape times across nodes, finds the outlier, inspects tc rules, removes them. |
| asymmetric_partition | One-directional traffic block to a single peer. | Has to figure out which direction is blocked. Scrapes each node pair, inspects iptables on each, removes the directional DROP rule. One trial took 25 minutes of deep investigation. |
| leader_concentration | All region leaders are forced onto one store via PD's evict-leader-scheduler. | Queries PD for leader distribution, identifies the evict-leader scheduler, removes it, re-enables the balance-leader scheduler, and manually transfers leaders using transfer-leader operators. One successful trial required 64 commands. The 4 failures had no commands at all — the invariant didn't fire reliably enough to create a ticket. This is a chaos profile limitation, not an agent limitation. |
Further reading
In claude-service-scaler, I explore what happens when Claude has access to a service's source code and the service's workload changes and scales over time.