The Dark Factory
I've been addicted to building recently. This happens when I have a clear line of sight. It happened with Overstory, and it's happening again with Warren. Warren has become a complete dark factory in the past few weeks. It's exceeding my expectations and reframing how I understand agents. Nowadays, agents are incredibly atomic. They're ephemeral, disposable, and hyper-focused. I've been operating that way for a while, but Warren has brought me fully out of the loop. I'd wager I'm interacting manually with 5-10% of the agents I spawn these days.
This works because the environment they are operating within supports autonomy to an extreme degree. Warren's agents can't do anything unless they do it correctly. Linting, testing, typechecking are all table stakes. I've now added sizing limits to files, performance benchmarking, nightly security and bug reviews that auto-heal the codebase while I sleep. It's a well oiled machine. An incredibly effective dark factory.
The reason it works is the environment the agents run inside. I've been thinking about this as agentic readiness. The version that actually matters comes down to a few things: every standard is mechanized instead of written down somewhere, every change is verifiable in seconds, and every action is reversible. Warren is a system that ships to production on its own and keeps getting better at shipping. Agents were never a problem you solve with more documentation. They're an environment problem. Get the environment right and they stop needing to be told anything.
The Wrong Roadmap, Flawlessly, Forever
As always, I'm looking towards the future. Since I first booted an LLM in my CLI over two years ago, I've been haunted by the feeling that we're nowhere close to the ceiling of what this technology can do. I feel that now more than ever. Warren, and the systems cutting edge companies are running, have opened my eyes again to what's coming. It blows my mind that we haven't seen a fast takeoff yet, but I'll save that thought for another day.
So I'm asking myself the same question I always do. What's next? If I have a system that executes flawlessly, what's left for me to do? Do I even interact with it anymore?
The Sixth Layer: Judgment
In my last post, I broke long-horizon agent state into five layers: execution, memory, intent, coordination, and orchestration. Warren implements all five and it's exactly why it runs as well as it does. But sitting here now, I think I missed one. There's a sixth layer, and it's the one that actually answers the "what's next" question.
Nothing checks whether the work was worth doing. The loop ends at "PR merges, next agent inherits the memory." Nothing watches what happens after the merge. Did it ship value, did it get reverted, did later work build on it or route around it? Warren can execute the wrong roadmap, flawlessly, forever.
So the sixth layer is judgment. Human judgment, of course, but also agent judgment. I as a human provide my judgment towards the aspects of the system which I'll be interacting with. Agents provide their judgment in the layers which I am not. For example, take agent performance. In the Warren database, I currently have 400k+ entries of agent logs from the past two weeks. That's too much for me to look at. It's not too much for agents. They can parse these logs, figure out what's blocking agent runs, where context is being wasted, where the prompts could be tweaked, etc. And I trust them to do so. I don't need to be involved in that layer of the system ever again if I don't want to be. And that's just one small layer.
As I provide my judgement to the surfaces which I'm interacting with, agents will remember my preferences. Using telemetry on the frontend, they can examine my usage patterns and literally build the system around how I use the system. I provide my taste by communicating it manually, but also passivley via simply using whats been built.
My concern is moving past cost-per-PR. Instead, it is how fast can the system get better at deciding what to build.
From Factory to Organism
This is the jump from a factory to something else. A factory just executes an order book. What I'm looking for both decides the order book and learns from the market. That's not a factory anymore. The honest word is organism: something that adapts to its environment instead of just running the line. That word breaks the whole factory frame I've been building on, and I think that break is the point. The thing that comes after a factory can't be named in the factory's vocabulary. So I'll let the metaphor break instead of patching it.
The cleanest way I can put it: a factory multiplies my labor across a thousand runs. What comes next compounds my taste across them. Labor spreads flat; taste accumulates and feeds on itself. It's linear versus exponential. Taste gets captured cheaply, accumulated durably, and re-applied automatically. That triad is the whole rest of this post.
Telemetry, Human Ratification, and Executable Criteria
So how do you actually feed outcomes back in? I think there are three tiers of signal, and they're not equal.
First, telemetry. Free, noisy, observable straight from git and the event log with zero effort on my part. PR merged, reverted within a week, CI stayed green, did anything get built on top of it. The trap is that it's gameable on its own. Optimize telemetry alone and agents learn to ship tiny, safe, reversible changes that move the numbers without being worth much.
Second, human ratification. This is ground truth. I mark developments with an outcome: worked, partial, or failed. That's the supervised label that keeps the telemetry honest. In the agentic networks post I called myself a heavily-weighted node in the graph. This is that weight made concrete. My ratification is the one signal the rest of the system bends around.
Third, executable success criteria. This is the real closed loop. The linting, benchmarks, and size ratchets I mentioned up top are already this, I just never thought of them as a value signal. They're criteria written as predicates the system can actually re-run after the merge. A unit of work with no measurable criterion is one I can't learn from.
Blast Radius × Reversibility × Novelty
You'd think running on telemetry alone is the risky tier. It's the opposite. At the volume Warren operates, I physically can't review everything. So human-only review is the blind spot. I, as the human, will become the weak link in these scenarios. Failures pile up unseen because the only sensor (me) can't keep up. Telemetry is a safety instrument, not a risk.
So what gates autonomy isn't how "important" something feels in the abstract. It's blast radius × reversibility × novelty. Low-blast, reversible, recurring work runs dark. The agent detects it, fixes it, and I never see it. High-blast, irreversible, novel work waits for me.
And the boundary isn't fixed. Every time I ratify or override, I'm labeling "this bucket was safe to hand off." So the line moves outward on its own, per-domain. The rule it settles into is simple: measurable work earns the right to run dark, unmeasurable work earns my attention. The predicate is the license.
The Prompt Dies, Dispatch Dies
Once you take all that seriously, the two fundamental aspects of agent interaction just die.
The prompt dies. Durable intent lives somewhere persistent, and a run derives its own prompt from that intent plus everything the system has accumulated in the past. I stop writing prompts and start curating intent. The entire process from intent -> impelementation collapses.
Dispatch dies too. There's no "spawn a run" anymore. There's "advance this piece of work." Manual, scheduled, and plan-based sequential agent runs all collapse into one way of moving a thing forward. The atomic unit was never the run. The run is just a cog in the machine. If I follow that far enough, the durable unit isn't even a single intent. It's the shape of an intent. The kind of work judged a certain way, that I can reuse and weight by how past versions of it turned out. Any given pursuit is just one instance. I'm not certain about this part yet, but it's what I'm building next.
The Canvas
If intent is the atom, the way I author intent matters more than anything. Right now I fill out structured fields: goal here, constraints there. That extracts labor from me, which is backwards.
The inversion: I brain-dump onto a blank canvas, the system distills the structure out of it (goal, constraints, candidate predicates) and hands it back as an editable projection. Then I correct it. That correction is the highest-quality taste signal in the whole system. It's literally the gap between what I said and what I meant. A blank canvas reads my taste better than a form ever could, because a form only captures what I already knew to write down.
A Single Operator
I should be upfront about something that makes all of this work: Warren is for me. Not for a market, not for users, for me. It's open source, but it is not a product and it's not built for other people. The human in the loop is always me, the taste is my taste. This is the answer to the obvious objection, the one about whether any of this needs to generalize. It doesn't. Compounding one person's taste is the entire point. I get to make opinionated single-operator choices instead of building junk I try to sell.
The Organism Learns
So that's where my head is. Three posts now, and they line up into an arc. The agentic networks post put me in the network as a node, the long-horizon state post gave agents state that outlives a run, and this one gives the system judgment about its own output.
The factory made the work cheap. This next thing makes the work worth doing.