Agents in the Workshop, Not in the Product

Agent capability keeps going up.

And yet, when I look back at what I’ve actually shipped, there’s something I have to admit: I have never built a product with an agent inside it.

That’s the gap I want to start writing about.

Two very different meanings of “agents in the loop”

When people say “agents in the loop,” they usually mean one of two things, and they’re very different problems.

1) Agents in the loop of how the product gets built. This is the world I live in every day. Claude Code writes code while I review. An LLM drafts translations for this blog. Another one helps me think through architecture. The building process has agents threaded through it.

Claude Code, honestly, is one of the best agent-in-the-loop products I’ve ever used. An agent really is in the loop there; it’s just the loop of my work, not the loop of anything I ship.

2) Agents in the loop of how the product runs. This is the world where, after you ship, agents are still in there: reasoning, deciding, acting on behalf of users, in production.

The first one is already transformative. The second one—at least for me—has barely started.

What I’ve actually shipped

Let me be concrete about my own track record.

  • ai-sota-feed-bot was the first product I built using Claude Code. It’s a static feed generator. The pipeline was built with agents, but what users interact with is pre-rendered content. No agent in the runtime.
  • This blog is the same story. Agents polish, translate, and adapt every article. But the reader lands on plain HTML.

My agent usage has been heavily biased toward the workshop. Not the product.

I don’t think it stays this way

I’m pretty sure the future is multiple agents acting on behalf of humans—scheduling, filtering, negotiating, summarizing, tutoring, intervening. Not as demos. As load-bearing parts of real products that people depend on.

This is just the beginning.

A concrete early example: AWS recently described a DevOps agent that autonomously investigates production incidents and drives the response. That’s an agent sitting directly in a live operations loop—not in the workshop.

And the surface area we’re about to add to our society is enormous: agents making decisions, agents spending money, agents talking to other agents. I don’t think most of us—myself very much included—know how to build for that yet.

Why it’s still hard to ship an agent in a product

Here’s the trap I keep seeing (and falling into).

You wire up an agent. The first demo is seductive. It does something that looks like reasoning. You’re impressed. Whoever you show it to is impressed.

Then you try to make it real, and the cracks show up fast:

  • Outputs are overbuilt and over-generalized. Plausible-sounding, but not anchored in what the user actually asked for.
  • No verification. The system has no way to know whether it did the right thing. Neither do you, once there are more than ten requests a day.
  • Cost is invisible until the bill arrives. Prompt length, retries, context size, tool loops—cost moves non-linearly in ways a normal dashboard doesn’t catch.
  • Performance is invisible too. Tail latency, partial failures, timeouts, model regressions—none of this shows up naturally in how we usually monitor products.
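
To make the cost and latency points concrete, here’s a minimal sketch of what “making them visible” can look like: a meter wrapped around every model call. Everything in it is illustrative; the per-token prices, the word-count token proxy, and the `fake_model` stub are stand-ins, not real API numbers.

```python
import time
from dataclasses import dataclass, field


@dataclass
class CallRecord:
    latency_s: float
    input_tokens: int
    output_tokens: int


@dataclass
class AgentMeter:
    # Illustrative per-token prices, NOT real pricing for any provider.
    price_in: float = 3e-6
    price_out: float = 15e-6
    records: list = field(default_factory=list)

    def track(self, model_fn, prompt: str) -> str:
        """Run one model call and record its latency and (rough) token counts."""
        start = time.perf_counter()
        output = model_fn(prompt)
        self.records.append(CallRecord(
            latency_s=time.perf_counter() - start,
            input_tokens=len(prompt.split()),    # crude word-count proxy for tokens
            output_tokens=len(output.split()),
        ))
        return output

    def total_cost(self) -> float:
        """Sum the estimated spend across every recorded call."""
        return sum(r.input_tokens * self.price_in +
                   r.output_tokens * self.price_out
                   for r in self.records)


# Stub standing in for a real LLM call.
def fake_model(prompt: str) -> str:
    return "echo: " + prompt


meter = AgentMeter()
meter.track(fake_model, "summarize this incident report")
print(len(meter.records), meter.total_cost())
```

It’s trivial, but it’s the shape of the thing: once every call is recorded, “cost moves non-linearly” becomes a number on a chart instead of a surprise on the bill.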

These aren’t nice-to-haves. They’re the difference between “we have an agent somewhere in our product” and “customers can actually depend on this.”

And right now, there is no easy way to assemble all of that. You stitch fragments together—eval tools, prompt registries, tracing, cost meters, guardrails—and hope the seams don’t leak.

Without those components, an agent system is nowhere close to production-usable.

Agents need to become like any other machine system

Every machine system we’ve previously shipped into society—payments, search, messaging, maps, recommendation—had to become trustable and consistent before it became infrastructure.

It wasn’t enough for them to be smart. They had to become:

  • Bounded — you know what they will and won’t do
  • Observable — you can see what they’re doing and what they cost
  • Recoverable — when they fail, something sensible happens
  • Attributable — you can tell why a decision was made
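
As a sketch of what those four properties can mean in code, here is a toy wrapper (every name in it is hypothetical) that enforces a tool allowlist (bounded), logs each decision with its stated reason (observable, attributable), and degrades to an explicit fallback instead of crashing (recoverable).

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class BoundedAgent:
    # Bounded: only tools in this allowlist can ever run.
    tools: dict[str, Callable[[str], str]]
    # Attributable: every decision is logged with the reason behind it.
    log: list = field(default_factory=list)

    def act(self, tool_name: str, arg: str, reason: str) -> str:
        if tool_name not in self.tools:
            # Out-of-bounds request: explicit refusal, recorded, no crash.
            self.log.append((tool_name, reason, "refused"))
            return "refused: tool not allowed"
        try:
            result = self.tools[tool_name](arg)
            self.log.append((tool_name, reason, "ok"))
            return result
        except Exception:
            # Recoverable: a failing tool yields a sensible fallback.
            self.log.append((tool_name, reason, "failed"))
            return "fallback: tool failed"


agent = BoundedAgent(tools={"echo": lambda s: s.upper()})
print(agent.act("echo", "hi", reason="demo"))       # allowed tool runs -> "HI"
print(agent.act("delete_db", "x", reason="demo"))   # not in the allowlist -> refused
```

A real system needs far more than this, of course. But the point stands even at toy scale: none of these properties come from the model. They come from the harness you build around it.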

Agents aren’t there yet. Not because the models can’t—because the systems around the models mostly don’t exist at product-grade yet.

If agents are going to live alongside us in the same way payments and search do, this is the layer that has to get built. Consistent outputs. Trustable behavior. Observable cost. Recoverable failures.

Not “smart.” Dependable.

Why this section exists

So here’s what Agents in the Loop is for.

It’s not about using agents to code faster. That belongs in the other sections, and I’ll keep writing about it.

This section is about what happens when I try to move an agent from the workshop into the product itself: what breaks, what’s load-bearing, what the production-grade pieces actually look like, and what I learn trying to build them.

I don’t have clean answers yet. I have a clear sense of the gap, a few hypotheses, and a strong hunch that this is where the next era of product building gets figured out.

Let’s find out.