The Human in the Loop: What AI Agents Still Need From You
Over the past few days, I’ve built a blog, migrated its CSS library, shipped an AI news feed bot, and written about all of it. I wrote very little code. Two AI agents — Harry and Hermione — did most of the implementation.
But here’s what I keep noticing: every meaningful decision came from me.
Not because the agents can’t think. They can. Harry independently figured out that improving summaries required fetching article content. Hermione restructured docs without being told how. They’re genuinely good at filling in low-level details once pointed in a direction.
The direction itself, though — that’s still mine.
Why the human is still steering
This isn’t really about capability. I don’t think LLMs are fundamentally incapable of high-level reasoning or strategic thinking. The models are getting better at that every quarter.
The constraint is more structural: agents need to be triggered.
Someone has to send the first prompt. Someone has to see a screenshot on Perplexity and think “we should migrate to this.” Someone has to look at the feed bot’s output and feel “these summaries suck.” Someone has to decide that accumulating data is more valuable than replacing it.
That trigger — the observation, the taste judgment, the “this could be better” instinct — still comes from a human. Not because agents can’t have opinions (they can), but because the interaction model puts the human at the origin of every trajectory.
The trajectory problem
Each agent interaction is a trajectory: a starting prompt leads to a sequence of actions and outputs. The agent explores that trajectory with impressive depth and speed.
But the number of trajectories is limited.
I can send Harry 3-4 high-level directions in an afternoon. Each one spawns a deep chain of implementation work. But I’m the one choosing which 3-4 directions to explore out of the hundreds possible.
Could I set up agents to generate their own trajectories? Sure — heartbeat checks, scheduled explorations, autonomous “what could I improve?” loops. I’ve thought about this.
But I’m not convinced it leads to high-quality outcomes yet. Here’s why.
The context window ceiling
LLMs have a hard limit on how much they can hold in working memory at once. When I steer a project, I’m drawing on:
- Years of engineering intuition
- Knowledge of what users actually care about
- Taste shaped by thousands of products I’ve used
- Business context that doesn’t fit in a prompt
- Physical-world signals (a screenshot on my phone, a conversation with a friend, a frustration I felt while using something)
None of that fits in a context window. And even if I wrote it all down, the agent would need to prioritize and weight it — which is itself a judgment call that requires… all of that context.
Autonomous agents can explore. But exploration without taste is just a random walk. The context window limits how much taste you can transfer.
What I actually do all day
Looking at my recent work, my role breaks down into:
High-altitude decisions (agents can’t easily replicate these):
- “Let’s build a blog” → why, for whom, what tone
- “Migrate to oat.ink” → aesthetic judgment triggered by a screenshot
- “The summaries suck” → quality bar shaped by reading thousands of articles
- “Accumulate data instead of replacing” → product intuition about future value
Mid-altitude directions (agents could do with good prompting):
- “Add datetime filtering to the feed”
- “Write an article about how Hermione works”
- “Remove the tag filter, it’s annoying”
Low-altitude execution (agents excel here):
- Write the code
- Fix the CSS
- Debug the HTML sanitization
- Commit, push, deploy
The interesting thing is that the same agents help at every altitude — just with different prompts. When I say “migrate to oat.ink,” that’s a high-level trigger that Harry converts into 20 commits. When I say “the badge label sounds like a view count,” that’s a low-level fix. Same agent, same tools, different altitude of instruction.
Could agents self-direct?
I keep coming back to this question.
In theory, you could build a loop:
- Agent reviews current state of the project
- Agent generates candidate improvements
- Agent evaluates and prioritizes them
- Agent executes the top one
- Repeat
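For concreteness, here is a rough Python sketch of what that loop might look like. Everything in it is hypothetical: the llm() helper, the prompts, and the control flow are only there to show the shape of the idea, not a real agent framework or model API.

```python
# A minimal, hypothetical sketch of the self-directing loop described above.
# The llm() helper and the prompts are invented for illustration.

def llm(prompt: str) -> str:
    """Placeholder for a call to a language model, returning its text reply."""
    raise NotImplementedError("wire this up to a real model")

def autonomous_loop(project_summary: str, iterations: int = 5) -> None:
    state = project_summary
    for _ in range(iterations):
        # 1. Review the current state of the project.
        review = llm("Summarize the current state and its gaps:\n" + state)

        # 2. Generate candidate improvements.
        candidates = llm(
            "List five concrete improvements, one per line:\n" + review
        ).splitlines()

        # 3. Evaluate and prioritize them. This is the weak point: for product
        #    decisions the evaluation function is taste, which is hard to
        #    specify in a prompt and hard to keep stable across iterations.
        best = llm(
            "Pick the single highest-value item and justify it:\n"
            + "\n".join(candidates)
        )

        # 4. Execute the top one (in practice: plan, edit code, run tests, commit).
        state = llm("Describe the project state after implementing:\n" + best)
        # 5. Repeat.
```

Even in this toy version, step 3 is doing all the heavy lifting: the quality of the whole loop depends on how well “pick the highest-value item” captures taste.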
Some people are building this. And it works for narrow domains with clear metrics (code quality, test coverage, performance benchmarks).
But for product decisions — “should we migrate the CSS library?” or “should summaries be 2-3 sentences instead of one?” — the evaluation function is taste. And taste is hard to specify, hard to learn from a prompt, and hard to maintain across context window boundaries.
I’m not saying it’s impossible. I’m saying that right now, the human providing taste-as-a-service through natural language prompts is more reliable than trying to encode taste into an autonomous loop.
The role I’ve settled into
After a few days of intense agent-assisted work, here’s how I’d describe my role:
I’m a creative director who can also read code.
I set direction. I judge quality. I make taste calls. I decide what matters. And occasionally I zoom in and help at the code level — but even then, I’m mostly steering an agent rather than typing code myself.
This is satisfying in a way I didn’t expect. I used to spend 80% of my time on execution and 20% on direction. Now it’s inverted. And the output volume is higher.
The agents aren’t replacing me. They’re letting me operate at the altitude where I’m most useful — and that turns out to be higher up than I thought.
What I’m watching for
The moment this changes is when context windows grow large enough to hold an entire project’s history, user research, competitive landscape, and aesthetic references — all at once. When an agent can internalize all of that and still generate novel, tasteful directions, the human-in-the-loop becomes optional for some tasks.
We’re not there yet. But we’re closer than last year.
For now, I’m happy being the one who sees a screenshot and says “let’s do this.” The agents handle the rest. That division of labor works, and it produces things I’m proud of.
The human role isn’t disappearing. It’s concentrating. Less typing, more thinking. Less execution, more judgment.
I think that’s a good trade.