The Spec That Nearly Shipped the Wrong Thing

Six weeks into a contract at a scale-up running a document verification platform, Maya was ahead of schedule and confident. She had a strong brief, a capable agent, and a clear sense of what the next sprint needed to deliver. By Friday afternoon, she had working code. By Monday morning, she had a problem.

The agent had done exactly what she had asked. That was the issue.

What the brief said and what the agent heard

The task was to build a retry mechanism for failed document uploads, a feature that should have been straightforward. Maya's spec described what the feature needed to do: detect upload failures, queue them for retry, and surface status updates to the user. It was clear, well-intentioned, and about three decisions short of being complete.

The agent built the feature. It also made its own choices about retry intervals, failure thresholds, and how to handle timeouts that coincided with the user closing the browser tab. None of those choices were wrong, exactly. They were just not the choices the team would have made if anyone had asked.

When Maya reviewed the output, the code was clean and the logic was coherent. But it had diverged from the behaviour the product team expected at precisely the points where the spec had been silent. Undoing it wasn't catastrophic, but it cost most of a day and a refactor that shouldn't have been necessary.

She wrote a short post-mortem, mostly for herself. The central line read: "I wrote a description. I needed to write an instruction set."

“The agent had done exactly what she'd asked. That was the issue.”

Rewriting the spec

The second version of the spec for that feature looked very different from the first. Not longer, necessarily, but more precise in its decisions.

Where the first version had said "surface status updates to the user," the second specified the states the UI needed to show, the sequence in which they could appear, and what the agent should do if a state transition was ambiguous. Where it had been silent on retry intervals, it named them and explained the constraint they were working against: a third-party API with rate limits that the agent had no way of knowing about.

The change that made the most difference was the simplest one: Maya added a line to each task that specified the done condition. Not "implement retry logic," but: "Retry logic is complete when: failed uploads are queued within 200ms of failure detection, the retry interval follows the specified backoff sequence, and the UI shows distinct states for queued, retrying, and permanently failed." The agent had a target to hit, not a direction to move in.

She also added two explicit pause points, places where the agent should stop and wait for review rather than continue to the next task. The first was after the queueing logic was in place, before anything touched the UI. The second was after the first full integration with the third-party API. Both were points where a wrong decision would compound. Both were places where fifteen minutes of human review was worth more than continued agent progress.

“An agent that hits genuine ambiguity will default to continuing. That default is not always wrong, but it should be a choice, not an accident.”

The decomposition question

One of the things Maya adjusted was the size of the tasks she handed to the agent. Her initial instinct, carried over from writing sprint tickets for human developers, was to create tasks at the level of a feature or a sub-feature. That level of granularity works when the developer reading the ticket can ask questions, notice problems, and exercise judgement in the gaps.

It does not work well when you need to inspect the output before moving on.

The retry mechanism ended up as nine tasks rather than two. Each one had a clear input, a clear output, and a done condition that could be checked without reading into the next task. This felt like over-engineering until Maya ran the second version and found that catching a misalignment at task four, before it had propagated into tasks five through nine, was qualitatively different from catching it at the end.

The grain of the work had to match the grain of the verification. Once she understood that, the decomposition decisions became easier.

Where the agent ran and where Maya didn't let it

Not every task got the same level of oversight. This was a deliberate choice, not an afterthought.

The low-ambiguity, easily reversible work — writing unit tests, generating the API client, building the queue data structure — ran with minimal review. If the output was wrong, it would be immediately obvious and quick to fix. The agent could move fast and Maya could scan the output rather than study it.

The tasks touching user-facing state, error handling, and the integration boundary got a human checkpoint before the next task could begin. These were places where a wrong decision would take longer to surface and longer to fix. They were also the places where the business logic was most entangled with context the agent didn't have: the product team's instincts about error messaging, a known edge case in the upstream API that wasn't documented anywhere publicly.

Maya's rule, refined over several engagements, was this: anything easily reversible can run; anything that touches a boundary between the system and the user, between the system and a third party, between this task and a downstream dependency, gets a checkpoint. It is not a sign of distrust in the tools. It is the thing that makes running them reliably possible.

“The developers getting the most from agentic tools have made explicit decisions about oversight at the task level — not a blanket policy, but a considered choice per type of work.”

What the team took from it

By the end of the contract, the way Maya planned work had become a reference point for two other developers on the team. Not because she had formalised it into a process, but because the difference in output quality was visible enough to prompt questions.

One developer described sitting down to write a spec and realising, halfway through, that he was writing a description rather than an instruction set. He stopped, rewrote from the beginning, and the resulting implementation needed a single round of review rather than three.

That shift, from describing what needs to exist to specifying how the agent should behave while building it, is not a tool skill. It is not something that comes from learning a new product or reading the documentation. It is a habit of mind, and it develops through direct experience of what happens when it is absent.

The developers who work most effectively with agentic tools are not necessarily the ones with the most AI experience. They are the ones who have learned to plan with more precision: about what the agent needs to know, where the decision points are, and where a human still needs to be in the loop. That precision shows in the output. And it shows up in the first conversation, before any code has been written.