
The new bottleneck isn't productivity. It's verification.

05 March 2026

Over the past year I've written about how AI is restructuring the economics of delivery: the Discover phase, the Decide phase, and most recently the Deliver phase, where I argued that AI collapses the cost of experimentation so dramatically that the bottleneck shifts from execution to judgement.

That argument still holds. But running it in practice raises an obvious next question, one I've been sitting with since: if build-to-learn is the model, how do you know whether what you learned is right?

I was part of a panel discussion this morning on AI trust. The conversation kept surfacing the same gap I’ve been noticing in my own work: organisations have largely solved the productivity problem. The constraint that replaced it — verification — is one almost nobody has a serious answer for yet.

The scale of the problem is measurable. Vectara's Hallucination Evaluation Model (2024) tested leading LLMs and found hallucination rates ranging from 3% to 27%, depending on the model and task. When organisations multiply their AI-generated output by 10x, even a low error rate means hundreds of incorrect artefacts entering workflows daily: at, say, 10,000 artefacts a day, a 3% rate puts 300 incorrect artefacts into circulation every day.

What we solved, and what replaced it

In an earlier piece on the economics of AI delivery, I argued that AI inverts the cost structure of experimentation. The old world was build-to-ship: high cost per experiment forced fewer attempts, more upfront analysis, and slower learning cycles. AI collapsed that cost, enabling build-to-learn: many parallel experiments, lower stakes per attempt, faster feedback loops.

That shift is real. But it created a successor problem nobody talks about.

When experiments were expensive, you scrutinised them before building. The cost forced the discipline. When experiments become cheap, the scrutiny has to move — it doesn’t disappear. It migrates from before you build to after you build. From gate-based governance to continuous verification.

The question is no longer “can we afford to test this?” It’s “how do we know if the output is right?”

Why this is harder than it sounds

The instinctive answer is “human review.” Add a checkpoint. Keep a human in the loop. This is the right instinct applied at the wrong scale.

If your team is generating ten times more artefacts — more code, more analysis, more documents, more decisions — then human review at the old ratio means ten times the reviewer time. That’s not a governance model. That’s a bottleneck with a different label.

Gary Marcus, NYU professor emeritus and co-author of Rebooting AI (2019, with Ernest Davis), has consistently highlighted this paradox: AI systems can be simultaneously impressive and unreliable. The productivity gains are real; the reliability gaps are equally real and far less visible until something breaks.

Effective verification at AI-native scale requires something different: scaffolding. The ability to structure AI outputs so they can be tested, not just read. Modular, verifiable, auditable. Unit tests for code. Structured outputs with validation logic. Decision chains that surface their own assumptions.
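
To make that concrete, here is a minimal sketch of what "structured outputs with validation logic" can look like in Python. The schema is illustrative, not a standard; the field names statement, source, and confidence are my own. The point is structural: output that doesn't carry its own evidence is rejected before it enters the workflow.

```python
import json
from dataclasses import dataclass

@dataclass
class VerifiedClaim:
    statement: str
    source: str       # where the model says this came from
    confidence: float

def parse_and_validate(raw_output: str) -> list[VerifiedClaim]:
    """Reject model output that doesn't carry its own evidence."""
    data = json.loads(raw_output)  # malformed JSON fails here, loudly
    claims = []
    for item in data.get("claims", []):
        # Every claim must name a source and a plausible confidence;
        # anything unverifiable is an error, not a judgement call.
        if not item.get("source"):
            raise ValueError(f"Unsourced claim rejected: {item.get('statement')!r}")
        confidence = float(item["confidence"])
        if not 0.0 <= confidence <= 1.0:
            raise ValueError(f"Confidence out of range: {confidence}")
        claims.append(VerifiedClaim(item["statement"], item["source"], confidence))
    return claims
```

The failure mode changes shape: instead of a plausible-sounding paragraph a reviewer has to read closely, you get a hard error at the point of generation.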

This isn’t primarily a technology problem. It’s a design problem. And it’s a leadership problem — because the people setting AI strategy in most organisations are still measuring success by output volume rather than output reliability.

The skill atrophy risk is real, but it’s misdiagnosed

There’s a legitimate concern that AI is eroding the skills people use to evaluate AI outputs. If you’ve never written the code yourself, how do you know the code is wrong? If you’ve never structured the analysis, how do you spot when the synthesis is plausible-sounding but incorrect?

This is often framed as an argument against AI adoption. It isn’t. It’s an argument for deliberate AI adoption — one that preserves the evaluative skills while offloading the generative ones.

The analogy to spreadsheets is instructive. Excel didn’t make financial analysts worse at finance. It made them faster, and it made innumerate people dangerous. The organisations that got value from Excel were the ones that understood which problems it was solving and which new risks it introduced.

AI is the same pattern at greater scale and higher stakes. The risk isn’t that AI makes people less capable. The risk is that organisations deploy AI to people who never had the underlying capability to begin with — and don’t realise it until the output has already caused damage.

What leaders need to change

The shift from productivity mindset to verification mindset requires three things:

First, measure differently. Output volume is the wrong metric. Speed-to-verified-output is the right one. If your team ships twice as fast but your error rate doubles, you haven’t improved — you’ve just moved the cost downstream.
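
A toy calculation makes the point. The numbers below are deliberately made up: assume each correct artefact is worth one unit and each incorrect one costs ten units of downstream rework.

```python
def net_value(artefacts: int, error_rate: float,
              value_per_good: float = 1.0, rework_per_bad: float = 10.0) -> float:
    """Net value of a batch once downstream rework is priced in."""
    good = artefacts * (1 - error_rate)
    bad = artefacts * error_rate
    return good * value_per_good - bad * rework_per_bad

print(net_value(100, 0.05))  # baseline: 95 good, 5 bad      ->  45.0
print(net_value(200, 0.10))  # "2x faster": 180 good, 20 bad -> -20.0
```

Twice the throughput at twice the error rate produces more correct artefacts and still destroys value, because the rework lands on someone else's desk later.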

Second, invest in verification infrastructure before it's urgent. Scaffolding, testing frameworks, structured output design — these feel like overhead when everything is working. They become critical the first time something goes wrong at scale. Build the plumbing before you need it. MIT Sloan Management Review and Boston Consulting Group's annual AI research (2024) finds that the majority of organisations fail to scale value from AI, with governance and verification gaps, not technical capability, emerging as the primary barriers.

Third, treat governance as a design constraint, not a checkpoint. The organisations that handle this well don’t review AI outputs at the end of the process. They design the process so that verification is intrinsic — outputs that can’t be verified aren’t allowed into the workflow.
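
Here's one sketch of what that can look like in code; the decorator pattern and the citation check are illustrative choices, not a prescription. The generation step is wrapped so that anything failing its check never reaches the next stage.

```python
from typing import Callable, TypeVar

T = TypeVar("T")

def gated(verify: Callable[[T], bool]) -> Callable[[Callable[..., T]], Callable[..., T]]:
    """Wrap a generation step so its output cannot proceed unverified."""
    def decorator(generate: Callable[..., T]) -> Callable[..., T]:
        def wrapped(*args, **kwargs) -> T:
            output = generate(*args, **kwargs)
            if not verify(output):
                raise ValueError("Output failed verification; blocked from workflow")
            return output
        return wrapped
    return decorator

# Hypothetical policy: a summary must cite its source to enter the workflow.
@gated(verify=lambda summary: "Source:" in summary)
def summarise(document: str) -> str:
    # Stand-in for a real model call.
    return f"Summary of the document.\nSource: {document[:30]}"
```

The design choice is the placement: the check lives in the pipeline rather than in a reviewer's queue, so "can't be verified" and "can't proceed" become the same event.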

The foresight signal

When I look at where organisations are struggling with AI — not the early adopters figuring it out, but the mainstream organisations now midway through implementation — the pattern is consistent. They solved the productivity problem. They didn’t anticipate the verification problem. And they’re now trying to retrofit governance onto systems designed for speed.

This is predictable. It’s also fixable. But fixing it requires naming it clearly: the bottleneck has moved. The new constraint isn’t building. It’s knowing whether what you built can be trusted.

That’s not a technology question. It’s a leadership one.


Tim Robinson is an AI foresight consultant and strategic advisor. He works with executive teams and boards on AI readiness, governance, and the gap between AI ambition and operational reality. More at agilist.co.uk.