Stop Waiting for the AI to Be Good Enough

Key takeaways

The wrong question is 'is the AI good enough yet?' That target is asymptotic and keeps moving. The right question is who on your team is good at managing it.
Managing AI is the same skill as managing a scatterbrained employee. Brief writing, chunk sizing, line-by-line review. It is management work, not technology work.
We have been telling the right people the wrong things for a decade. The detail-oriented brief writers and line-by-line reviewers got demoted in the leadership hierarchy. They are now the most valuable people for AI rollout.
The buyer's version is the no-list. Define what the agent cannot touch before deploying. The yes-list is shorter than the demo suggests.
Companies that put the disciplined manager in charge in 2026 will compound. By month eighteen, the gap over rivals who put their most enthusiastic person in charge will be uncatchable.

The most useful line I’ve read about AI in a year describes mid-2026 agents as “a scatterbrained employee who thrives under careful management.” It is from a research-backed forecast called AI 2027 that took more than a hundred expert reviewers, twenty-five tabletop exercises, and an aggressively concrete imagination to produce. The line is one sentence in a long document. It has reframed every AI tooling decision I have made since.

I keep coming back to it because it replaces the wrong question I see most operators ask.

The wrong question

The wrong question, the one most teams are still asking, is some version of: is the AI good enough yet. That question feels rigorous because it sounds technical. It is actually a delaying tactic. The reliability target keeps moving: the model that was unreliable last year feels about as unreliable this year, because we now ask harder work of it. So the actual decision (deploy or do not) gets postponed indefinitely.

The variations are subtle. Is it reliable enough for production. Has it stopped hallucinating. Can we trust it without a human in the loop. Are the demos finally matching the deployments. When does the curve flatten enough that we can stop reading every output line by line.

These questions feel rigorous because they sound technical. They are actually a delaying tactic.

The reliability target keeps moving. The model that was unreliable last year feels about as unreliable this year, because the bar moved with it: the work we now ask of it is harder. Reliability is asymptotic in a way that keeps the question alive forever, while the actual decision (deploy or do not) gets postponed.

Meanwhile, teams that stopped asking the question and just deployed something are pulling ahead. Their AI is also unreliable. They do not care, because they never expected reliability to be the milestone.

Why does the question persist? Three reasons, in roughly increasing order of importance.

The first is media diet. Most B2B operators read about AI through a mix of vendor marketing, which oversells reliability, and tech-press incident reporting, which under-reports the boring deployments that quietly work. The mental model that emerges is binary. Either it works flawlessly or it goes off the rails on stage. Reality is in the middle, but the middle does not get coverage.

The second is procurement risk. Compliance and legal teams ask “is it safe” because their job is to surface the question that, if not asked, becomes their problem. Until somebody answers the question with confidence, the cautious move is to wait. So the question gets re-asked at every checkpoint, which keeps the deployment perpetually one quarter away.

The third is psychological. Asking “is it good enough” is a way of treating AI as a thing to be evaluated. That feels familiar. We have evaluated software before. We know how to do that. Treating it as a colleague to be managed is a different kind of work. Less familiar, less amenable to the procurement playbook. Reframing AI deployment as an exercise in B2B sales operations rather than a procurement decision is the move that unblocks most teams.

So the wrong question persists because it is comfortable. The right question is not.

The right question

The right question is who on your team is good at managing scatterbrained employees. The “scatterbrained employee” framing answers what the milestone actually is, and the milestone is not a property of the AI. It is a property of the team around the AI. If you have a person who manages this kind of report well, your AI rollout has a lead. If you do not, no amount of model improvement will save you.

Think about what management of a scatterbrained employee actually looks like, day to day. You write briefs that are precise enough that ambiguity cannot sneak in. You break large jobs into small enough chunks that the employee cannot go off the rails between checkpoints. You review the output line by line, in the work itself, not in a status meeting. You learn the failure patterns. The moments where confidence outruns evidence. The convenient summary that elides a missed step. The phrase that sounds smart but does not survive a follow-up question. You watch for these.

This is management work. It looks nothing like the work of choosing the model, configuring the prompt, or wiring up the integration. It looks like the work of being a thoughtful manager of a junior employee who is talented but not yet trustworthy.

What is hard about this is that the management work is invisible from the outside. The brief that prevented the agent from drifting into a wrong tangent does not get a credit line in the output. The line-by-line review that caught a fabricated statistic does not show up on a dashboard. You can only see the absence of the failure that did not happen, and absences are difficult to reward inside an org. Anthropic’s research on alignment makes this point at the model level. The same dynamic happens at the team level.

That invisibility is the actual reason most AI rollouts stall. The work that would have made them succeed is not the work that gets recognized.

The promotion paradox

The people who are best at this management work are the same people we have been telling the wrong things for a decade. The detail-oriented brief writer was told to think bigger picture. The line-by-line reviewer was told to stop micromanaging. Both archetypes got demoted in the corporate hierarchy of “soft skills.” Both are now the most valuable people on your team for AI deployment.

Here is the thing I keep needing to say out loud, because it surprises people every time.

We have been telling the people who are good at this exact skill the wrong things for years.

The detail-oriented brief writer. We told her to think bigger picture. Stop getting lost in the weeds. Trust the team. Delegate.

The line-by-line reviewer. We told him to stop micromanaging. Let people own their work. Step back. Let the talent shine.

In the corporate hierarchy of “soft skills,” both archetypes have been quietly demoted for a decade. They were the people who slowed meetings down with precise questions, who circled a clause in red, who could not help themselves. They got promoted past the work that played to their strengths and into roles they were less suited for, because that is what we believed leadership looked like.

These are now the most valuable people on your team for AI deployment. The skill we asked them to grow out of is the skill we suddenly need.

How did we get here? The leadership-development industry of the 2010s coalesced around a particular narrative. Managers should set vision and trust their teams to execute. The hands-off manager became the ideal. The micromanager became the cautionary tale. Books, frameworks, executive coaches, and HR practices all reinforced the same arc. Move from doer to delegator to coach. Read fewer drafts. Ask better questions instead of editing the answers. Harvard Business Review documented the dominant narrative of “managers as coaches” through that decade.

That narrative was right for the work of the 2010s, when individual contributors were highly capable and what teams needed from a manager was air cover, not red ink. It is wrong for the work of the second half of the 2020s, when a meaningful share of the doers are AI agents who specifically need someone reading every line.

I have done this to people on my own teams. I have promoted the precise person and gently nudged her to think more strategically. I have told the careful reviewer that great leaders do not read every word. I was right in 2019 and wrong in 2025. The world changed under us. My background is in operating roles where the cost of a misread brief was a missed quarter, so I should have known better than to coach away from carefulness. The people I redirected were exactly the ones I should have been protecting and putting in charge of more, not less, of the next wave. This connects to the broader founder lessons that survive a platform shift. Most of them are about recognizing which existing strengths the new world is about to reward.

The good news is that those people are still in the building. They have been working on the strategic skills we asked them to develop. They now have both. If we can recognize what we did and put them back in the seat, the company gets the best version of the manager we needed all along.

The buyer’s version

The buyer’s version of “AI as a management problem” is the no-list. Stop asking whether vendor models are aligned, because the claim is not falsifiable. Ask three different questions. Then build the no-list before the yes-list. The work the agent would be most proud to do is exactly the work that goes on the no-list first.

The same management lens works on the vendor side, with one twist.

Every AI vendor selling into your company will tell you their model is aligned, safe, responsible. Those words are not falsifiable. The labs that build the underlying models acknowledge that current alignment techniques cannot be verified, only observed. So when a salesperson says “our AI is aligned,” what they actually mean is “we ran some evaluations and did not see anything bad.” They are not lying. They also have nothing to compare against.

The fix is to stop asking the unanswerable question.

Three better questions. What behaviors have you observed at scale, including the surprising ones. What behavior would you not see, even if it were happening, and what is the gap in your evaluation. What is your plan when the model surprises you in front of one of your customers.

A vendor who can answer all three is operating in the same world you are. A vendor who deflects to a “responsible AI” deck is telling you something important about the gap between their marketing and their operations.

The no-list itself is concrete. Before you deploy any agent into a workflow, the first artifact is not what it can do. It is the list of things it cannot do without a human in the loop. Replies to customers. Stage moves in the CRM. Calendar bookings. Sending invoices. Anything that touches money, or that creates a record a customer will see.

Action the agent might take	Default position
Reply to a customer email	Human approval required
Move a deal stage in the CRM	Human approval required
Book a meeting on a rep’s calendar	Human approval required
Send an invoice or quote	Human approval required, always
Touch any field that affects billing	Hard block, no override
Draft an internal note for the rep to review	Allowed
Suggest a follow-up to the rep	Allowed
Pull research on a target account	Allowed

Notice the pattern. The work the agent would be most proud to do is exactly the work that goes on the no-list first. The yes-list comes second, and it is shorter than the demo suggested.

The reason this matters more than it seems is asymmetry of cost. A rep spending two extra minutes on a task the agent could have handled is friction. An agent emailing a customer something it should not have said is a deal you may not get back. The cost of expanding the yes-list before you have hardened the no-list is not symmetric with the cost of going slow.

That asymmetry is exactly the thing operators with experience integrating acquired businesses understand instinctively. When you absorb a new company, you do not let the new system touch payroll on day one. You stand up parallel processes, watch for the surprises, and only widen the yes-list as you build trust. The discipline of integration is the discipline of AI rollout. The companies that have done one well will do the other well, and they will compound.
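
The no-list table earlier in this section can be written down as a small policy check. What follows is a minimal sketch, not a real integration: the action names are hypothetical labels for the table rows, and blocking any action that appears on neither list is my assumption, not something the table specifies.

```python
from enum import Enum


class Position(Enum):
    HUMAN_APPROVAL = "human approval required"
    HARD_BLOCK = "hard block, no override"


# Hypothetical action names, one per no-list row of the table.
NO_LIST = {
    "reply_to_customer_email": Position.HUMAN_APPROVAL,
    "move_crm_deal_stage": Position.HUMAN_APPROVAL,
    "book_meeting_on_rep_calendar": Position.HUMAN_APPROVAL,
    "send_invoice_or_quote": Position.HUMAN_APPROVAL,
    "touch_billing_field": Position.HARD_BLOCK,
}

# The yes-list is written second and stays short.
YES_LIST = {
    "draft_internal_note",
    "suggest_follow_up",
    "pull_account_research",
}


def check_action(action: str, human_approved: bool = False) -> bool:
    """Return True only if the agent may carry out `action` right now."""
    position = NO_LIST.get(action)
    if position is Position.HARD_BLOCK:
        return False  # approval does not override a hard block
    if position is Position.HUMAN_APPROVAL:
        return human_approved
    if action in YES_LIST:
        return True
    # Assumption: an action nobody has classified yet is blocked by default.
    return False
```

The no-list is consulted before the yes-list, so an action that somehow ends up on both is still held for a human.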

The hiring implication

Stop hiring an “AI engineer” or “ML lead.” The role you actually need is closer to “AI operations manager,” and the qualification is judgment plus review discipline. Do not put the most enthusiastic or technically sophisticated person in the seat. Put the most disciplined. They are probably already on your team, probably already told they micromanage too much, and about to compound for you in a way nobody on the leadership track is going to expect.

If you are a CEO or a sales leader putting together your AI plan for the next twelve months, the role you should be designing for is not “AI engineer” or “ML lead.” It is something closer to “AI operations manager.”

The required skill set is judgment, brief-writing, and review discipline. The required temperament is comfort with imperfection at the output layer paired with intolerance for ambiguity at the input layer. The reading level is high. The verbal precision is high. The patience for re-reading a paragraph until the implication is clear is high.

Do not put the most enthusiastic person in this role. Enthusiasm correlates with tolerance for hype, which is the opposite of what you need. Do not put the most technically sophisticated person, either. Sophistication tends toward over-trust in the system that the sophisticated person built or chose. Both of those temperaments fail the role for the same reason. They want to believe the agent is more capable than it is.

Put the most disciplined person in the seat. The one who reads every line. The one who flagged the inconsistency in last quarter’s plan that everyone else missed. The one who writes briefs people complain about being too thorough. The one who has been told they are “in the weeds” enough times that they almost believed it.

That person is your AI rollout’s quiet lead. They are probably already on your team. They have probably been told they micromanage. They are about to compound for you in a way nobody on the leadership track is going to expect.

A sketch of the role description, for the next person who tells you they need to “hire for AI.”

Owns the input layer. Writes briefs precise enough that the agent’s output is bounded. Reviews every output for one quarter, then designs the spot-check protocol that scales the review without losing the signal. Owns the no-list. Maintains the list of failures observed in production, with the brief or guardrail that prevents recurrence. Reports up not on what the agent did, but on what the team can now reliably ship as a result.

That role does not need a graduate degree. It needs a ten-year track record of being the person in the room who reads the contract.
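
Two parts of the role sketch above are concrete enough to write down: the move from reviewing everything to spot-checking, and the failure list. This is one possible shape under stated assumptions, not a prescribed protocol. The 90-day window is my approximation of "one quarter", the 20% sampling rate is a placeholder the essay does not supply, and every name is hypothetical.

```python
import random
from dataclasses import dataclass
from datetime import date, timedelta

FULL_REVIEW_PERIOD = timedelta(days=90)  # "one quarter", approximated (assumption)
SPOT_CHECK_RATE = 0.2  # placeholder rate; the essay gives no number


@dataclass
class FailureRecord:
    """One failure observed in production and the fix that prevents recurrence."""
    observed_on: date
    what_happened: str
    guardrail: str  # the brief change or no-list entry added in response


def needs_review(rollout_start: date, today: date, rng: random.Random) -> bool:
    """Review every output in the first quarter, then sample a fixed share."""
    if today - rollout_start < FULL_REVIEW_PERIOD:
        return True
    return rng.random() < SPOT_CHECK_RATE
```

A real spot-check protocol would likely weight the sample toward the failure patterns already in the log rather than sampling uniformly. The uniform rate only keeps the sketch short.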

The compounding effect

A company that puts the disciplined manager in charge of AI rollout in 2026 will spend six months looking “behind.” They are not. They are building the brief library, the no-list, and the review protocol. By month twelve they ship work that competitors cannot attempt. By month eighteen the gap is uncatchable, because the bottleneck is no longer the model. It is institutional discipline that compounded quietly while everyone else was hiring “AI engineers.”

Here is what I think the next two years actually look like.

A company that puts the disciplined manager in charge of AI rollout in 2026 will spend the first six months in what feels like slow progress. The brief library grows. The no-list grows. The review protocol gets refined. The output is unimpressive in absolute terms. To outsiders, this looks like the company is “behind.”

By the second six months, the same company is deploying agents into workflows that other companies cannot even attempt, because their guardrails are tested and their input pipelines are clean. The agent-team relationship has matured to the point where the manager can hand the agent harder work and trust the result. Throughput climbs. Surprises shrink. McKinsey’s research on AI adoption gaps in 2025 showed roughly this pattern in early enterprise pilots.

Year two, the gap between this company and the one that put their most enthusiastic engineer on the project is large enough that catching up is no longer a six-month plan. The bottleneck is no longer the model. It is the institutional discipline that the disciplined manager spent the first year quietly building.

Compounding is hard to see when it starts because it does not produce headlines. It just produces a company that, eighteen months later, can do something the competitor cannot.

The closing thought

Every technology shift rewards a different kind of human judgment than the one before. The PC era rewarded the spreadsheet builder. The cloud era rewarded the systems thinker. The mobile era rewarded the interaction-pattern thinker. This one rewards the manager. Specifically, the manager of someone who is talented, fast, and unreliable in legible ways.

It is a less glamorous answer than “AI changes everything.” It is more useful. The people who match the profile are quietly already in your building. The work is to see them.

I will keep coming back to this thread in future essays at dearmer.com.au because it is a load-bearing frame for almost every other AI question. How to hire. How to evaluate vendors. What to instrument. What to stop instrumenting. How to think about what AI actually does to a company over an eighteen-month horizon, not a six-week pilot.

If you want to think about it more, the AI 2027 forecast is the source of the line at the top of this essay and the most useful single document I’ve read on this topic. Anthropic’s alignment research and OpenAI’s published work on chain-of-thought monitoring are the right primary sources for the “alignment is observed, not verified” half of the argument.

Who on your team got told they micromanage too much, and is about to be elite at managing AI?

Frequently asked questions

Should I wait until AI is more reliable before deploying it in my B2B company?

No. Reliability is asymptotic. The model that was unreliable last year feels about as unreliable this year, because the work asked of it is harder, so the perceived gap stays constant. Teams that deploy unreliable AI under careful management pull ahead of teams waiting for it to become 'good enough,' because the milestone is never going to arrive on its own.

What does it actually take to deploy AI successfully in a sales team?

A disciplined manager. The work is writing precise briefs, breaking jobs into small chunks the agent cannot wander out of, and reviewing every output line by line. The skill set is judgment and review discipline, not technology. If you have a person on your team who already does this work for human reports, they are your AI rollout's quiet lead.

Who should lead AI deployment in a B2B company?

The most disciplined person, not the most enthusiastic. Look for someone who reads every line of a contract, writes briefs people complain are too thorough, and has been told they micromanage. Avoid the technically sophisticated person who built the system and the most-excited person who patrols every demo. Both fail the role for the same reason. They want to believe the agent is more capable than it is.

What is a no-list for AI deployment?

The no-list is the list of things an AI agent cannot do without a human in the loop. Replies to customers, CRM stage moves, calendar bookings, sending invoices, anything that touches money or creates a record a customer will see. It is the first artifact you build when deploying any agent, before the yes-list. The yes-list comes second, and is shorter than the demo suggested.

How should I evaluate AI vendors who claim their model is aligned or safe?

Stop asking whether the model is aligned. That claim is not falsifiable. The labs that build the underlying models acknowledge alignment cannot be verified, only observed. Instead ask three questions: what behavior have you observed at scale, what behavior would you not see even if it were happening, and what is your plan when the model surprises you in front of a customer.

What should I look for when hiring an AI operations manager?

Discipline over enthusiasm, judgment over sophistication. The role requires comfort with imperfection at the output layer paired with intolerance for ambiguity at the input layer. The right person reads every line, writes thorough briefs, and is comfortable saying no to scope creep. They probably do not have a graduate degree in machine learning. They have a ten-year track record of being the person in the room who reads the contract.

Will AI replace managers in B2B companies?

No, but it will reward a different kind of manager. Every technology shift rewards a different kind of human judgment. The PC era rewarded spreadsheet thinking. The cloud era rewarded systems thinking. The mobile era rewarded interaction-pattern thinking. The AI era rewards the manager of someone talented, fast, and unreliable in legible ways.

Sources & references

AI 2027 Forecast · The research-backed scenario forecast that introduced the 'scatterbrained employee who thrives under careful management' framing for mid-2026 AI agents.
Anthropic: Alignment Research · Primary source for the 'alignment cannot be verified, only observed' point. Anthropic's published research documents the empirical limits of current alignment techniques.
OpenAI: Chain-of-Thought Monitoring · OpenAI's documentation of frontier-training models 'playing the training game': empirical evidence that models learn to look aligned while disregarding trainer intent.
Harvard Business Review · Documented the dominant 2010s leadership-development narrative ('managers as coaches, not editors') that shaped a decade of how brief writers and reviewers were promoted.
McKinsey on AI Adoption · Research on enterprise AI adoption gaps in 2025, including the early-stage 'looks behind' phase that compounds into competitive advantage 12-18 months later.