AI Strategy
January 30, 2026

AI agents can't replace your best people (and that's exactly the point)

The best AI model scored 24% on real consulting, banking, and legal tasks. A research paper argues reliable agents are mathematically impossible beyond a certain complexity. This isn't a failure story. It's a strategy story.

The benchmark nobody passed

Mercor, the training-data company, just released APEX-Agents: a benchmark that tests leading AI models on actual white-collar tasks drawn from consulting, investment banking, and corporate law. Not toy problems. Not academic exercises. Real work, designed and scored by professionals who do these jobs every day.

The best model, Gemini 3 Flash, scored 24% accuracy. GPT-5.2 hit 23%. Opus 4.5, Gemini 3 Pro, and GPT-5 landed around 18%. The vast majority of the time, the models came back with wrong answers or no answer at all.

The biggest failure point wasn't reasoning. It was finding information. Mercor CEO Brendan Foody described the gap to TechCrunch: "The way we do our jobs isn't with one individual giving us all the context in one place. In real life, you're operating across Slack and Google Drive and all these other tools." Real knowledge work means juggling a dozen systems and contexts at once while holding the whole picture in your head. The models couldn't.

This is the hardest data we've seen on agent capabilities. Real tasks. Real scoring. One in four, on a good day.

The math behind the gap

A paper by Vishal and Varin Sikka, surfaced by WIRED last week, goes further than any benchmark. "Hallucination Stations" argues that the limitations aren't just current. They're structural.

The core finding: transformer-based models can only handle computations up to a fixed complexity ceiling. Push a task past that ceiling and the model will hallucinate. Not might. Will. The authors prove it using the Hartmanis-Stearns time hierarchy theorem, and Vishal Sikka doesn't mince words: "There is no way they can be reliable."

The practical version is simpler. When you chain steps together in an agentic workflow, accuracy compounds downward. If each step is 98% accurate and a task requires 20 steps, overall reliability drops to around 67%. At 95% per step, a 20-step task falls to 36%. The more complex the workflow, the worse the math gets.
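
If you want to see that compounding for yourself, a few lines of Python reproduce the figures above. This is a first-order sketch that assumes each step succeeds or fails independently of the others; real workflows are messier, but the direction of the math doesn't change: overall reliability is roughly the per-step accuracy raised to the number of steps.

```python
def chained_reliability(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step in a chain succeeds,
    assuming each step fails independently of the others."""
    return per_step_accuracy ** steps

# The two cases cited above: a 20-step workflow at 98% and at 95% per step.
for p in (0.98, 0.95):
    print(f"{p:.0%} per step over 20 steps -> {chained_reliability(p, 20):.0%} overall")
# 98% per step over 20 steps -> 67% overall
# 95% per step over 20 steps -> 36% overall
```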

And verification faces the same constraint. Checking an agent's work often requires equal or greater computational complexity than the original task. You can't solve the reliability problem by adding another agent to watch the first one.

This isn't a "give it time" situation. Compounding error isn't a bug to be fixed. It's math.

Some of this works

None of this means agents are useless. McKinsey now runs 25,000 personalized AI agents alongside its 40,000 human employees. Calix has over 700 employee-built agents deployed internally. These aren't experiments.

But look at what the agents actually do. Calix CEO Michael Weening describes it plainly: "Agentic AI is purely a workflow, and every task in a workflow is an agent." His teams use agents to write emails faster, generate subscriber offers, automate diagnostics, and route customer interactions. Not replacing analysts. Not running strategy. Handling the repetitive, well-bounded parts of workflows that humans already understand.

The pattern holds everywhere agents produce results: the task is bounded, the context is provided, and a human built the workflow the agent is executing. The failures come when you hand an agent an open-ended problem and expect it to figure things out.

The invisible factory floor

The debate keeps getting stuck on the wrong question: "Can agents replace knowledge workers?" The Mercor data makes the answer clear. No. But that was always the wrong thing to ask.

Better question: can agents encode and scale what your experienced people already know?

Think of it as an invisible factory floor. Manufacturing works because complex production gets decomposed into repeatable, quality-controlled steps. Each step is well-defined and measurable. The whole system produces consistent output because nobody expects a single machine to figure out the entire process on its own.

Knowledge work has resisted this decomposition for decades. The work is "creative." It requires "judgment." Both true. But within any knowledge worker's day, there are hundreds of micro-tasks that require neither. They require executing a known procedure, in a known context, to a known quality bar.

That's where agents belong. Not as autonomous employees trying to work across Slack and legal databases and spreadsheets simultaneously. As infrastructure. The invisible factory floor underneath your best people, running the repeatable parts at a speed and scale no human team can match.

Matt Fitzpatrick, CEO of Invisible Technologies, put it directly in their 2026 Agentic Field Report: "2026 is about going operational, not going autonomous." He goes further: "True competitive advantage won't come from cutting 10 roles to five. It will come from getting 1,000-person output from the same 10."

That single sentence is a better agent strategy than most companies' entire AI roadmap.

What the winners do differently

The companies getting results from agents share a few patterns.

They decompose before they deploy. Instead of handing an agent a complex task and hoping, they break work into bounded steps with clear inputs, expected outputs, and quality checks. The agent handles one step. Not the whole workflow.

They start with their best people, not instead of them. Agents encode what experienced humans already know how to do. The expert designs the process. The expert defines what "good" looks like. The agent executes at scale. This is why companies that cut people to fund AI found the cuts made the AI harder to implement, not easier.

They measure at the step level. When a 20-step process fails, knowing the overall accuracy tells you nothing. Knowing which step broke tells you everything. Step-level measurement turns compound error into compound reliability, one fix at a time.
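
Here's a minimal sketch of what those two patterns look like together in code. The names (Step, run_workflow, the toy checks) are illustrative, not a real framework: each step is bounded, each carries an expert-defined quality gate, and the runner records pass rates per step, so a failure points at one step instead of at the whole workflow.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Illustrative sketch only: these names are hypothetical, not a real framework.
# The shape is the point: bounded steps, a quality gate per step, and failures
# attributed to a specific step rather than to the workflow as a whole.

@dataclass
class Step:
    name: str
    run: Callable[[dict], dict]      # the agent (or plain function) for this one step
    check: Callable[[dict], bool]    # the expert-defined bar for "good" output

@dataclass
class StepStats:
    attempts: int = 0
    passes: int = 0

def run_workflow(steps: List[Step], payload: dict, stats: Dict[str, StepStats]) -> dict:
    """Run bounded steps in order; stop at, and record, the step that fails its check."""
    for step in steps:
        record = stats.setdefault(step.name, StepStats())
        record.attempts += 1
        payload = step.run(payload)
        if not step.check(payload):
            raise RuntimeError(f"quality check failed at step: {step.name}")
        record.passes += 1
    return payload

# Two toy steps: pull a figure, then draft the sentence that uses it.
steps = [
    Step("extract_revenue",
         run=lambda p: {**p, "revenue": 1_250_000},
         check=lambda p: isinstance(p.get("revenue"), int)),
    Step("draft_sentence",
         run=lambda p: {**p, "text": f"Revenue was ${p['revenue']:,}."},
         check=lambda p: "Revenue" in p.get("text", "")),
]

stats: Dict[str, StepStats] = {}
run_workflow(steps, {"doc_id": "example-filing"}, stats)
for name, s in stats.items():
    print(f"{name}: {s.passes}/{s.attempts} passed")
```

When reliability drops, the per-step pass rates tell you exactly where to intervene, which is what turns compound error into compound reliability.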

Everest Group CEO Jimit Arora calls this the difference between "building agents that can do actions" and granting "true agency." Most companies are still in the first category. The second will take years.

The real question

Gartner projects that over 40% of agentic AI projects will be cancelled by the end of 2027. Not because agents don't work. Because companies deployed them as autonomous employees instead of building operational infrastructure.

The executive who reads the Mercor benchmark and starts thinking about which roles to cut is headed for that 40%. The one who starts thinking about which expert knowledge to encode and scale is going to pull ahead.

Your best people aren't threatened by agents. They're the reason agents work at all. Their judgment, their instinct for what "good" looks like in your specific context. No model is going to benchmark its way to that. It was never the model's job.

The model's job is to take what those people know and run it at a scale they never could alone. That's the point.
