Managing AI agents like employees
The deployment problem
"Agentic enterprise" is the phrase of the moment. Salesforce's Marc Benioff used it at Davos. Microsoft has been building toward it all year. McKinsey now runs 25,000 personalized AI agents alongside its 40,000 human employees. Calix has over 700 employee-built agents deployed internally. Gartner projects that 40% of enterprise applications will embed agents by mid-2026, up from less than 5% a year ago.
The deployment wave is real. The management infrastructure behind it is not.
Most of these agents shipped without anyone in charge of them. No explicit goals. No defined owner. No review cadence. No criteria for what "good" looks like. No security boundaries beyond vendor defaults. The result is what you'd get if you hired a thousand new employees, gave them all laptops, and forgot to assign managers.
Companies that would never onboard a human without a role description, a reporting structure, and a 90-day review are deploying agents with none of those things. The agents produce output. Whether that output is correct, secure, or aligned with any business goal is a question nobody is accountable for answering.
This playbook treats agents as what they functionally are: a new class of worker. Not tools you configure once. Workers you manage continuously, with goals, supervision, trust boundaries, regular evaluation, and a clear place in the org chart.
The mental model shift
The most useful thing you can do before deploying an agent is stop thinking of it as software.
Software is deterministic. You configure it, test it, deploy it, and it does the same thing every time until you change it. Agents are not deterministic. They interpret instructions. They make judgment calls. They produce different outputs from the same inputs depending on context. They degrade over time as the world around them changes. And they can be manipulated by adversarial input they encounter during normal operation.
This makes agents more like employees than like software. It means they need the same management infrastructure.
Consider what you'd do with a new hire on day one. You'd explain the role. Set expectations for 30, 60, and 90 days. Assign a manager who reviews their work. Give access only to the systems they need. Check in regularly, provide feedback, and adjust responsibilities based on demonstrated competence.
Now consider what most companies do with a new agent. Deploy it. Move on.
The gap between those two approaches is where the waste lives. Agents without goals produce activity without outcomes. Agents without supervision produce unchecked output. Agents without security boundaries operate with privileges that would give any security auditor nightmares.
Ninety percent of AI agents in production today are over-permissioned, holding roughly ten times more privileges than their tasks require. They move 16 times more data than human users. One Glean agent was observed downloading over 16 million files while every other user and application combined accounted for one million. These aren't theoretical risks. This is what happens when you deploy workers and forget to scope their access.
The shift is simple: if you wouldn't do it with a new hire, don't do it with an agent.
Setting agent OKRs
Every agent needs a reason to exist, stated in terms a business owner would recognize. Not "process documents" but "reduce invoice processing time from four hours to 30 minutes with 95% accuracy." Not "answer questions" but "resolve 60% of tier-1 support tickets without escalation, with a satisfaction score above 4.2."
This isn't extra work. It's the minimum required to know whether the agent is doing its job.
Start with the outcome, not the capability. "What can this agent do?" is the wrong first question. "What business result does it need to produce?" is the right one. Agents can generate enormous volumes of output that accomplish nothing. Activity without outcomes is the most common failure mode. The only defense is defining the outcome before deployment.
Make the metrics specific and measurable. "Improve efficiency" is not a metric. "Reduce average handling time from 12 minutes to four" is. "Better customer experience" is not a metric. "Increase first-contact resolution from 45% to 70%" is. If you can't put a number on it, the agent's performance review becomes a feelings-based conversation. Those go poorly with human employees. They're worse with agents.
Include accuracy and error thresholds. The best model on the Mercor APEX-Agents benchmark, which tests agents on real professional tasks from consulting, banking, and law, scored 24% accuracy. A Workday study found that 37% of time saved by AI was cancelled out by rework. These are baseline expectations, not worst cases. Every agent OKR needs an explicit accuracy target and a threshold for unacceptable errors. When the threshold is crossed, the response should be automatic: increase supervision, diagnose, fix or retire.
Include security boundaries. How much data does this agent access? What's the scope of its permissions? How often is access reviewed? Ninety-seven percent of organizations that experienced AI-related breaches reported lacking proper access controls. The OKR is where you define what proper access looks like for this agent, on this task.
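Writing the OKR down as a structured artifact, not a slide, makes it reviewable and enforceable. Here's a minimal sketch of what that could look like; the class name (AgentOKR), field names, and numbers are illustrative placeholders rather than a standard, and the invoice example reuses the target from above.

```python
from dataclasses import dataclass, field

@dataclass
class AgentOKR:
    """One agent, one owner, one measurable outcome."""
    agent_name: str
    owner: str                      # the human accountable for results
    outcome: str                    # a business result, not a capability
    target_metrics: dict            # metric name -> target value
    accuracy_target: float          # minimum acceptable accuracy, e.g. 0.95
    error_threshold: float          # error rate that triggers increased supervision
    allowed_systems: list = field(default_factory=list)  # least-privilege scope
    review_cadence_days: int = 30   # how often the owner re-checks all of the above

# Illustrative example: the invoice-processing agent from the text.
invoice_agent = AgentOKR(
    agent_name="invoice-processor",
    owner="ap-team-lead",
    outcome="Reduce invoice processing time from 4 hours to 30 minutes",
    target_metrics={"avg_processing_minutes": 30},
    accuracy_target=0.95,
    error_threshold=0.05,
    allowed_systems=["erp:invoices:read", "erp:invoices:update-status"],
    review_cadence_days=30,
)
```

Everything the rest of this playbook asks for, from trust-ladder decisions to permission audits, reads off a record like this one.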
Assigning supervision
Every agent needs a human who is accountable for its output. Not the engineering team that built it. Not the vendor that sold it. A specific person on the business side who owns the agent's results the way a manager owns their team's performance.
This person answers three questions.
Who reviews the work? Someone checks what the agent produces. For new agents, that means reviewing most of the output. For established agents with proven track records, it might mean sampling 10-20%. But someone checks. Always. The moment nobody is reviewing an agent's work is the moment you've accepted whatever it produces as correct, including the times it isn't.
What's the escalation path? When the agent encounters something outside its scope, what happens? If the answer is "it does its best," you've built a system that fails silently. Define the boundaries: these situations require a human decision, these get flagged but continue, these are cleared for independent action.
Escalation paths double as security controls. Every point where an agent pauses for human confirmation is a point where prompt injection or manipulation gets caught. OpenAI's security team recommends narrow, specific instructions over broad ones: "Giving an agent a very broad instruction such as 'review my emails and take whatever action is needed' can make it easier for hidden malicious content to mislead the model." Tight scope and clear escalation make agents harder to exploit.
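One way to make those boundaries hold even when the agent is confident is to encode them as a default-deny routing table. A minimal sketch, with illustrative action names; the real categories come from the supervisor's definitions above.

```python
from enum import Enum

class Disposition(Enum):
    REQUIRE_HUMAN = "require_human"     # pause and wait for a decision
    FLAG_AND_CONTINUE = "flag"          # proceed, but surface for review
    AUTO_APPROVED = "auto"              # inside the agent's cleared scope

# Illustrative boundaries for a support agent. Anything not listed defaults
# to requiring a human, which is also where out-of-scope actions pushed by
# prompt injection get caught.
ESCALATION_RULES = {
    "send_refund": Disposition.REQUIRE_HUMAN,
    "change_account_email": Disposition.REQUIRE_HUMAN,
    "reply_with_kb_article": Disposition.AUTO_APPROVED,
    "close_duplicate_ticket": Disposition.FLAG_AND_CONTINUE,
}

def route(action: str) -> Disposition:
    """Default-deny: unknown actions always escalate to a person."""
    return ESCALATION_RULES.get(action, Disposition.REQUIRE_HUMAN)
```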
Who owns the outcome? When AI success is measured in model accuracy, it stays in the lab. When it's tied to a metric on someone's dashboard, it ships to production and gets maintained. If nobody's performance review includes "the agent I manage hit its targets," nobody is paying attention.
The trust ladder
New hires start supervised. They earn autonomy by demonstrating competence. Agents work the same way.
Four levels. Each has explicit criteria for advancement and clear boundaries for what the agent can do.
Level 1: Fully supervised. Every output is reviewed before it reaches anyone else. The agent produces drafts, suggestions, recommendations. A human approves, edits, or rejects each one. This is where every new agent starts, regardless of how well the demo went.
Security at this level: read-only access to the data it needs. No sending emails, making purchases, modifying records, or taking external action without approval. Permissions scoped to the minimum for its specific task.
Level 2: Spot-checked. The agent has demonstrated consistent quality over a meaningful sample. Human review shifts from every output to regular sampling. If the agent processes 100 documents a day, a human reviews 15-20 of them, selected at random. Accuracy is tracked quantitatively.
Security at this level: the agent can take limited actions within a defined scope. Routing requests, updating specific fields, responding to routine inquiries. No actions with financial impact, no modifying access controls, no interacting with external systems without approval.
Level 3: Autonomous with audit trail. The agent operates independently within its scope. Human review happens weekly or monthly. Every action is logged, and the audit trail gets reviewed on a set cadence. This is where most competent agents should live long-term.
Security at this level: anomaly detection monitors for unusual patterns. A spike in data access, attempts to reach systems outside scope, or shifts in output quality trigger automatic escalation. The supervisor investigates before the agent continues.
Level 4: Fully autonomous. The agent operates without regular human oversight. This level exists as a theoretical ceiling, not a practical target. Zscaler's 2026 AI Security Report found that 90% of enterprise AI systems could be compromised in under 90 minutes during red team testing. Median time to first critical failure: 16 minutes. In the most extreme case, defenses fell in a single second. Agents at Level 4 with broad permissions and no human in the loop are the highest-value targets in your operation.
Most agents should never reach Level 4. The ladder isn't a race to the top. It's a framework for matching autonomy to demonstrated reliability, where security requirements get stricter the further you move from human oversight.
One thing worth noting: the capability floor is rising fast. A year ago, the best models scored 5-10% on the same professional tasks where they now hit 24%. The trust ladder needs to be re-evaluated regularly. An agent that required Level 1 supervision six months ago might be ready for Level 2 today. Build the review into the cadence.
Moving up. Define quantitative thresholds before deployment: accuracy sustained above X% over Y period, error rate below Z%, zero security incidents, no quality drift. Write the criteria the same way you'd write promotion criteria before a hire starts.
Moving down. Trust goes both ways. When accuracy drops, output leaves expected range, or a security alert fires, the response is automatic: reduce autonomy, increase oversight, diagnose. Demotion isn't failure. It's the system working.
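Written down, the criteria become data and the move up or down becomes a mechanical comparison rather than a judgment call made after the fact. A minimal sketch; the thresholds and the cap at Level 3 are illustrative choices, not a standard, and yours come from the agent's OKR and your own risk tolerance.

```python
from dataclasses import dataclass

@dataclass
class LadderCriteria:
    min_accuracy: float             # sustained accuracy required to move up
    min_days_at_level: int          # how long the track record must be
    max_error_rate: float           # crossing this moves the agent down
    max_security_incidents: int = 0

@dataclass
class AgentRecord:
    level: int                      # 1..4 on the trust ladder
    accuracy: float                 # measured over the review window
    error_rate: float
    days_at_level: int
    security_incidents: int

def next_level(agent: AgentRecord, criteria: LadderCriteria) -> int:
    """Demotion checks run first: trust goes both ways."""
    if (agent.error_rate > criteria.max_error_rate
            or agent.security_incidents > criteria.max_security_incidents):
        return max(1, agent.level - 1)      # reduce autonomy, increase oversight
    if (agent.accuracy >= criteria.min_accuracy
            and agent.days_at_level >= criteria.min_days_at_level):
        return min(3, agent.level + 1)      # Level 4 stays a ceiling, not a target
    return agent.level                      # hold: keep the current supervision
```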
Performance reviews
Agents don't improve on their own. They drift. The underlying model gets updated. The data changes. The business context shifts. An agent performing well three months ago may be producing subtly wrong output today, and without regular evaluation, nobody notices until the damage compounds.
Set a cadence. Monthly for agents at Levels 1 and 2. Quarterly for Level 3. Pull a sample of recent outputs. Score them against OKR criteria. Compare performance to baseline. Look for trends.
Track accuracy at the step level. When a multi-step process fails, knowing the overall error rate tells you nothing useful. Knowing which step broke tells you everything. The Mercor benchmark showed that agents fail primarily on finding and synthesizing information across multiple sources, not on reasoning once they have the right inputs. Step-level tracking reveals whether your agent is hitting the same wall or different ones. That distinction determines whether you retrain, restructure, or replace.
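Step-level tracking only requires logging a pass/fail per step per run. A minimal sketch, assuming hypothetical step names for a research-style workflow:

```python
from collections import defaultdict

def step_accuracy(runs: list[dict]) -> dict[str, float]:
    """Each run logs pass/fail per step; aggregate to see which step breaks."""
    passes, totals = defaultdict(int), defaultdict(int)
    for run in runs:
        for step, ok in run.items():
            totals[step] += 1
            passes[step] += int(ok)
    return {step: passes[step] / totals[step] for step in totals}

# Illustrative: overall success is low, but the failures are concentrated in
# retrieval, not reasoning -- which changes what you fix.
runs = [
    {"retrieve_sources": False, "synthesize": True,  "draft_answer": True},
    {"retrieve_sources": True,  "synthesize": True,  "draft_answer": True},
    {"retrieve_sources": False, "synthesize": True,  "draft_answer": False},
]
print(step_accuracy(runs))
# roughly: retrieve_sources 0.33, synthesize 1.0, draft_answer 0.67
```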
Watch for drift. The support agent that resolved 65% of tickets is now at 58%. The document processor that held 96% accuracy has slipped to 89%. Without tracking, these drops hide in normal variation. With tracking, they're signals.
Common causes: the vendor updated the model, the input data shifted in structure or content, a business process changed and nobody updated the agent's instructions, or accumulated context is diluting focus.
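The tracking itself can be as simple as comparing each review's number against the baseline recorded at deployment. A minimal sketch, with the tolerance as an illustrative placeholder:

```python
def drift_alert(baseline: float, current: float, tolerance: float = 0.05) -> bool:
    """Flag when a metric has slipped more than the tolerance below its baseline."""
    return (baseline - current) > tolerance

# Illustrative values from the text: ticket resolution rate 65% -> 58%.
if drift_alert(baseline=0.65, current=0.58):
    print("Drift detected: increase supervision and diagnose before the next review.")
```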
Audit the security posture. Every review includes a permissions check. Is the agent still accessing only what it needs? Has its data footprint grown? Are there access patterns that don't match its role? Permissions tend to expand over time as teams add capabilities without removing old ones. The review catches that creep.
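If the approved scope is recorded (for example, in the agent's OKR) and the currently granted scope can be exported from your identity provider, the creep check is a set difference. A minimal sketch with illustrative permission names:

```python
def permission_creep(approved: set[str], granted: set[str]) -> set[str]:
    """Anything granted but never approved is creep and should be revoked."""
    return granted - approved

approved = {"erp:invoices:read", "erp:invoices:update-status"}
granted = {"erp:invoices:read", "erp:invoices:update-status", "erp:vendors:write"}

print(permission_creep(approved, granted))   # {'erp:vendors:write'}
```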
Know when to retire. Not every agent deserves continued investment. If it consistently misses OKRs, if its process has changed beyond what retraining can fix, or if supervision costs exceed the value produced, turn it off. Retiring an underperforming agent is better than maintaining one that gives false confidence in a process nobody is checking.
The human-agent org chart
The hardest part of agent management isn't technical. It's organizational. Agents enter teams with existing dynamics, established workflows, and real people who have reasonable questions about what these new additions mean for their roles.
Agents augment roles. They don't fill headcount slots. This isn't messaging. It's architecture. When you deploy an agent to handle tier-1 support tickets, the humans don't disappear. Their role shifts: complex cases, quality review, defining decision boundaries. The team gets more capable. The people become more valuable. But only if you design for it.
The companies treating agents as headcount replacements are discovering a painful irony: cutting the people who understand the work makes the agents worse. Agents execute patterns that experienced humans defined. Remove the humans and you remove the knowledge the agents depend on.
Map agents to the org chart. Every agent has a visible place in the team structure: what it does, who supervises it, who reviews output, what authority it has. KPMG found that organizations getting real results from agents manage them through the same lifecycle as employees: onboarding, performance management, offboarding. Invisible agents are unmanaged agents.
Define how information flows. Humans need to see what agents are doing, what they've decided, and where they've flagged uncertainty. A dashboard showing agent activity and decision logs beats a Slack channel full of notifications everyone learns to ignore.
Address the room. Your team has questions. Some version of "is this going to replace me?" will be present whether anyone says it or not. Answer directly, with specifics. "This agent handles X. Your role shifts to Y and Z, which require your judgment. Here's what that looks like." Calix built a culture where employees build their own agents for their own workflows. That works because the framing matches reality: agents handle repetition, humans handle everything that requires thinking.
Start with one
Microsoft found that 82% of leaders plan to expand workforce capacity with agents and digital labor. McKinsey projects that small teams of two to five people will supervise 50-100 specialized agents running end-to-end processes. Those numbers describe where this is headed. Not where to start.
Start with one agent. One well-defined task. One set of OKRs. One supervisor. Level 1 on the trust ladder. Monthly reviews.
Get that right. Learn what works in your context, with your team, on your workflows. Then add the second. The management infrastructure you build around the first one makes the second faster and the tenth routine.
If you haven't evaluated which workflows are ready for agents, that step comes first. Our readiness scorecard gives you a framework for figuring out where to start.
The companies pulling ahead aren't deploying the most agents. They're managing the ones they have. Deployment is the easy part. Management is where the value compounds.