The AI readiness scorecard
The evaluation problem
Eighty-eight percent of organizations now use AI in at least one business function. Ninety-five percent of generative AI pilots fail to deliver measurable financial returns. Those two numbers are both true, and the gap between them is where most companies are stuck.
The usual explanation is that companies are "doing AI wrong." That's not helpful. A better explanation: most companies are picking the wrong workflows to automate. They choose based on vendor demos, executive enthusiasm, or whichever process annoys people the most. The evaluation step gets skipped because it feels like unnecessary friction when the mandate is to move fast.
It is not unnecessary friction. It is the step that determines whether your AI investment produces a case study or a write-off.
This playbook gives you a scoring framework for evaluating workflows before committing resources. Not a maturity model. Not a strategy deck. A practical method for answering "where should we start?" with evidence instead of intuition.
The spectrum that explains everything
Aaron Levie, the CEO of Box, made an observation that should be pinned to the wall of every AI strategy meeting: coding was the first knowledge work domain where AI agents took off because it has the best possible characteristics for AI. The entire domain is text. The work is modular and self-contained. Feedback loops are tight. You can verify the output immediately. Almost no other knowledge work has these properties, which is why AI agents will take longer to show up everywhere else.
That observation is the foundation of this entire scorecard. The question isn't "can AI do this?" The question is "does this workflow have the characteristics that let AI succeed?"
Coding has all of them. Most knowledge work has some. A few workflows have none. Your job is to figure out which category each workflow falls into before you spend money.
Six criteria that predict success
Every workflow can be scored against six dimensions. None of them have to do with how sophisticated your AI model is. All of them have to do with the structure of the work itself.
1. Text-based inputs and outputs
AI processes text. That's the core capability. Workflows that already run on text (documents, emails, chat logs, forms, code, structured data) are natural candidates. Workflows that depend on physical observation, spatial reasoning, or tacit knowledge carried in someone's head are not.
The question to ask: could a new hire do this job using only a computer screen, or do they need to walk a factory floor, read a room, or watch someone's hands?
2. Structured and repeatable
A workflow that follows roughly the same steps each time is far easier to automate than one that reinvents itself with every instance. Invoice processing follows a pattern. Contract review follows a pattern. Quarterly strategic planning does not.
Repeatability matters because AI learns from patterns. If the workflow is different every time, the AI has nothing stable to learn from. It will produce output that sounds plausible and misses the point.
3. Modular steps
Can the workflow be broken into discrete, independent steps? Or is it one continuous stream of judgment where each decision depends on everything that came before it?
Modular workflows let you apply AI to the steps where it adds value and keep humans on the steps where it doesn't. Monolithic workflows force an all-or-nothing choice. All-or-nothing choices usually end at nothing.
Willem Ave, head of product at Square, put it well in a recent interview: companies are getting creative about "connecting [AI] to deterministic systems that will take the variability out of AI results." The insight is that AI works best when paired with structured processes, not when left to generate answers in open space. Modularity is what makes that pairing possible.
4. Tight feedback loops
How quickly can you tell whether the AI's output was good? In coding, you run the tests. In document summarization, you read the summary. In customer support triage, you check the routing accuracy. These are tight feedback loops. You know within minutes or hours whether the result was right.
Compare that to strategic recommendations, where you might not know if the advice was good for six months. Or hiring decisions, where the feedback loop stretches across a full performance cycle. The looser the feedback loop, the harder it is to improve the AI over time, and the longer you'll spend in the "is this working?" phase without a clear answer.
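When the output is a discrete decision, the feedback loop can be made concrete in a few lines. A minimal sketch for the support-triage case, assuming the ticketing system records both the queue the AI suggested and the queue a human finally confirmed; the field names and sample data are hypothetical:

```python
# Minimal sketch: measuring a tight feedback loop as routing accuracy.
# Assumes each resolved ticket records the AI's suggested queue and the
# queue a human ultimately confirmed; names and data are hypothetical.

def routing_accuracy(tickets: list[dict]) -> float:
    """Share of tickets the AI routed to the queue a human confirmed."""
    if not tickets:
        return 0.0
    correct = sum(1 for t in tickets if t["ai_queue"] == t["final_queue"])
    return correct / len(tickets)

resolved_this_week = [
    {"id": 101, "ai_queue": "billing", "final_queue": "billing"},
    {"id": 102, "ai_queue": "technical", "final_queue": "bug_report"},
    {"id": 103, "ai_queue": "billing", "final_queue": "billing"},
]

print(f"Routing accuracy: {routing_accuracy(resolved_this_week):.0%}")
```

If a number like this can be produced every week without a research project, the feedback loop is tight enough to improve on. If it can't, that gap is the finding.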
5. Tolerance for error
Every AI system makes mistakes. The question is what happens when it does.
A misrouted support ticket gets corrected by a human agent. A bad first draft of a document gets edited. These are workflows with error tolerance. A wrong medical dosage, a missed compliance flag, a fabricated legal citation: these are workflows where a confident-sounding wrong answer is worse than no answer at all.
High tolerance for error means AI can serve as a capable first pass that humans refine. Low tolerance means you need an entirely different architecture: AI assists, humans decide, and nothing reaches the customer without review.
6. Existing digital infrastructure
Does the workflow already live in software? Are the inputs already digitized? Are there APIs, databases, and systems that an AI tool can plug into?
A workflow that runs on paper forms, phone calls, and institutional memory in someone's head is not a bad AI candidate forever. But it needs a digitization step before AI is even relevant. That's a prerequisite, not a parallel workstream, and skipping it is how companies end up with an AI solution that nobody can feed data into.
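The six criteria work as a mental checklist, but writing them down as a structure keeps the scoring consistent across workflows. A minimal sketch, assuming a 0–2 scale per criterion (0 = no, 1 = partial, 2 = yes); the scale and the example scores are illustrative conventions, not part of the framework itself:

```python
from dataclasses import dataclass, fields

# Minimal sketch of the readiness scorecard. The 0-2 scale
# (0 = no, 1 = partial, 2 = yes) is an assumed convention.

@dataclass
class WorkflowScorecard:
    name: str
    text_based: int              # inputs and outputs already live as text
    structured_repeatable: int   # roughly the same steps every time
    modular_steps: int           # discrete steps vs. one stream of judgment
    tight_feedback: int          # output quality measurable in minutes or hours
    error_tolerance: int         # mistakes are catchable and cheap to correct
    digital_infrastructure: int  # APIs, databases, digitized inputs

    def total(self) -> int:
        return sum(getattr(self, f.name) for f in fields(self) if f.name != "name")

invoice_processing = WorkflowScorecard(
    name="Invoice processing",
    text_based=2, structured_repeatable=2, modular_steps=2,
    tight_feedback=2, error_tolerance=1, digital_infrastructure=2,
)
print(f"{invoice_processing.name}: {invoice_processing.total()} / 12")
```

The point isn't the arithmetic. It's that every workflow gets scored against the same six dimensions before anyone argues about tools.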
Scoring in practice
The criteria make more sense when applied to real workflows. Here are three examples that illustrate different points on the readiness spectrum.
High readiness: invoice processing
A mid-size company processes 2,000 invoices per month. Each invoice arrives as a PDF or email attachment. A human opens it, extracts the vendor name, amount, line items, and payment terms, matches it against a purchase order, flags discrepancies, and routes it for approval.
Score it against the criteria:
Text-based inputs and outputs. Yes. Invoices are documents. The inputs and outputs are all text and numbers.
Structured and repeatable. Yes. Every invoice follows roughly the same extraction and matching process.
Modular steps. Yes. Extraction, matching, flagging, and routing are distinct steps that can be addressed independently.
Tight feedback loops. Yes. You can compare the AI's extraction against the source document immediately. Accuracy is measurable within seconds.
Tolerance for error. Moderate. A misextracted field gets caught at the matching step. A missed discrepancy is more serious, but the approval step provides a human checkpoint.
Existing digital infrastructure. Yes. Invoices arrive digitally. The ERP system has APIs. Purchase orders are in the database.
This workflow is ready for AI today. A commercial document processing tool could handle it without custom development. The company in this example would likely see measurable time savings within weeks.
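The modularity also shows up in how such a system would be wired: AI can be confined to the extraction step while deterministic code handles matching and routing, and humans keep the approval checkpoint. A rough sketch; every function name, field, and the stubbed-out extraction call is hypothetical and stands in for whatever document-processing tool you'd actually use:

```python
# Hypothetical sketch of a modular invoice pipeline: AI handles extraction,
# deterministic code handles matching and routing, humans keep approval.

def extract_fields(invoice_text: str) -> dict:
    # Stand-in for the AI extraction step; a real system would call a
    # document-processing model here and return structured fields.
    return {"vendor": "Acme Supply", "po_number": "PO-1042", "amount": 1280.00}

def match_purchase_order(fields: dict, purchase_orders: dict) -> list[str]:
    # Deterministic check: compare extracted fields against the PO on record.
    po = purchase_orders.get(fields["po_number"])
    if po is None:
        return ["no matching purchase order"]
    issues = []
    if fields["amount"] != po["amount"]:
        issues.append("amount mismatch")
    if fields["vendor"] != po["vendor"]:
        issues.append("vendor mismatch")
    return issues

def route(issues: list[str]) -> str:
    # Clean invoices go to standard approval; anything flagged goes to a human.
    return "human_review" if issues else "standard_approval"

purchase_orders = {"PO-1042": {"vendor": "Acme Supply", "amount": 1280.00}}
fields = extract_fields("...invoice text...")
print(route(match_purchase_order(fields, purchase_orders)))
```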
Medium readiness: customer support triage
A SaaS company receives 500 support tickets daily. A team reads each ticket, determines the category (billing, technical, feature request, bug report), assigns a priority level, and routes it to the right team. Response time matters. Mis-routing means delays.
Text-based. Yes. Tickets are text.
Structured and repeatable. Mostly. The triage logic follows rules, but edge cases are common. A ticket that sounds like a billing question might actually be a bug.
Modular. Partially. Classification, prioritization, and routing are separable, but they inform each other.
Tight feedback loops. Yes. You can measure routing accuracy and compare AI classifications against human decisions.
Tolerance for error. Moderate. A misrouted ticket adds delay but doesn't cause permanent damage. Priority errors are more consequential.
Existing digital infrastructure. Yes. The ticketing system has an API. Historical data exists for training.
This workflow scores well on most criteria but has a gap in structure: the edge cases where ticket content doesn't map cleanly to categories. The right approach isn't "automate the whole thing" or "don't automate." It's to restructure the triage categories to reduce ambiguity, then automate. Define clearer routing rules. Collapse overlapping categories. Create explicit escalation paths for tickets the AI can't confidently classify. The restructuring makes the workflow more modular, which makes the automation more reliable.
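The escalation path is the part worth making explicit in code. A minimal sketch, assuming the classifier returns a category with a confidence score; the threshold, labels, and stubbed classifier are hypothetical:

```python
# Hypothetical sketch of confidence-based triage with an explicit
# escalation path for tickets the AI can't classify confidently.

CONFIDENCE_THRESHOLD = 0.85  # assumed cutoff; tune it against human decisions

def classify(ticket_text: str) -> tuple[str, float]:
    # Stand-in for the AI classifier; a real system would call a model here
    # and return (category, confidence).
    return ("billing", 0.91)

def triage(ticket_text: str) -> str:
    category, confidence = classify(ticket_text)
    if confidence < CONFIDENCE_THRESHOLD:
        return "human_triage_queue"   # explicit escalation path
    return f"{category}_queue"

print(triage("I was charged twice for my subscription this month."))
```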
Low readiness: quarterly business review preparation
An executive team prepares quarterly business reviews. The process involves gathering performance data from multiple systems, interpreting trends, identifying strategic implications, drafting narratives that connect the data to business priorities, and preparing recommendations for the board.
Text-based. Partially. The final output is text, but the interpretation depends on context, relationships between metrics, and institutional knowledge about what the numbers mean for this specific business.
Structured and repeatable. Barely. The format is consistent but the analysis is different every quarter because the business context changes.
Modular. Not meaningfully. The data gathering is separable, but the interpretation, narrative, and recommendations are deeply intertwined.
Tight feedback loops. No. You learn whether the analysis was good when the board responds, or when the strategic decisions it informed play out over months.
Tolerance for error. Low. A misleading interpretation of performance data could drive the wrong strategic decisions.
Existing digital infrastructure. Partially. Data lives in systems, but the synthesis happens in someone's head and a slide deck.
This workflow is not ready for AI as a primary driver. AI can help with pieces of it: pulling data, generating initial charts, drafting summary paragraphs that a human rewrites. But the core work (interpretation, judgment, narrative) is exactly the kind of messy, context-dependent knowledge work where AI produces output that sounds plausible and misses the point.
The right response is not "wait for better AI." It's to acknowledge that this workflow lives on the human-judgment end of the spectrum and to invest AI resources elsewhere.
The decision framework
Once you've scored your workflows, the scores point to one of three paths.
Automate now. Workflows that score high across most criteria. The technology exists, the workflow is structured, and the feedback loop lets you measure success quickly. These are your first deployments. Pick commercial tools. Get to production. Start learning.
Restructure first, then automate. Workflows that score well on some criteria but have gaps in structure, modularity, or feedback loops. The AI isn't the bottleneck. The workflow is. Fix the process first: make it more modular, define clearer rules, digitize the inputs, establish measurable output criteria. Then automate the restructured version.
Keep human. Workflows that depend on judgment, context, tacit knowledge, or where the feedback loop is too loose to improve the AI over time. These aren't failures. They're accurate assessments. Applying AI here would produce the exact pilot-purgatory pattern that's burning billions across the industry. Redirect those resources to workflows where AI can actually compound.
The hardest part of this framework is the third category. The pressure to "do AI" creates a bias toward automating everything. Resist it. The companies capturing real value from AI (about 3% according to recent research) are the ones that got selective about where to apply it.
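If you want the triage between the three paths to be mechanical rather than a judgment call in a meeting, the scorecard totals can be thresholded. A minimal sketch, continuing the assumed 0–2-per-criterion scale; the cutoffs are illustrative, not prescribed by the framework:

```python
# Minimal sketch: mapping readiness scores to the three paths.
# Assumes each criterion is scored 0 (no), 1 (partial), or 2 (yes);
# the cutoffs below are illustrative, not prescribed.

def recommend(scores: dict[str, int]) -> str:
    total = sum(scores.values())
    # Assumed rule: a feedback loop too loose to measure is treated as a
    # hard blocker, per the "keep human" criteria above.
    if scores["tight_feedback"] == 0 or total <= 5:
        return "keep human"
    if total >= 10:
        return "automate now"
    return "restructure first, then automate"

qbr_prep = {
    "text_based": 1, "structured_repeatable": 0, "modular_steps": 0,
    "tight_feedback": 0, "error_tolerance": 0, "digital_infrastructure": 1,
}
print(recommend(qbr_prep))  # keep human
```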
The dimension most companies skip
Bill Briggs, Deloitte's CTO, surfaced a statistic that should give every executive pause: companies are spending 93% of their AI budgets on technology and only 7% on people. He called it a critical error.
The readiness scorecard above evaluates the workflow. But workflows don't exist in isolation. They're performed by people. And the people dimension has its own readiness criteria that most evaluations ignore.
Do the people doing this work understand what AI will change about their job? Not in the abstract. Specifically. Which steps will AI handle? Which steps will shift to review and oversight? What new skills does that require?
Is there someone who owns the outcome? Not the AI tool. Not the IT team that deployed it. A business owner who ties the AI's output to a measurable result and is accountable for whether it's working. The research is consistent on this: when AI success is measured in model accuracy, it stays in the lab. When it's measured in a metric someone's bonus depends on, it ships to production.
Does the team have space to learn? AI adoption is a change management problem disguised as a technology problem. If the team is running at 100% capacity with no room to experiment, learn, or adjust their process, the AI tool will sit unused. Sanctioned tools, clear guidelines, and dedicated time to develop proficiency aren't luxuries. They're prerequisites.
Any workflow that scores high on the technical criteria but low on people readiness will stall. You'll buy the tool, run the pilot, show promising results, and then watch adoption flatline because nobody on the team has the time, permission, or understanding to make it part of their actual work.
Four traps that undermine the evaluation
Even with a good framework, companies make predictable mistakes. Four show up repeatedly.
Automating the most annoying process
The workflow that generates the most complaints is rarely the highest-value automation target. It's the one that's most visible. The best candidates are often boring, high-volume processes that nobody thinks about because they just get done. Invoice processing. Data entry. Report generation. Document classification. These aren't exciting, and that's the point.
Skipping the restructuring step
When a workflow scores medium on readiness, the temptation is to deploy AI anyway and "iterate." This is how companies end up bolting AI onto broken processes. Only about 5% of organizations have redesigned workflows around AI capabilities. The rest are automating the existing process, inefficiencies included. The restructuring step feels slow. It's faster than running a pilot that fails because the workflow wasn't ready.
Mistaking tool excitement for workflow readiness
A vendor demo looks incredible. The model summarizes documents in seconds, answers questions about your data, generates reports that would take hours to write manually. None of that tells you whether your specific workflow is ready for that tool. The demo uses clean data, clear prompts, and simple use cases. Your workflow has messy data, ambiguous inputs, and edge cases the demo never showed you. Evaluate the workflow first. Evaluate the tool second.
Evaluating in isolation
Workflows connect to each other. Automating invoice processing changes what the accounts payable team does, which changes how they interact with vendors, which changes the data available for financial reporting. Scoring a workflow in isolation misses these downstream effects. The best evaluations map the workflow in context: what feeds into it, what depends on its output, and who's affected by a change.
Start with one
The Navy recently announced a $448 million investment in AI for shipbuilding. The early results tell a useful story: AI reduced submarine schedule planning from 160 manual hours to under 10 minutes. Material review times dropped from weeks to under an hour.
Those results aren't impressive because of the technology. They're impressive because the workflows had the right characteristics. Schedule planning operates on structured data with clear rules and deterministic output criteria. Material review processes documents against defined standards. These workflows scored high on every readiness dimension before anyone deployed an AI tool.
That's the pattern. The companies seeing real returns from AI didn't start with the most ambitious use case or the flashiest technology. They started by asking which workflows were actually ready. Then they picked one, deployed it, measured the result, and moved to the next.
One workflow in production teaches more than ten in pilot. The evaluation is how you pick the right one.