AI Evals: The App-Level vs. Agent-Level “Controversy”

Pick your side?

Jan 05, 2026

Happy New Year! Before the holidays, I wrote an article on AI Evals vs. Analytics, discussing the role of AI evals before release, the strength of analytics after release, and how the two work together to drive successful AI products.

I’m hosting a 30-minute lightning lesson on this topic this Friday, Jan 9, at 8am PT. Sign up here to join hundreds of AI builders to learn how to design evaluation metrics that actually support real AI products.

Our next live cohort of the AI Evals and Analytics Playbook course starts on Jan 17.
Enjoy a 35% discount with code DSxAI.

From Blackbox AI to Agentic Systems

It turns out that Evals vs. Analytics is not the only debate in this space. Another emerging discussion is around app-level vs. agent-level eval metrics.

Before agentic AI, AI models and products were reasonably treated as single-step black boxes. You give the system an input, observe the final output, and evaluate the product based on that output alone.

Agentic systems break this assumption. Two requests can produce the same final answer while following very different internal paths: some efficient and robust, others fragile, expensive, or unsafe. Treating such systems as black boxes creates blind spots and makes debugging significantly harder.

Before jumping into the app-level vs. agent-level discussion, let’s first define what these metrics are.

App-Level vs. Agent-Level Eval Metrics

App-level metrics evaluate the end-to-end behavior of an AI product. They treat the system as a black box and focus on outcomes that matter to users or the business.
App-level metrics answer the question:

Did the user get the right result, quickly and reliably?

Agent-level metrics evaluate internal agent behavior and decision-making. They require visibility into intermediate steps, traces, or tool calls.
Agent-level metrics answer the question:

How did the system arrive at the final output, and can we trust that process?

An Agentic Text-to-SQL Example

So, have app-level eval metrics become irrelevant?

Let’s look at an agentic Text-to-SQL example. This is an internal application built to help non-technical employees self-serve their ad hoc analytics needs.

The system follows the workflow shown in the diagram above:

A Request Intake agent gathers information from the user and ensures the request is well-specified.
A Schema Discovery agent maps the request to the relevant database schema.
A SQL Generation agent produces the query.
An Execution & Validation agent runs the query and checks results.
If execution fails, the request is routed to an Error Correction agent before results are returned to the user.
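The workflow above can be sketched as a small control loop. This is a minimal illustration, not the actual implementation: `run_pipeline`, the `execute` callback, and `max_corrections` are hypothetical names, and the assumption that a corrected query is re-executed in a loop is mine, not something the workflow description states explicitly.

```python
def run_pipeline(request, execute, max_corrections=3):
    """Walk one request through the five-agent workflow and record a trace.

    `execute(sql, attempt)` is a hypothetical stand-in for the
    Execution & Validation agent; it returns True when the query succeeds.
    """
    # The first three agents are stubbed out as trace entries only.
    trace = ["request_intake", "schema_discovery", "sql_generation"]
    sql = f"-- SQL for: {request}"

    for attempt in range(max_corrections + 1):
        trace.append("execution_validation")
        if execute(sql, attempt):
            return {"success": True, "trace": trace, "corrections": attempt}
        if attempt < max_corrections:
            # Assumption: a failed execution is routed to Error Correction,
            # and the corrected query is then executed again.
            trace.append("error_correction")
            sql += f"\n-- corrected (attempt {attempt + 1})"

    return {"success": False, "trace": trace, "corrections": max_corrections}


# A made-up executor that only succeeds on the third try.
flaky = lambda sql, attempt: attempt >= 2
result = run_pipeline("weekly signups by region", flaky)
print(result["success"], result["corrections"])  # True 2
```

The returned `trace` is what makes agent-level evaluation possible: the final `success` flag is an app-level signal, while the sequence of steps shows how the system got there.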

In this agentic design, there are many evaluation signals available, some app-level, others agent-level.

Pick a side?

It’s entirely possible to have two systems that return identical SQL results, while one requires five retries and the other succeeds on the first attempt. App-level metrics alone cannot distinguish between them.
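To make this concrete, here is a small sketch of two runs that look identical from the outside. The `RunRecord` type, the `app_level_view` helper, and the result values are made up for illustration; only the "five retries vs. first attempt" contrast comes from the text.

```python
from dataclasses import dataclass


@dataclass
class RunRecord:
    result_rows: tuple  # final query result returned to the user
    correct: bool       # app-level: did the user get the right answer?
    retries: int        # agent-level: error-correction passes needed


def app_level_view(run):
    """Everything a black-box evaluation can see about a run."""
    return (run.result_rows, run.correct)


# Made-up runs: same answer, very different paths.
system_a = RunRecord(result_rows=(("2025-Q4", 1200),), correct=True, retries=5)
system_b = RunRecord(result_rows=(("2025-Q4", 1200),), correct=True, retries=0)

same_outcome = app_level_view(system_a) == app_level_view(system_b)
retry_gap = system_a.retries - system_b.retries
print(same_outcome, retry_gap)  # True 5
```

The app-level comparison reports no difference, while the agent-level field exposes a five-retry gap.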

At this point, focusing on agent-level metrics may feel very tempting.

But no, you need both, depending on the use case.

Different Use Cases Require Different Metrics

In practice, metric choice should be driven by your use case and the decision you are trying to make.

Development and Iteration

(Evaluation-Driven Development)

Primary question: Where should we improve?

During development, agent-level metrics are far more useful than app-level metrics. Early systems often have sparse or noisy end-to-end success signals, while most improvements are local and behavioral.

In a text-to-SQL system:

Task success may remain at 70%
But schema hallucination drops from 20% to 5%
Retry count drops from four to one

App-level metrics barely move, yet the system becomes dramatically more robust. Agent-level metrics guide development by revealing where progress is actually being made.
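A side-by-side comparison of two versions might look like the sketch below. The `EvalRow` type and helper functions are hypothetical, and the rows are synthetic, constructed only to reproduce the illustrative numbers above (70% success, 20% to 5% schema hallucination, four retries to one).

```python
from dataclasses import dataclass


@dataclass
class EvalRow:
    task_succeeded: bool       # app-level signal
    schema_hallucinated: bool  # agent-level signal
    retries: int               # agent-level signal


def summarize(rows):
    """Aggregate one eval run into app-level and agent-level metrics."""
    n = len(rows)
    return {
        "task_success_rate": sum(r.task_succeeded for r in rows) / n,
        "schema_hallucination_rate": sum(r.schema_hallucinated for r in rows) / n,
        "mean_retries": sum(r.retries for r in rows) / n,
    }


def make_rows(n, n_success, n_hallucinated, retries):
    """Build synthetic rows with the requested counts (illustration only)."""
    return [EvalRow(i < n_success, i < n_hallucinated, retries) for i in range(n)]


baseline = summarize(make_rows(20, n_success=14, n_hallucinated=4, retries=4))
candidate = summarize(make_rows(20, n_success=14, n_hallucinated=1, retries=1))

for name in baseline:
    print(f"{name}: {baseline[name]:.2f} -> {candidate[name]:.2f}")
```

Reading only the first row of that report would suggest nothing changed; the other two rows show where the iteration actually paid off.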

CI/CD and Guardrails

Primary question: Did anything change in an unintended way?

CI/CD evaluation is about stability, regression detection, and guardrails.

In text-to-SQL:

The query still returns correct results
But token usage doubles due to a new error-correction loop

Agent-level metrics catch this early, before it turns into a production cost, latency, or reliability issue.
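In a CI pipeline, a check like this could be a simple ratio test against a stored baseline. The sketch below is one possible shape, not a prescribed one: `check_token_regression`, the 1.2x tolerance, and the token counts are all assumptions chosen for illustration.

```python
def check_token_regression(baseline_tokens, candidate_tokens, max_ratio=1.2):
    """Compare mean tokens per request between two eval runs.

    Returns (passed, ratio). `max_ratio` is a hypothetical tolerance:
    the candidate may use at most 20% more tokens than the baseline.
    """
    base_mean = sum(baseline_tokens) / len(baseline_tokens)
    cand_mean = sum(candidate_tokens) / len(candidate_tokens)
    ratio = cand_mean / base_mean
    return ratio <= max_ratio, ratio


# Made-up per-request token counts; the candidate doubles usage.
baseline_run = [1000, 1200, 800]
candidate_run = [2000, 2400, 1600]

passed, ratio = check_token_regression(baseline_run, candidate_run)
print(passed, ratio)  # False 2.0
```

The same pattern extends to other agent-level signals such as retry count or tool-call count, each with its own tolerance.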

Pre-Release Pass/Fail Gates

Primary question: Is this version acceptable to ship?

Pre-release decisions are product decisions, which makes app-level metrics the right primary signal.

For a text-to-SQL system, a release gate may require:

≥95% task success rate
Latency within SLA
Error rate below threshold

Internal inefficiencies may still exist, but if they do not violate user-visible constraints, the version can ship. Pre-release gates are about outcomes, not internal elegance.
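A release gate of this kind can be expressed as a handful of threshold checks on app-level metrics. In the sketch below, only the 95% task success bar comes from the list above; the `ReleaseGate` class, the latency and error-rate thresholds, and the candidate metrics are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ReleaseGate:
    min_task_success: float = 0.95   # the 95% bar from the list above
    max_p95_latency_s: float = 10.0  # hypothetical SLA
    max_error_rate: float = 0.02     # hypothetical threshold

    def failures(self, metrics):
        """Return the names of violated constraints (empty means ship)."""
        failed = []
        if metrics["task_success_rate"] < self.min_task_success:
            failed.append("task_success_rate")
        if metrics["p95_latency_s"] > self.max_p95_latency_s:
            failed.append("p95_latency_s")
        if metrics["error_rate"] > self.max_error_rate:
            failed.append("error_rate")
        return failed


gate = ReleaseGate()

# Made-up candidate: meets every user-visible bar despite a high retry count.
candidate = {
    "task_success_rate": 0.96,
    "p95_latency_s": 7.5,
    "error_rate": 0.01,
    "mean_retries": 3.2,  # agent-level; deliberately ignored by the gate
}
print(gate.failures(candidate))  # []
```

The gate never reads `mean_retries`, which mirrors the point above: internal inefficiency does not block a release unless it surfaces in a user-visible metric.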

In-Production Monitoring

Primary question: Is the system still behaving acceptably in the real world?

In production, app-level metrics are critical for detecting hard failures and user impact:

Latency spikes
Error rate increases
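One way to watch for both signals is a rolling window over recent requests. The sketch below is a minimal illustration under stated assumptions: `RollingMonitor`, the window size, and both thresholds are hypothetical, and a real deployment would more likely rely on an existing observability stack.

```python
import math
from collections import deque


class RollingMonitor:
    """Track recent requests and flag latency or error-rate violations."""

    def __init__(self, window=100, max_p95_latency_s=10.0, max_error_rate=0.05):
        self.events = deque(maxlen=window)  # oldest events drop off automatically
        self.max_p95_latency_s = max_p95_latency_s
        self.max_error_rate = max_error_rate

    def record(self, latency_s, is_error):
        self.events.append((latency_s, is_error))

    def alerts(self):
        if not self.events:
            return []
        latencies = sorted(latency for latency, _ in self.events)
        # Nearest-rank 95th percentile over the current window.
        p95 = latencies[max(0, math.ceil(0.95 * len(latencies)) - 1)]
        error_rate = sum(is_error for _, is_error in self.events) / len(self.events)

        found = []
        if p95 > self.max_p95_latency_s:
            found.append("latency")
        if error_rate > self.max_error_rate:
            found.append("error_rate")
        return found


monitor = RollingMonitor(window=10)
for _ in range(10):
    monitor.record(latency_s=1.0, is_error=False)
healthy = monitor.alerts()

monitor.record(latency_s=30.0, is_error=True)  # one slow, failed request
degraded = monitor.alerts()
print(healthy, degraded)  # [] ['latency', 'error_rate']
```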

However, app-level metrics alone cannot capture every signal in production. This is where product analytics comes in.

As discussed in my earlier article on AI Evals vs. Analytics, evals help define expected behavior, while analytics reveal real-world usage patterns. Used together, they enable effective monitoring and diagnosis in production.

The Unifying Principle

App-level metrics and agent-level metrics serve different roles:

Agent-level metrics optimize the system
App-level metrics validate the product

As AI systems become more agentic, evaluation must become more layered. Using the wrong metrics for a given decision leads to noisy signals, slow iteration, and misaligned incentives.

The goal is not to choose one level of evaluation, but to use the right level at the right time.

If you are building real AI products, the question is no longer whether to invest in AI evals, but how to design them so they meaningfully support product decisions.

If this topic resonates, the next live cohort of AI Evals and Analytics Playbook starts on Jan 17. The course focuses on designing evaluation systems for real-world AI products: before release, after release, and across the full product lifecycle.

Data Science x AI
