Why AI Agent Evals Belong in the Harness, Not a Dashboard

David Alford | May 22, 2026 | 12 min read

Cursor published a number in late 2025 that should have ended the debate over whether the model or the harness makes an AI agent reliable. Same Claude model, two different agent harnesses, and the score on their internal benchmark moved from 46% to 80%. A 34-point swing on identical model weights. The harness was doing the work, not the model.[1]

If that’s true, and the harness rather than the model determines whether a production agent is reliable, then AI agent evals belong inside the harness. Not in a separate dashboard. Not in a CI step that runs in your customer’s repo. Inside the same runtime that ships the agent, with the authority to block a deploy when a workflow version regresses against the dataset that grew out of last quarter’s customer complaints. That’s the bet we made when we built the TaskJuice eval system, and the reason this post exists is that we keep getting asked why we didn’t just bolt on Braintrust like everyone else.

What Are AI Agent Evals?

An AI agent eval is a test that scores an agent’s behavior on a fixed input against a known expectation. It evaluates the full trajectory: which tools the agent called, what arguments it passed, what output it produced, and whether the result satisfies a rubric. Model benchmarks measure raw capability. Agent evals measure what your agent actually did when it ran.

A working agent eval scores some combination of the following:

Trajectory. Did the agent’s path through the task match the expected sequence of steps, or did it loop, give up, or take a shortcut that produced the right answer for the wrong reason?
Tool-call correctness. Were the right tools called, in the right order, with arguments that match an expected shape? This is the part deterministic scorers handle well, with no LLM-as-a-judge call needed.
Output schema. Did the structured output match the contract the downstream system expects? Three extra words in a system prompt can break this without breaking anything visible upstream.
Rubric-graded quality. For open-ended outputs, an LLM-as-a-judge or a human grader scores the result on a multi-metric rubric. Inter-rater agreement is itself measurable and worth tracking.
Regression against a baseline. Did this run score worse than a rolling window of recent passing runs? This is the metric that catches silent drift, and it’s the metric most bolted-on tools surface late.
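
To make the tool-call item above concrete, here is a minimal sketch of a deterministic scorer in Python. It is an illustration, not TaskJuice's implementation: the `ToolCall` type, the `is_subset` helper, and the tool names in the example are all hypothetical. It checks that the same tools were called in the same order and that each call carries at least the expected arguments.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToolCall:
    """Hypothetical record of one tool invocation."""
    name: str
    args: dict[str, Any] = field(default_factory=dict)


def is_subset(expected: Any, actual: Any) -> bool:
    """True if every key/value in `expected` also appears in `actual`, recursively."""
    if isinstance(expected, dict):
        return isinstance(actual, dict) and all(
            key in actual and is_subset(value, actual[key])
            for key, value in expected.items()
        )
    return expected == actual


def tool_calls_match(expected: list[ToolCall], actual: list[ToolCall]) -> bool:
    """Same tools, same order, and each actual call contains the expected arguments."""
    if len(expected) != len(actual):
        return False
    return all(
        want.name == got.name and is_subset(want.args, got.args)
        for want, got in zip(expected, actual)
    )


expected = [
    ToolCall("lookup_order", {"order_id": "A-100"}),
    ToolCall("issue_refund", {"order_id": "A-100"}),
]
actual = [
    ToolCall("lookup_order", {"order_id": "A-100", "verbose": True}),
    ToolCall("issue_refund", {"order_id": "A-100", "amount": 25}),
]
print(tool_calls_match(expected, actual))                  # True
print(tool_calls_match(expected, list(reversed(actual))))  # False: wrong order
```

Subset matching is a design choice in this sketch: extra arguments are tolerated so that harmless additions do not fail the eval, while a missing or changed expected argument does.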

A model eval tells you whether the model can answer the question in principle. An agent eval tells you whether your agent answered it in production. The two measure different things, and a passing model benchmark says almost nothing about whether your agent shipped a correct refund to the right customer this morning.

Bolted-On Eval Tools Observe. They Don’t Block.

Bolted-on eval tools are dashboards. They sit outside the runtime, ingest your agent’s traces over the wire, and surface scores after the fact. Braintrust, LangSmith, Arize, Langfuse, Phoenix, Humanloop, DeepEval, and Promptfoo all share this shape. They’re observatories. They’re useful. But by the time the score shows up on the dashboard, the deploy already shipped.

The natural counter is that all of these tools support some flavor of CI integration. Braintrust has a GitHub Action that runs evals on every pull request and posts results as a comment. That works fine when there’s one engineering team, one repo, and one deploy pipeline. It doesn’t work when you’re a platform running agents on behalf of a hundred different agencies, each with their own workflows, their own clients, and no shared CI. The gate has to live where the agents actually deploy. For TaskJuice that’s the publish step on a workflow version, not a customer’s GitHub Action.

A line we keep coming back to, from a Dev.to piece titled "Why your agent’s eval suite won’t catch production failures": "Eval suites are snapshots while production is a stream." A dashboard tells you the snapshot you took two days ago looks fine. A gate refuses to let the next snapshot ship until it scores as well as the last one. Both are real tools. They do different jobs.[2]

The Real Failure Mode Is Silent Prompt Drift

Prompt drift is the failure mode that kills production agents, and it’s almost invisible from the outside. Comet defines it as "a slow, almost invisible degradation in performance that eventually tips into sharp, painful failure," caused by "accumulated small, unrecorded prompt changes over time." The drift compounds across edits nobody flagged as risky.[3]

Latitude documented a postmortem pattern that maps onto half the agent regressions I’ve seen in the wild. Three words added to a system prompt to make the agent’s tone more conversational. Structured-output error rates spiked within hours. Revenue-generating workflows halted until engineers manually rolled back. The dashboard caught it eventually, after the customer complaints came in. A bolted-on eval tool would have scored the new prompt the next day on its scheduled run, by which point the rollback had already happened.[4]

What catches drift before it ships is a gate, not a graph. Score the new prompt against the same dataset the old prompt cleared. If the score drops below the regression threshold, refuse to publish. Force the operator to look at the failing examples, decide whether the drop is intentional, and either fix the prompt or update the baseline on the record. This is harder to retrofit than it sounds, because the gate needs to live inside the deploy primitive itself, which is the part the eval tool vendors don’t own.
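
The gate logic described above can be sketched in a few lines of Python. This is an assumption-laden illustration rather than the TaskJuice gate: the `regression_gate` function, the `GateDecision` type, the 0.05 default threshold, and the example IDs are all made up for the example. It compares per-example scores from the old and new prompt on the same dataset, refuses to publish when the average drops too far, and lists the examples that got worse so the operator can review them.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class GateDecision:
    publish: bool
    baseline_score: float
    candidate_score: float
    failing_examples: list[str]


def regression_gate(
    baseline: dict[str, float],
    candidate: dict[str, float],
    max_drop: float = 0.05,  # hypothetical threshold; tune per suite
) -> GateDecision:
    """Compare per-example scores (0.0 to 1.0) for two prompts on one dataset."""
    if baseline.keys() != candidate.keys():
        raise ValueError("both prompts must be scored on the same dataset examples")
    baseline_score = mean(baseline.values())
    candidate_score = mean(candidate.values())
    # Examples that got worse are what the operator reviews before overriding.
    failing = sorted(key for key in baseline if candidate[key] < baseline[key])
    publish = candidate_score >= baseline_score - max_drop
    return GateDecision(publish, baseline_score, candidate_score, failing)


old_prompt = {"refund-01": 1.0, "refund-02": 1.0, "tone-01": 0.8}
new_prompt = {"refund-01": 1.0, "refund-02": 0.0, "tone-01": 0.9}
decision = regression_gate(old_prompt, new_prompt)
print(decision.publish)           # False
print(decision.failing_examples)  # ['refund-02']
```

Note that the new prompt improved on one example and still fails the gate, which is the point: the operator has to look at `refund-02` and either fix the prompt or accept the drop explicitly.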

Evals Are the CI Gate of 2026

Treating evals as the CI gate for AI is no longer a niche opinion. Mike Krieger, Anthropic’s CPO, said in a panel with Kevin Weil and Sarah Guo that "if there is one thing we can teach people, it’s that writing evals is probably the most important thing." Greg Brockman has called evals "the most overlooked skill in machine learning." Garry Tan, in a tweet that traveled wider than most YC essays: "evals are emerging as the real moat for AI startups." Three of the most cited voices in the field, three different framings, same conclusion.[5][6]

Anthropic’s engineering team formalized the practice as eval-driven development: "build evals to define planned capabilities before agents can fulfill them, then iterate until the agent performs well." Their floor for getting started is concrete: "20 to 50 simple tasks drawn from real failures is a great start. Evals get harder to build the longer you wait."[7]

Hamel Husain pushes back on the strict TDD-style reading, and the steelman is worth taking seriously. Don’t write evaluators for failures you imagine. Write them for failures you’ve already discovered. We agree with that version. The golden datasets we recommend agencies build inside TaskJuice are seeded from the failures their own agents have shipped, not from speculative edge cases an engineer dreamed up on a Tuesday.[8]

The synthesis we land on: evals belong in the development loop, and for an agent platform, the development loop is the publish step on a workflow version. That’s where eval-driven discipline turns into a hard gate, instead of a habit the team has every intention of keeping right up until the demo.

Per-Node and Per-Tenant: Two Properties We Refused to Compromise On

TaskJuice evals are scoped two ways that most off-the-shelf eval tools cannot match. They run per node, meaning a suite can target a single LLM step inside a larger workflow and replay only that step in isolation. And they run per tenant, meaning every suite, dataset, and run is scoped to a single workspace at the access-control layer, with no cross-tenant leakage possible.

Per-node evals matter because real workflows are not single-step agent loops. A typical TaskJuice workflow has deterministic transformation nodes, integration nodes that hit external APIs, and one or two AI nodes that classify, extract, or generate. The AI nodes are where evals belong. When an operator runs a per-node suite, the runtime replays just that one step against synthetic inputs from the dataset, with a guard middleware that blocks any downstream tool side effects. The rest of the workflow doesn’t fire. No emails get sent. No Slack messages get posted. No records get written to a customer’s CRM. Just the LLM call, scored against the rubric.
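
One way to picture the guard is as a flag that side-effecting tools consult before they run. The Python sketch below is a hypothetical illustration, not the TaskJuice middleware: `eval_guard`, `side_effect`, `SideEffectBlocked`, and `replay_node` are invented names, and raising an error (rather than silently skipping the call) is an assumption about how blocking might behave.

```python
from collections.abc import Iterator
from contextlib import contextmanager
from contextvars import ContextVar
from functools import wraps
from typing import Any, Callable

# True only while an eval replay is in progress (hypothetical flag).
_eval_mode: ContextVar[bool] = ContextVar("eval_mode", default=False)


class SideEffectBlocked(RuntimeError):
    """Raised when a side-effecting tool is invoked during an eval replay."""


@contextmanager
def eval_guard() -> Iterator[None]:
    """Turn the guard on for the duration of a `with` block."""
    token = _eval_mode.set(True)
    try:
        yield
    finally:
        _eval_mode.reset(token)


def side_effect(tool: Callable[..., Any]) -> Callable[..., Any]:
    """Mark a tool as touching the outside world so the guard can refuse it."""
    @wraps(tool)
    def guarded(*args: Any, **kwargs: Any) -> Any:
        if _eval_mode.get():
            raise SideEffectBlocked(f"{tool.__name__} blocked during eval replay")
        return tool(*args, **kwargs)
    return guarded


sent: list[str] = []  # stands in for a real email provider


@side_effect
def send_email(to: str) -> None:
    sent.append(to)


def replay_node(llm_call: Callable[[str], str], prompt: str) -> str:
    """Run only the model call for one node, with side effects disabled."""
    with eval_guard():
        return llm_call(prompt)


# The model call fires; any tool marked with @side_effect would raise instead of running.
label = replay_node(lambda prompt: "billing", "Classify: where is my refund?")
```

Failing loudly is a deliberate choice in this sketch: a replay that tries to send an email is a bug in the suite's scoping, and an exception surfaces it instead of hiding it.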

Per-tenant evals matter because TaskJuice is a multi-tenant platform from the floor up. Every eval suite, dataset, and run carries a required tenant relationship, enforced by the same access-control rule we use for every other workspace-owned resource. An agency’s golden dataset for their finance client cannot be queried by another agency. Their publish-gate verdict cannot be activated by another tenant’s run. The idempotency claim keys are namespaced by tenant ID before anything else.
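
A minimal Python sketch of tenant-first claim keys follows. The `ClaimStore` class is an in-memory stand-in for whatever store provides an atomic insert-if-absent, and `gate_claim_key` and the IDs are hypothetical; none of this is taken from the TaskJuice codebase. The key is a tuple with the tenant ID in first position, so two tenants can never collide even if they reuse the same gate ID.

```python
import threading

ClaimKey = tuple[str, str, str]


class ClaimStore:
    """In-memory stand-in for a store with an atomic insert-if-absent operation."""

    def __init__(self) -> None:
        self._claimed: set[ClaimKey] = set()
        self._lock = threading.Lock()

    def claim(self, key: ClaimKey) -> bool:
        """Return True only for the first caller that claims `key`."""
        with self._lock:
            if key in self._claimed:
                return False
            self._claimed.add(key)
            return True


def gate_claim_key(tenant_id: str, publish_gate_id: str) -> ClaimKey:
    # Tenant ID comes first, so every key lives inside one tenant's namespace.
    return (tenant_id, "publish-gate", publish_gate_id)


store = ClaimStore()
print(store.claim(gate_claim_key("agency-a", "gate-1")))  # True: first claim wins
print(store.claim(gate_claim_key("agency-a", "gate-1")))  # False: retry is a no-op
print(store.claim(gate_claim_key("agency-b", "gate-1")))  # True: different tenant
```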

Both properties are weak on their own. Per-node without tenant scoping leaks agency data. Per-tenant without per-node forces you to score the whole workflow every time, which is slow, expensive, and noisy. Together, they let an agency operator lock down quality on the single AI step that matters for one client’s workflow, without that work being visible or accidentally reusable across the rest of their book of business.

If You Run an AI Automation Agency, This Is the Moat

If you run an AI automation agency, your eval set per client per workflow version is the asset you’re actually selling. Not the workflow. Not the prompt. The accumulated knowledge of what "correct" looks like for that specific client’s edge cases, encoded as a dataset that gates every deploy. It’s the deliverable nobody else can rebuild from your final invoice.

Garry Tan’s "evals are the moat" thesis was written about AI startups, and it applies more sharply to agencies. A startup with one product has one golden dataset. An agency with twenty clients has twenty, and each one represents weeks of error analysis with that client’s domain experts. Their refund policy. Their tone-of-voice rules. The five categories their support tickets actually fall into versus the eleven categories they told you about in onboarding. None of that information existed before you wrote it down as eval examples.

Bolted-on eval SaaS cannot own this asset for you. They don’t have your tenant graph. They don’t know which client a given prompt belongs to. They can score outputs, and they can show you graphs, but they can’t gate deploys per client per workflow version because they’re not inside the deploy. When you stop paying them, the moat goes with them.

This is why TaskJuice’s eval system lives where the workflows live. Every agency that ships a workflow on TaskJuice accumulates a per-client eval set as a side effect of the work they’re already doing. The dataset grows from the failures their agents have actually shipped. The operator is the agency’s own engineer. The gate sits in front of the publish button their team already uses. The moat is the byproduct of doing the work, not a second SaaS subscription.

How the Eval Gate Actually Works

When an operator hits publish on a TaskJuice workflow version, the platform looks for eval suites with their block-publish flag set. If any exist, the version enters a pending-eval state instead of going live. The orchestrator dispatches every example across every blocking suite. Judges score the outputs. A regression detector compares the aggregate to a rolling baseline. The workflow version only activates when every suite returns a passing verdict with no error-severity regression.
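
The activation rule in the paragraph above can be sketched as follows. This is a simplified Python illustration under stated assumptions: the `SuiteResult` shape, the window of five runs, and the 0.02 and 0.05 drop thresholds are invented for the example, and the baseline here is a plain mean of recent passing scores, which may differ from how the real detector aggregates.

```python
from dataclasses import dataclass
from enum import Enum
from statistics import mean


class Severity(Enum):
    NONE = "none"
    WARNING = "warning"
    ERROR = "error"


@dataclass
class SuiteResult:
    name: str
    score: float                 # aggregate score for this run, 0.0 to 1.0
    pass_threshold: float        # minimum score for a passing verdict
    recent_passing: list[float]  # scores of recent passing runs, newest last


def regression_severity(
    result: SuiteResult,
    window: int = 5,           # hypothetical rolling-window size
    warn_drop: float = 0.02,   # hypothetical warning band
    error_drop: float = 0.05,  # hypothetical error band
) -> Severity:
    history = result.recent_passing[-window:]
    if not history:
        return Severity.NONE  # no baseline yet, nothing to regress against
    drop = mean(history) - result.score
    if drop >= error_drop:
        return Severity.ERROR
    if drop >= warn_drop:
        return Severity.WARNING
    return Severity.NONE


def can_activate(results: list[SuiteResult]) -> bool:
    """Go live only if every blocking suite passes with no error-severity drop."""
    return all(
        r.score >= r.pass_threshold
        and regression_severity(r) is not Severity.ERROR
        for r in results
    )


steady = SuiteResult("classifier", 0.90, 0.80, [0.91, 0.92, 0.90])
drifted = SuiteResult("extractor", 0.82, 0.80, [0.95, 0.94, 0.96])
print(can_activate([steady]))           # True
print(can_activate([steady, drifted]))  # False: passes threshold, fails regression
```

The `drifted` suite shows why the baseline matters: its score clears the absolute threshold, so a threshold-only check would let it ship.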

Concretely, the moving parts are:

Type-safe judge kinds. Four judge kinds are enumerated at the type level: a rubric-based LLM-as-a-judge, a trajectory judge for multi-turn agents, deterministic threshold metrics, and a tool-call-precision scorer that compares actual tool invocations against an expected sequence with exact, JSON-subset, or regex match modes. Adding a new kind surfaces as a type error everywhere a judge is consumed, so the runtime cannot silently ship a half-wired scorer.
Per-node or whole-workflow scope. A suite carries an optional target-node ID. If set, the suite is per-node and replays just that step. If absent, it scopes to the entire workflow. Both modes share the same datasets, the same judge contracts, and the same publish gate.
Content-addressed prompt hashing. Every prompt is hashed with SHA-256 over a canonical representation that ignores variable ordering and preserves body whitespace. A prompt change produces a new hash, so a regression is always traceable to a specific version, and a rollback resolves to a single content address rather than a fuzzy revert.
Eval guard middleware. During an eval run, the runtime activates a middleware that blocks downstream tool side effects. The LLM call fires and the scorer runs, but no email gets sent, no record gets written, no Slack message gets posted. This is what makes it safe to replay real production prompts on real client data.
Rolling baseline regression detection. A passing run is compared against a configurable window of recent passing runs for the same suite and the same node. The detector emits a severity band. Error-severity drops block publish. Warning-severity drops surface to the operator without halting the deploy, so the human gets to decide what to do about it.
Idempotency lifecycle on the gate. Gate activation is claimed once per tenant ID and publish-gate ID, so concurrent retries cannot double-activate a workflow version. The same pattern guards every irreversible step in the run lifecycle, which is the only sane way to operate a runtime where any handler can fire twice.
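
As an illustration of the content-addressed hashing item above, here is a minimal Python sketch. The canonical form is an assumption: it treats a prompt as a body string plus a list of declared variable names, sorts the variables, and serializes both as compact JSON before hashing. The `prompt_hash` function and that representation are hypothetical; only the use of SHA-256, order-insensitive variables, and preserved body whitespace come from the description.

```python
import hashlib
import json


def prompt_hash(body: str, variables: list[str]) -> str:
    """SHA-256 over a canonical form: variables sorted, body kept character for character."""
    canonical = json.dumps(
        {"body": body, "variables": sorted(variables)},
        sort_keys=True,          # stable key order
        separators=(",", ":"),   # no incidental whitespace in the envelope
        ensure_ascii=False,      # hash the UTF-8 bytes of the actual characters
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


a = prompt_hash("Reply to {name} in a {tone} tone.", ["name", "tone"])
b = prompt_hash("Reply to {name} in a {tone} tone.", ["tone", "name"])
c = prompt_hash("Reply to {name} in a  {tone} tone.", ["name", "tone"])
print(a == b)  # True: variable order does not change the address
print(a == c)  # False: one extra space in the body is a new version
```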

None of these are individually unprecedented. The combination, scoped per node and per tenant and gating the publish step itself, is what turns a dashboard into a gate.

Frequently Asked Questions

What is an AI agent eval?

An AI agent eval is a test that scores an agent’s behavior on a fixed input against a known expectation. It evaluates the full trajectory: which tools were called, with what arguments, in what order, and whether the final output satisfies a rubric. The most common shapes are LLM-as-a-judge for qualitative grading, deterministic metrics for structured-output correctness, and tool-call-precision scorers for action sequences.

What’s the difference between agent evals and model evals?

Model evals measure a foundation model’s raw capability on standardized tasks like MMLU, HumanEval, or SWE-bench. They run against the model in isolation. Agent evals measure what your agent did when it was wired into a runtime with tools, prompts, retries, and your business logic. A model that scores 90% on a benchmark can still ship the wrong refund to the wrong customer if the agent harness around it is broken. Agent evals catch the second case. Model evals don’t.

What is eval-driven development?

Eval-driven development is the practice of writing evaluators alongside the agent code they grade, then using those evaluators as the primary signal for whether a change is ready to ship. Anthropic’s engineering team formalized the phrasing in 2025. Braintrust, Red Hat, and others have published reference architectures around the idea. The intent is to make the eval the contract, the way unit tests are the contract for traditional code, while accepting that eval grades are continuous scores rather than binary pass-fail.

Are AI evals the same as unit tests?

They share a job, which is keeping bad code from shipping. The shape is different. Unit tests are binary: pass or fail. Evals are scored: 87% this week, 71% next week, with a regression band that says whether the drop crosses a threshold. Unit tests assume deterministic code. Evals assume stochastic outputs and grade across a dataset large enough to wash out the noise. The discipline rhymes. The data model doesn’t.

The previous post in this series argued that the agent harness is the product, not the model. The eval gate is the part of the harness that makes that argument defensible. Without it, every other piece of harness engineering, the metering and the audit logs and the compaction and the prompt versioning, is a story you tell without proof. With it, the proof ships every time a workflow version reaches the publish step and survives the gate.

TaskJuice runs agents on behalf of a lot of different agencies, each with their own clients and their own definitions of correct. Building evals into the harness, scoped per node and per tenant, is the only architecture that lets us host that complexity without one client’s regression showing up in another client’s bill. The model will keep improving. The harness will keep absorbing the complexity the model can’t. The eval gate is where we get to prove the harness did its job.

References

[1] Continually Improving the Cursor Agent Harness, Cursor: cursor.com/blog/continually-improving-agent-harness

[2] Why Your Agent’s Eval Suite Won’t Catch Production Failures, Devon Akelley: dev.to/devonakelley/why-your-agents-eval-suite-wont-catch-production-failures-32ip

[3] Prompt Drift, Comet: comet.com/site/blog/prompt-drift/

[4] Prompt Rollback in Production Systems, Latitude: latitude.so/blog/prompt-rollback-in-production-systems

[5] A Conversation with Kevin Weil, Mike Krieger, and Sarah Guo, Lenny’s Podcast (summary): getrecall.ai/summary/ai/a-conversation-with-kevin-weil-openai-cpo-mike-krieger-anthropic-cpo-sarah-guo-conviction

[6] Evals Are Emerging as the Real Moat for AI Startups, Garry Tan: x.com/garrytan/status/1892952656940880036

[7] Demystifying Evals for AI Agents, Anthropic: anthropic.com/engineering/demystifying-evals-for-ai-agents

[8] Should I Practice Eval-Driven Development? Hamel Husain: hamel.dev/blog/posts/evals-faq/should-i-practice-eval-driven-development.html