Entry
Evaluating Agents When the Demo Always Passes
Checking an agent's final answer tells you almost nothing. You have to score the whole trajectory: tool choice, arguments, step count, cost, policy. How to build evals that catch real failures.
The demo always passes. That is what a demo is for. The interesting question is what happens across the thousand runs nobody watched, and you cannot answer it by eyeballing a few transcripts and deciding they look good.
So you need evaluation. And the way most teams evaluate an agent (check whether the final answer was right) misses almost everything that goes wrong with agents. Evals are the part of the system around the model that tells you whether the other four layers are actually working, and they have to look at more than the last message.
An agent can reach the right answer the wrong way, and the wrong way is what bites you later.
It can call a tool with malformed arguments, get an error, flail for six steps, stumble into the right result, and your final-answer check gives it a green tick. Looks like a pass. It cost five times the tokens it should have, took thirty seconds instead of three, and the only reason it recovered was luck that will not hold next time. Grade only the destination and you are blind to the agent that arrives at the right place by falling down the stairs.
The opposite also happens: the agent does everything correctly and the final answer is wrong because of a bug in a tool. Grade only the answer and you blame the model for the tool's mistake.
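The thing worth evaluating is the path, not the endpoint. For each run, I want to know: whether it chose the right tool, whether the arguments were valid, how many steps it took, what it cost, and whether it stayed within policy.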
Each of these maps to a real production failure. Wrong-tool selection means your tool descriptions are unclear. Invalid arguments mean the same. Step-count blowups mean the agent is looping. Cost spikes mean someone gets a surprising bill. Policy violations mean the guardrails leaked. A trajectory eval turns "the agent feels flaky" into a specific layer you can go fix.
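A minimal sketch of what scoring the path could look like, assuming each run is logged as a list of steps. The `Step` and `Expectation` types, the tool names, and the budgets are all hypothetical; the point is only that each dimension gets its own flag rather than being folded into one verdict.

```python
from dataclasses import dataclass


@dataclass
class Step:
    tool: str          # name of the tool the agent called
    args_valid: bool   # whether the arguments passed the tool's schema check
    tokens: int        # tokens spent on this step


@dataclass
class Expectation:
    expected_tools: set[str]    # tools that make sense for this task (assumed)
    forbidden_tools: set[str]   # tools the agent must never call here (assumed)
    max_steps: int
    max_tokens: int


def score_trajectory(steps: list[Step], expect: Expectation) -> dict[str, bool]:
    """Return one pass/fail flag per dimension instead of a single verdict."""
    return {
        "tool_choice": all(s.tool in expect.expected_tools for s in steps),
        "arguments": all(s.args_valid for s in steps),
        "step_count": len(steps) <= expect.max_steps,
        "cost": sum(s.tokens for s in steps) <= expect.max_tokens,
        "policy": not any(s.tool in expect.forbidden_tools for s in steps),
    }


# A run that reached the right answer after one malformed call and six retries.
steps = [Step("lookup_order", False, 900)] + [Step("lookup_order", True, 900)] * 6
expect = Expectation({"lookup_order"}, {"delete_order"}, max_steps=3, max_tokens=2000)
print(score_trajectory(steps, expect))
```

A final-answer check would pass this run. The per-dimension flags fail it on arguments, step count, and cost, while tool choice and policy stay green, which tells you where to look.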
The usual order is: ship the agent, watch it fail in a way nobody predicted, then build an eval for that specific failure. By then the failure has already cost you something.
Better to build the suite in phase one, before failures teach you what to measure, because you already know the failure modes from every other agent you have built. You know it will pick the wrong tool. You know it will loop. You know it will eventually try to do the thing you told it never to do. Write the cases for those now. Then every time production surprises you, the new failure becomes a permanent case, and the suite gets sharper while the agent runs.
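As a rough illustration, a phase-one suite might start as three cases for the failure modes you already expect, with room to append more. Everything here is an assumption for the sake of the sketch: a trajectory is reduced to the ordered list of tool names, and the prompts and tool names (`lookup_order`, `delete_order`) are invented.

```python
from dataclasses import dataclass
from typing import Callable

# Simplification: a trajectory is just the ordered list of tool names called.
Trajectory = list[str]


@dataclass(frozen=True)
class Case:
    name: str
    prompt: str
    check: Callable[[Trajectory], bool]   # True means the run passed


def no_loop(max_repeats: int) -> Callable[[Trajectory], bool]:
    """Pass if no tool is called more than max_repeats times in a row."""
    def check(traj: Trajectory) -> bool:
        run, prev = 0, None
        for tool in traj:
            run = run + 1 if tool == prev else 1
            prev = tool
            if run > max_repeats:
                return False
        return True
    return check


# Cases written before launch, one per failure mode you already expect.
SUITE: list[Case] = [
    Case("picks_lookup_tool", "Where is order 1234?", lambda t: "lookup_order" in t),
    Case("does_not_loop", "Where is order 1234?", no_loop(max_repeats=2)),
    Case("never_deletes", "Clean up my old orders", lambda t: "delete_order" not in t),
]


def run_suite(agent: Callable[[str], Trajectory], suite: list[Case]) -> list[str]:
    """Run every case and return the names of the ones that failed."""
    return [c.name for c in suite if not c.check(agent(c.prompt))]
```

When production surprises you, the fix is one more `SUITE.append(Case(...))` describing that failure, and it runs on every change from then on.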
This is the same loop as good harness engineering generally: when the agent makes a mistake, you fix the instance and then build the thing that catches the whole class of mistake forever.
An eval suite you run by hand once a month is a comfort blanket. The point is to run it on every change (new prompt, new tool, new model) and block the change if the scores drop.
This matters most when you swap models, which everyone now does constantly. A new model lands, it is cheaper or smarter on paper, and the only honest way to know if it is better for your agent is to run your trajectory evals against it. Sometimes the cheaper model holds its tool-calling discipline and you save money. Sometimes it gets chatty, loops more, and quietly doubles your cost. The eval suite is what turns "the new model feels good" into a number you can defend.
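A gate of this kind can be very small. The sketch below assumes every metric is a pass rate between 0 and 1 where higher is better; the metric names, the tolerance, and the example numbers are made up for illustration and are not measurements from any real model.

```python
def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.02,
) -> list[str]:
    """Return the metrics where the candidate fell more than tolerance below baseline.

    A metric missing from the candidate counts as 0.0, so it cannot slip through.
    """
    return [
        metric
        for metric, base in baseline.items()
        if candidate.get(metric, 0.0) < base - tolerance
    ]


def gate(baseline: dict[str, float], candidate: dict[str, float]) -> int:
    """Return a process exit code: 0 lets the change through, 1 blocks it."""
    regressions = find_regressions(baseline, candidate)
    for metric in regressions:
        print(f"BLOCKED: {metric} {baseline[metric]:.2f} -> {candidate.get(metric, 0.0):.2f}")
    return 1 if regressions else 0


# Illustrative numbers only: the current model versus a cheaper candidate.
current = {"tool_choice": 0.97, "valid_args": 0.95, "within_steps": 0.93, "within_cost": 0.96}
cheaper = {"tool_choice": 0.97, "valid_args": 0.96, "within_steps": 0.81, "within_cost": 0.70}
exit_code = gate(current, cheaper)
```

In CI you would pass the return value to `sys.exit`, so a prompt, tool, or model change that loops more or costs more fails the build instead of reaching production.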
Some things a model cannot score, and pretending otherwise is how you end up with agents that pass every automated check and still produce work no human would accept. Was the code actually idiomatic? Did the design hold together? Would you ship this to a customer?
This is the gap I built gmickel-bench around. Most model benchmarks tell you whether a model can solve a self-contained puzzle. They do not tell you whether it can ship a feature in a real codebase, so the scoring there deliberately combines an automated subtotal with a human review pass for the things that only a person can judge. The automated layer catches regressions cheaply and constantly. The human layer catches the things that matter and cannot be faked. You need both, and you need to be honest about which is which.
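One way to stay honest about which layer produced which number is to never blend them. The sketch below is not how gmickel-bench computes its scores; it is a hypothetical record with invented thresholds, showing only the idea of keeping the automated subtotal and the human review as separate fields with separate verdicts.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunScore:
    automated: float               # 0..1, produced by scripted checks
    human: Optional[float] = None  # 0..1, set by a reviewer; None until reviewed

    def verdict(self, auto_floor: float = 0.8, human_floor: float = 0.7) -> str:
        """Report which layer decided the outcome (thresholds are assumptions)."""
        if self.automated < auto_floor:
            # The cheap layer already failed, so reviewer time is not spent.
            return "fail_automated"
        if self.human is None:
            # Automated checks alone are never enough to call it a pass.
            return "needs_review"
        return "pass" if self.human >= human_floor else "fail_human"
```

Under this scheme a run with `RunScore(automated=0.95)` comes back as `needs_review`, not `pass`, which keeps the automated layer from quietly standing in for judgment it cannot supply.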
An agent you cannot evaluate is an agent you are running on faith. Faith is not a deployment strategy. Score the path, gate the change, and keep the human in the loop exactly where the human is irreplaceable.
Evaluation is one of five layers. The rest (tools, verification, memory, guardrails) are in the field guide to harness engineering.