01
- Harness Engineering
- Agent Engineering
- Evals
- AI Agents
Evaluating Agents When the Demo Always Passes
Checking an agent's final answer tells you almost nothing. You have to score the whole trajectory: tool choice, arguments, step count, cost, and policy compliance. Here is how to build evals that catch real failures.
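As a rough illustration of what scoring a trajectory could look like, here is a minimal sketch that checks a recorded run against the five dimensions named above. Everything in it is an assumption made for illustration: the `Step` and `Expectation` types, the `score_trajectory` function, the tool names, and the pass/fail thresholds are hypothetical and not taken from any particular eval framework.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Step:
    """One tool call recorded from an agent run (hypothetical shape)."""
    tool: str
    args: dict[str, Any]
    cost_usd: float = 0.0


@dataclass
class Expectation:
    """What a passing trajectory should look like (hypothetical shape)."""
    expected_tools: list[str]                  # tools that should be called, in order
    required_args: dict[str, dict[str, Any]]   # tool name -> argument values that must match
    max_steps: int
    max_cost_usd: float
    forbidden_tools: set[str] = field(default_factory=set)


def score_trajectory(steps: list[Step], exp: Expectation) -> dict[str, bool]:
    """Return one pass/fail flag per dimension instead of a single verdict."""
    tools = [s.tool for s in steps]
    # Every step must carry the required argument values for its tool, if any are listed.
    args_ok = all(
        s.args.get(key) == value
        for s in steps
        for key, value in exp.required_args.get(s.tool, {}).items()
    )
    return {
        "tool_choice": tools == exp.expected_tools,
        "arguments": args_ok,
        "step_count": len(steps) <= exp.max_steps,
        "cost": sum(s.cost_usd for s in steps) <= exp.max_cost_usd,
        "policy": not any(t in exp.forbidden_tools for t in tools),
    }


# A run that could still produce the right final answer while misbehaving on the way.
steps = [
    Step("search_orders", {"customer_id": 42}, cost_usd=0.01),
    Step("delete_order", {"order_id": 7}, cost_usd=0.01),   # not allowed by policy
    Step("issue_refund", {"order_id": 7, "amount": 20}, cost_usd=0.02),
]
exp = Expectation(
    expected_tools=["search_orders", "issue_refund"],
    required_args={"issue_refund": {"order_id": 7}},
    max_steps=3,
    max_cost_usd=0.05,
    forbidden_tools={"delete_order"},
)
result = score_trajectory(steps, exp)
failed = [name for name, passed in result.items() if not passed]
print(failed)
```

Reporting each dimension separately is a design choice in this sketch: a run that fails only on `policy` calls for a different fix than one that fails only on `cost`, and a single aggregate score would hide that difference.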