Harness Engineering: A Field Guide to Building Agents That Hold Up | Mickel Tech

Every AI agent demo works. That is the problem.

You wire up a model, hand it three tools, ask it to do something impressive on stage, and it does. Everyone nods. Then it ships, and on day two it books the same meeting three times, or confidently updates the wrong record, or burns through forty dollars of tokens chasing its own tail in a loop nobody noticed.

The demo measured the model. Production measures everything around it.

I have spent the last two years building these systems and helping engineering orgs across the DACH region build them, and the gap between "it worked in the demo" and "it holds up in real operations" is almost never the model. It is the scaffolding. The model is a component you rent. The agent is the thing you build around it.

The model is the easy part

There is a useful way to frame this that the field has been converging on. An agent is a model plus the system around it, the part that decides what the model sees, what it is allowed to do, and what happens when it gets something wrong. That system has a name now: the harness, and the discipline of building it well is harness engineering. Some people call the same craft agent engineering. The words matter less than the claim underneath them, which is that the harness carries most of the weight.

LangChain published numbers last year showing they moved a coding agent's success rate by double digits without changing the model at all. Same weights, better harness. I see the same thing in my own work and in client engagements: swap the model and you get a few points. Rebuild the system around it and you change what the agent can be trusted to do.

So this is a field guide to that system. Five layers of harness engineering. Each one is where a different class of production failure lives.

Layer 1: tools the model can actually use

An agent acts on the world through tools, and most agents are bad at it because the tools were designed for programmers, not for a model reading a one-line description under load.

A tool called update_record that takes eleven optional parameters and returns a 4KB JSON blob is a trap. The model will guess at the arguments, misread the response, and you will spend a week wondering why it keeps picking the wrong record. The fix is rarely a smarter model. It is a tool that does one thing, names its arguments like a human would, and returns a short answer the model can reason about.

I treat tool design as a writing problem. The description is a prompt. The arguments are the prompt's interface. The return value is feedback the model has to act on, so it should read like a sentence, not a database dump. When I build agents that reach into a customer's systems of record, the tools are deliberately narrow: take the action, log it, hand back a plain confirmation or a plain reason it could not.
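A minimal sketch of what a narrow tool can look like. Everything here is hypothetical: the `mark_invoice_paid` name, the in-memory `INVOICES` store standing in for a system of record, and the wording of the replies. The point is the shape: one action, one human-readable argument, a log entry, and a sentence back.

```python
# Hypothetical in-memory stand-in for a customer's system of record.
INVOICES = {"INV-1001": {"status": "open"}, "INV-1002": {"status": "paid"}}
AUDIT_LOG: list[str] = []


def mark_invoice_paid(invoice_id: str) -> str:
    """Mark one invoice as paid. Use only after payment has been confirmed.

    invoice_id: the invoice number exactly as the customer sees it, e.g. "INV-1001".
    """
    invoice = INVOICES.get(invoice_id)
    if invoice is None:
        # A plain reason the action could not be taken.
        return f"Could not mark {invoice_id} as paid: no invoice with that number exists."
    if invoice["status"] == "paid":
        return f"No change: {invoice_id} was already marked as paid."
    invoice["status"] = "paid"
    AUDIT_LOG.append(f"mark_invoice_paid {invoice_id}")  # log every real change
    return f"Done: {invoice_id} is now marked as paid."
```

The docstring doubles as the tool description the model reads, which is why it says when to use the tool and what the argument looks like.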

This is also why MCP matters more than it first looks. It turns "give the agent a tool" into a contract you can version, test, and reuse across agents instead of re-gluing the same integration four times. I wrote a longer piece on designing tools agents can actually use if you want the specifics.

Layer 2: verification, not vibes

The single most expensive habit in agent building is trusting the agent's own report that it succeeded.

A model that writes code and then checks that code brings the same blind spots to both jobs. It steered itself into a result and it will steer itself into agreeing the result is fine. Self-testing is self-consistency, and self-consistency is not correctness. I have watched an agent mark a task complete, write a glowing summary, and ship something that did not compile.

The fix is to make success external and mechanical. In Flow-Next, the open-source workflow layer I maintain, nothing counts as done because the agent says so. A different model reviews the work, the review blocks rather than warns, and the task stays open until the reviewer signs off or the system gives up after a set number of attempts and blocks it for a human. I call the strict version of this the untrusted-actor pattern: the agent has to produce proof, and something adversarial has to accept the proof.
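The control flow of a blocking review can be sketched in a few lines. This is an illustration of the pattern, not Flow-Next's actual code: `run_until_reviewed`, `Review`, and the two callables are hypothetical names, and in practice `do_work` and `review` would call two different models.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Review:
    approved: bool
    feedback: str = ""


def run_until_reviewed(
    do_work: Callable[[str], str],
    review: Callable[[str], Review],
    max_attempts: int = 3,
) -> Tuple[str, Optional[str]]:
    """Return ("done", artifact) on sign-off, or ("blocked", None) for a human."""
    feedback = ""
    for _ in range(max_attempts):
        artifact = do_work(feedback)   # the worker never decides it is done
        verdict = review(artifact)     # an independent reviewer decides
        if verdict.approved:
            return "done", artifact
        feedback = verdict.feedback    # the rejection feeds the next attempt
    return "blocked", None             # give up and hand over to a human
```

Note that there is no code path where the worker's own report ends the loop. Only the reviewer's verdict or the attempt limit does.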

For agents acting in the real world, verification looks different but the principle holds. Did the record actually change? Did the email actually send? Check the world, not the transcript. An agent that believes its own narration is an agent that will lie to you cheerfully and without malice.
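As a rough sketch of "check the world", a write can be wrapped so that success is decided by reading the value back rather than by the call returning. The `verified_update` helper and its two callables are assumptions made for illustration; a real system would read back through whatever API owns the record.

```python
from typing import Callable


def verified_update(
    apply_change: Callable[[], None],
    read_back: Callable[[], str],
    expected: str,
) -> str:
    """Apply a change, then confirm it by reading the system of record again."""
    apply_change()
    actual = read_back()  # ask the world, not the agent's transcript
    if actual == expected:
        return f"Verified: value is now {expected!r}."
    return f"Not verified: expected {expected!r} but the system shows {actual!r}."
```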

Layer 3: context and memory as infrastructure

Most production agent failures I diagnose trace back to the same root: the agent did not have the right thing in front of it at the moment it had to decide.

Two failure modes pull in opposite directions. Give the agent too little and it hallucinates the missing piece. Give it too much and it drowns, because a context window only grows during a task and never forgets: every wrong turn and dead end stays in the window, polluting the next decision. The skill is curation: getting the right context in and keeping the wrong context out.
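One way to picture curation is a function that rebuilds the context before each decision instead of appending to it forever. The sketch below is an assumption-laden toy: the `Step` type, the `dead_end` flag, and the `keep_last` cutoff are all hypothetical, and deciding what counts as a dead end is the hard part it leaves out.

```python
from dataclasses import dataclass


@dataclass
class Step:
    action: str
    result: str
    dead_end: bool = False


def curate(history: list[Step], keep_last: int = 3) -> list[str]:
    """Build the context lines for the next decision from the full history."""
    lines = []
    recent_start = max(0, len(history) - keep_last)
    for i, step in enumerate(history):
        if step.dead_end:
            # Keep the lesson, drop the noisy output.
            lines.append(f"Tried {step.action}: did not work, do not repeat.")
        elif i >= recent_start:
            # Recent steps keep their full result.
            lines.append(f"{step.action} -> {step.result}")
        else:
            # Older successful steps shrink to a one-line marker.
            lines.append(f"{step.action}: ok")
    return lines
```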

Memory is where teams reach for a vector database and stop thinking. A vector store is one kind of memory, useful for "find me things that are semantically near this." It is a poor fit for "what did the user tell me to always do," or "what happened in the last three steps," or "what is the current state of this case." Real agents need layered memory: a working set for the task at hand, durable preferences, retrieval over a knowledge base, and state that survives a restart.
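A small sketch of the layering, assuming a JSON file is enough to stand in for durable storage. The `AgentMemory` class and its field names are hypothetical. Retrieval over a knowledge base is left out on purpose, since it is a query against a separate index rather than a field on this object.

```python
import json
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class AgentMemory:
    working_set: list[str] = field(default_factory=list)       # this task only
    preferences: dict[str, str] = field(default_factory=dict)  # "always do X"
    case_state: dict[str, str] = field(default_factory=dict)   # where the case stands

    def save(self, path: Path) -> None:
        # The working set is deliberately not persisted: it belongs to one task.
        durable = {"preferences": self.preferences, "case_state": self.case_state}
        path.write_text(json.dumps(durable))

    @classmethod
    def load(cls, path: Path) -> "AgentMemory":
        # After a restart the agent gets its durable layers back and a clean working set.
        durable = json.loads(path.read_text())
        return cls(preferences=durable["preferences"], case_state=durable["case_state"])
```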

This is the problem GNO exists to solve on the retrieval side: hybrid search over your own documents and code, keyword and vectors and reranking together, because pure vector search misses exact matches and pure keyword search misses meaning. Retrieval is one layer of memory, not the whole stack. I go deeper in "agent memory is not a vector database".
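To make "keyword and vectors together" concrete, here is reciprocal rank fusion, one common way to merge two ranked lists. It is offered as a generic illustration and is not a description of how GNO combines results. The function name and the example document ids are made up, `k = 60` is the conventional default, and the reranking stage is omitted.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of document ids; each list adds 1 / (k + rank) per document."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is stable.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


keyword_hits = ["a", "b", "c"]  # exact-match ranking (hypothetical ids)
vector_hits = ["b", "d", "a"]   # semantic ranking (hypothetical ids)
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
```

Documents that both searches agree on rise to the top, while a document only one search found still makes the list.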

Layer 4: guardrails decide what it is allowed to do

There is something genuinely unsettling about an agent running unattended at 3am with write access to your production systems. Good. Hold onto that feeling, because it is the one that makes you build the fourth layer.

Guardrails are the rules about what the agent may do without asking. They are not a content filter bolted on at the end. They are policy expressed as code: this agent can read these systems and write to none, can refund up to a limit, can escalate to a human on anything it is unsure about, and physically cannot reach the systems it has no business touching. The agent does not get to be trusted by default. It gets exactly the surface area you decided it gets.
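Policy expressed as code can be as plain as the sketch below. The `Policy` fields, the action names, and the three verdicts are assumptions chosen to mirror the examples in the paragraph above, and the check is meant to sit between the model and the tool call, outside the prompt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    readable: frozenset[str]
    writable: frozenset[str]
    refund_limit: float


def authorize(policy: Policy, action: str, system: str, amount: float = 0.0) -> str:
    """Return "allow", "escalate", or "deny" for one proposed action."""
    if action == "read":
        return "allow" if system in policy.readable else "deny"
    if action == "write":
        return "allow" if system in policy.writable else "deny"
    if action == "refund":
        # Inside the limit the agent acts alone; above it a human decides.
        return "allow" if 0 < amount <= policy.refund_limit else "escalate"
    # Anything the policy does not name goes to a human by default.
    return "escalate"
```

The default branch matters most: an action nobody anticipated is escalated, never allowed.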

In regulated work, this layer is the whole engagement. When I built clinical AI at a Swiss health-tech company, the architecture had on-prem masking of personal data and deterministic pipelines around the model precisely because "the model probably will not do the bad thing" is not a sentence you can put in front of a regulator. The interesting design question is rarely what the agent can do. It is what happens at the boundary, the moment it is uncertain, and whether it hands back to a human or barrels ahead.

Layer 5: observability, or you are flying blind

If you cannot replay exactly what an agent did, you cannot fix it, and you certainly cannot put it in front of an auditor.

A chatbot that gives a bad answer is annoying. An agent can take hundreds of steps, calling tools and branching on their results, and any one of those steps can be where it went wrong. Logging the final message tells you nothing. You need the trajectory: every tool call, every argument, every result, the cost, the step count, the point where it changed its mind. When something breaks at 3am, that trace is the difference between a ten-minute fix and a day of guessing.
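A trajectory record does not need much machinery. The sketch below assumes one JSON line per step is an acceptable storage format. The `Trace` and `TraceEvent` names and their fields are hypothetical, and a production version would also capture timestamps and the model's own messages.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class TraceEvent:
    step: int
    tool: str
    arguments: dict
    result: str
    cost_usd: float


@dataclass
class Trace:
    events: list[TraceEvent] = field(default_factory=list)

    def record(self, tool: str, arguments: dict, result: str, cost_usd: float) -> None:
        # Step numbers are assigned here so they cannot drift from the real order.
        self.events.append(TraceEvent(len(self.events) + 1, tool, arguments, result, cost_usd))

    def total_cost(self) -> float:
        return sum(event.cost_usd for event in self.events)

    def to_jsonl(self) -> str:
        # One JSON object per line: easy to append, grep, and replay.
        return "\n".join(json.dumps(asdict(event)) for event in self.events)
```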

This is also where evaluation lives, because you cannot evaluate what you did not record. The naive version checks the final answer. The version that survives production scores the whole path: did it pick the right tool, pass valid arguments, stay inside policy, finish in a sane number of steps. I built gmickel-bench on exactly this premise, that the question is not whether a model can solve a puzzle but whether the system around it can ship. More on that in "evaluating agents when the demo always passes".
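Scoring the path rather than the answer might look like the sketch below. It is not how gmickel-bench works. The step format (a dict with `tool` and `args_valid` keys) and the four check names are assumptions that map one-to-one onto the four questions above.

```python
def score_trajectory(
    steps: list[dict],
    expected_tools: list[str],
    allowed_tools: set[str],
    max_steps: int,
) -> dict[str, bool]:
    """Grade a recorded run on its whole path, independent of the final answer."""
    used = [step["tool"] for step in steps]
    return {
        "right_tools": used == expected_tools,
        # A step with no recorded validation result counts as invalid.
        "valid_arguments": all(step.get("args_valid", False) for step in steps),
        "inside_policy": all(tool in allowed_tools for tool in used),
        "sane_step_count": len(steps) <= max_steps,
    }
```

Returning separate booleans instead of one number keeps the failure readable: a run can reach the right answer and still fail on policy.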

Where to start

You do not build all five layers at once. You build the one that is failing.

Find the mistake the agent keeps making. Then engineer the system so it cannot make that mistake again: a narrower tool, a blocking review, a missing piece of context, a guardrail, a trace that shows you what actually happened. Do that every time something breaks and the agent gets more reliable while it runs. That, not the next model release, is the work.

The model will keep getting better, and that is genuinely good news. It just moves the bottleneck up, to the system you build around it. Most teams are still grading the model. The teams shipping agents that hold up are grading everything else.

If you are building agents into a product or an operation and want a second set of eyes on the architecture, that is a large part of what I do. The demo is the easy part. Day two is the job.