"I Don't Know Why the Agent Did It" Isn't a Forensic Trail

The hard part of enterprise AI used to be getting it to work. In 2026, the hard part is getting it to work only on what you intended, and being able to prove that it did.

A few months ago, a developer posted a screenshot of himself asking Chipotle's customer service bot to reverse a linked list in Python. The bot apparently wrote the function, complete with a complexity note, and then politely asked what he wanted for lunch. Chipotle later said the screenshot was photoshopped, and I'll take their word for it. But the reason the post went viral is that nobody on the internet found it implausible. We've all seen versions of this: a support bot built for one thing being coaxed into doing a different thing entirely.

That's not a Chipotle problem. It's a structural one. Most corporate chatbots are general-purpose models with a system prompt layered on top. If the system prompt doesn't explicitly prohibit off-topic help, the underlying model defaults to being helpful, because that's what it was trained for.

I've been telling clients lately that development isn't the bottleneck anymore. Getting an agent to do the thing you want is the easy part. The hard part is everything that surrounds it: the guardrails that keep it inside its lane, the observability to know when it strayed, and the feedback loop that lets you teach it where the lane actually is.

That's the discipline shift. And it changes what you wire up first.

Why this matters more in 2026

Two things are happening at the same time.

The first is that users are getting comfortable with AI in ways nobody anticipated. They're not afraid of the bot anymore. They're trying to make it useful. That's a healthy sign. It means adoption is actually working. But it also means your guardrails are being stress-tested by people who genuinely just want help, not by adversarial researchers. The Chipotle story isn't a security exploit. It's a developer who needed help with code and asked the nearest helpful thing.

The second is that AI systems take time to learn an organization. I don't mean the technical training; that part is fine. I mean the organizational learning. Knowing which policies apply when. Knowing which questions to refuse. Knowing when "yes" is the wrong answer even though it's technically correct. That kind of knowledge accumulates the way it does for new hires: through corrections, exceptions, and a feedback loop with the people closer to the work. Companies need tolerance for that period. You don't get a fully calibrated agent in week one any more than you get a fully calibrated employee.

If your stack can't tell you what the agent did and why, you can't course-correct it. And without course-correction, the agent doesn't learn. It just keeps making the same mistakes confidently.

The four layers I'd wire up before writing the first prompt

This is the part where most teams skip ahead. They build the agent, ship a demo, and then wonder why production is messy. The wire-up below isn't optional infrastructure. It's the substrate. If it's not there on day one, you spend the next six months retrofitting it.

1. The trace layer

Pick OpenTelemetry. The agent observability story spent the last two years as a fragmented mess. Every vendor had its own trace format, its own dashboard, its own lock-in. That changed this year. The OpenTelemetry community shipped a stable set of conventions specifically for agents and LLM workloads, defining what a trace should capture for an agent's reasoning step, a tool invocation, a model call, and the metadata that has to ride along with each.

The bigger shift is who's catching up. Microsoft's Agent Framework emits these traces natively. Anthropic's Claude Agent SDK exports the same format. Arize, Phoenix, Langfuse, and Datadog all read it. Even Application Insights, which you probably already have if you're on Azure, speaks the standard now. Eighteen months ago every vendor was building a walled garden. This year the walls are quietly coming down, because nobody can sell observability if the trace data isn't portable.

What I look for in a trace, regardless of which platform shows it: every model call captured, every tool invocation with its arguments and output, every agent-to-agent handoff explicit, and the full prompt-and-response stored somewhere I can replay. If your trace doesn't let you replay the run, it's a log, not a trace.
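To make the replay test concrete, here is a minimal sketch in plain Python. It is not the OpenTelemetry SDK or any vendor's schema. The Span and Trace classes, the field names, and the is_replayable check are all hypothetical, and they only illustrate the shape of what I'd want captured for each step.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Span:
    """One recorded step of an agent run (hypothetical schema, not an OTel type)."""
    kind: str                # "model_call", "tool_call", or "handoff"
    name: str                # model name, tool name, or target agent
    inputs: dict[str, Any]   # full prompt or tool arguments
    output: Any              # full response or tool result


@dataclass
class Trace:
    trace_id: str
    spans: list[Span] = field(default_factory=list)

    def record(self, kind: str, name: str, inputs: dict[str, Any], output: Any) -> None:
        self.spans.append(Span(kind, name, inputs, output))


def is_replayable(trace: Trace) -> bool:
    """A run is replayable only if every step kept both its inputs and its output."""
    return bool(trace.spans) and all(
        span.inputs and span.output is not None for span in trace.spans
    )


# Example run: a model call, a tool invocation, and an explicit handoff.
trace = Trace("run-001")
trace.record("model_call", "example-model", {"prompt": "Where is my order?"}, "Let me look that up.")
trace.record("tool_call", "lookup_order", {"order_id": "A123"}, {"status": "shipped"})
trace.record("handoff", "billing_agent", {"reason": "refund request"}, "accepted")
print(is_replayable(trace))
```

A record that only says "lookup_order was called" with no arguments and no result would fail this check, which is exactly the difference between a log and a trace.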

2. The eval layer

The shift here is from grading outputs to grading traces. The output of an agent run can look correct while the path to get there was wrong. The agent picked the wrong tool, the wrong sub-agent, or relied on stale retrieval, but the final answer happened to be right. Output-grading misses this. Trace-grading catches it.
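A toy illustration of the difference, assuming a run is stored as a final answer plus the ordered list of tools it called. The function names, tool names, and the exact-match comparison are my own simplifications for the sketch. Real graders are usually fuzzier than string equality.

```python
def grade_output(final_answer: str, expected_answer: str) -> bool:
    """Output-grading: only looks at what the agent said at the end."""
    return final_answer.strip() == expected_answer.strip()


def grade_trace(tool_calls: list[str], expected_tools: list[str]) -> bool:
    """Trace-grading (simplified): did the agent take the expected path?"""
    return tool_calls == expected_tools


# Hypothetical run: the answer is right, but it came from a stale FAQ search
# instead of the order lookup the agent was supposed to use.
run = {
    "final_answer": "Your order ships Tuesday.",
    "tool_calls": ["search_faq"],
}
expected = {
    "final_answer": "Your order ships Tuesday.",
    "tool_calls": ["lookup_order"],
}

output_ok = grade_output(run["final_answer"], expected["final_answer"])
trace_ok = grade_trace(run["tool_calls"], expected["tool_calls"])
print(f"output passed: {output_ok}, trace passed: {trace_ok}")
```

The run passes the output check and fails the trace check. That gap is the class of failure that only shows up when you grade the path.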

The platforms have moved on this. Microsoft Foundry now ships built-in evaluators that go past quality and safety into agent-specific metrics like tool call accuracy, task completion, and groundedness, with eval results linked directly to the underlying trace so you can drill from a failure score into the run that produced it. Copilot Studio gives makers analytics on what their agents are actually doing in production, which is useful when the people building agents aren't the same people building the underlying platform. OpenAI shipped Trace Grading in the Agents SDK, scoring entire workflow executions instead of just final answers. Anthropic recommends a three-grader pattern (code-based, model-based, and human) running over the full transcript.

Practically, what I want is a golden set of representative runs, a judge model that scores each run against a written behavioral spec, and an eval gate in CI that fails the build if the pass rate on that golden set regresses past an agreed threshold.
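The gate itself can be very small. The sketch below assumes the judge model has already produced one score between 0 and 1 per golden run. The function names, the 0.7 pass mark, and the 0.05 regression allowance are placeholders I picked for illustration, not recommendations.

```python
def eval_gate(
    scores: list[float],
    baseline_pass_rate: float,
    pass_mark: float = 0.7,        # assumed: a run "passes" at or above this judge score
    max_regression: float = 0.05,  # assumed: allowed drop versus the last accepted baseline
) -> tuple[bool, float]:
    """Return (gate_ok, pass_rate) for one evaluation of the golden set."""
    if not scores:
        raise ValueError("golden set produced no scores")
    pass_rate = sum(score >= pass_mark for score in scores) / len(scores)
    return pass_rate >= baseline_pass_rate - max_regression, pass_rate


def exit_code(scores: list[float], baseline_pass_rate: float) -> int:
    """Map the gate result to a process exit code a CI job can act on."""
    gate_ok, _ = eval_gate(scores, baseline_pass_rate)
    return 0 if gate_ok else 1


# Hypothetical judge scores for five golden runs; four clear the pass mark.
golden_scores = [0.9, 0.8, 0.75, 0.4, 0.95]
gate_ok, pass_rate = eval_gate(golden_scores, baseline_pass_rate=0.9)
print(f"pass rate {pass_rate:.2f}, gate ok: {gate_ok}")
```

In a CI job, the script would end by passing exit_code(...) to sys.exit, so that a non-zero code fails the build.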

At Krish Services, evals are the first exercise we run with a new client, before we write any agent code. Not because we're being academic. Because the eval set is where the conversation about boundaries actually happens. What does success look like? What does failure look like? Which inputs should the agent refuse? Which edge cases matter? Those questions get hand-waved during requirements gathering. They get sharp when you have to write a test that scores them. By the time the eval set is signed off, the team has already had the hard conversations about what the agent should and shouldn't do. The agent is just the implementation.

3. The identity layer

This is the layer most teams don't realize is a layer. They give the agent a service account, point it at the API, and ship. Then a year later they're trying to retrofit who the agent is, what it's allowed to do, and how to revoke it cleanly.

Microsoft made this concrete with Entra Agent ID, which went GA earlier this year. Each agent gets a distinct identity: a real service principal, with scoped permissions, conditional access, and lifecycle management. Microsoft Agent 365 hit GA on May 1 at $15/user, bundling agent governance with Defender, Purview, and Intune. If you're already on Entra, you're already most of the way there. You just have to use it.

Why bother? Because when something goes wrong, "I don't know why the agent did it" isn't a forensic trail. "Agent identity X, acting on behalf of user Y, accessed resource Z at time T" is. That distinction matters the first time legal asks you to prove something.
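As a sketch of what that sentence looks like as data, here is a structured audit record built with the Python standard library. None of this is an Entra or Agent 365 API. The AuditEvent fields and the example identifiers are hypothetical, and a real deployment would take the agent and user identities from the platform's tokens rather than from string arguments.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class AuditEvent:
    """Hypothetical record shape: who acted, for whom, on what, and when."""
    agent_id: str      # identity X
    on_behalf_of: str  # user Y
    resource: str      # resource Z
    action: str
    timestamp: str     # time T, ISO 8601 in UTC


def audit_event(
    agent_id: str,
    on_behalf_of: str,
    resource: str,
    action: str,
    now: Optional[datetime] = None,
) -> str:
    """Serialize one access as a JSON line suitable for an append-only log."""
    moment = now or datetime.now(timezone.utc)
    event = AuditEvent(agent_id, on_behalf_of, resource, action, moment.isoformat())
    return json.dumps(asdict(event), sort_keys=True)


line = audit_event(
    agent_id="agent-support-bot",
    on_behalf_of="user-1234",
    resource="orders/A123",
    action="read",
    now=datetime(2026, 1, 1, tzinfo=timezone.utc),
)
print(line)
```

The point of the frozen dataclass is that every record carries all five fields or fails to construct. A shared service account can't fill in the first two honestly, which is the argument for giving each agent its own identity.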

4. The cost layer

This is the most-promised, least-delivered layer in the industry. Vendors give you usage dashboards at the API key level. They almost never give you cost-per-trace, cost-per-workflow, or cost-per-customer-interaction without you doing the math yourself.

The fix isn't elegant. It's adding token counts as OTel span attributes at every LLM call, then aggregating up the trace. Once you have that, you can answer the question every CFO eventually asks ("which of our agents is burning the budget?") without exporting a CSV from three different dashboards.
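Here is roughly what "aggregating up the trace" means, as a plain-Python sketch over spans that have already been exported as dictionaries. The attribute keys follow the OpenTelemetry GenAI semantic conventions as I understand them (gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens), so check them against the version your SDK emits. The model names and prices are invented for the example.

```python
from collections import defaultdict

# Hypothetical prices in USD per one million tokens: (input, output).
PRICES = {
    "model-small": (0.50, 1.50),
    "model-large": (5.00, 15.00),
}


def cost_per_trace(spans: list[dict]) -> dict[str, float]:
    """Sum the cost of every LLM-call span, grouped by trace ID."""
    totals: dict[str, float] = defaultdict(float)
    for span in spans:
        in_price, out_price = PRICES[span["gen_ai.request.model"]]
        totals[span["trace_id"]] += (
            span["gen_ai.usage.input_tokens"] * in_price
            + span["gen_ai.usage.output_tokens"] * out_price
        ) / 1_000_000
    return dict(totals)


# Two traces: t1 made one large-model and one small-model call, t2 made one small call.
spans = [
    {"trace_id": "t1", "gen_ai.request.model": "model-large",
     "gen_ai.usage.input_tokens": 1000, "gen_ai.usage.output_tokens": 500},
    {"trace_id": "t1", "gen_ai.request.model": "model-small",
     "gen_ai.usage.input_tokens": 2000, "gen_ai.usage.output_tokens": 1000},
    {"trace_id": "t2", "gen_ai.request.model": "model-small",
     "gen_ai.usage.input_tokens": 4000, "gen_ai.usage.output_tokens": 0},
]

costs = cost_per_trace(spans)
for trace_id, usd in sorted(costs.items()):
    print(f"{trace_id}: ${usd:.4f}")
```

Grouping by a workflow name or a customer ID instead of the trace ID is the same loop with a different key, provided that attribute is also on the span.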

I haven't seen a clean off-the-shelf answer to this. If your stack does it natively, you're ahead of most.

The discipline shift, in one line

We used to ask "can we build an agent that does X?" That question is mostly solved.

The new question is "can we build an agent that does X, only X, in a way we can audit, evaluate, and improve over time?"

Everything in the wire-up above is in service of that question. None of it is glamorous. None of it shows up in the demo. But it's what separates an agent that works in week one from one that's still trustworthy in year two.

The Chipotle story is funny because it's harmless. The same pattern with PII is a breach. The same pattern with financial advice is regulatory exposure. The same pattern with internal data is a leak waiting for a screenshot.

Wire up first. Build second. Development isn't the bottleneck anymore. The discipline is.
