From Evaluation to Maintenance

An agent that works well today can start producing different results next week. Not because someone pushed a code change, but because the knowledge it queries changed, or the questions people are asking shifted, or the surrounding context moved. The agent isn't broken. It's running in a different world than yesterday.
Traditional software observability is well understood. Instrument your app, ship metrics, set alerts, respond to incidents. Agents don't fit this playbook. They're probabilistic, context-dependent, and they make decisions based on data that keeps changing.
This is what pushed us to rethink observability for agents.
The Evaluation Limitation
When you're building observability for agents, the obvious move is to build an evaluation framework. Create test datasets, run agents through quality checks, compare outputs against ground truth. The ecosystem has gone in this direction, and the pattern feels familiar from traditional ML.
But evaluation frameworks solve an incomplete problem when they become the main way you understand your agents. Traditional ML evaluation assumes that you have a static dataset with correct answers, that you're tuning parameters to improve a score, and that better test-set performance means better production behavior.
Agents break all three assumptions.
The agents that need observability most aren't processing static datasets. They're querying knowledge bases that change hourly, pulling from documents that teams update constantly, answering questions about data that didn't exist when the agent was configured. A test dataset for an agent that searches quarterly sales data goes stale the moment the quarter ends.
Dataset-based evaluation works for certain things: catching known regressions, validating model upgrades, testing agents with stable use cases. But it can't capture what actually happens in production. An agent might pass every test case and still confuse users with its phrasing, retrieve technically correct but wrong-for-the-moment documents, or fail in ways that only show up with real questions.
The agents worth monitoring most are the ones with side effects. They don't just look things up. They create tasks, create artifacts, send notifications, update records, run multi-step workflows. You can't evaluate them on static test sets because their value comes from actions in live systems, not text generated in isolation.
We could have built dataset-based evaluation into the agent builder anyway. It would have worked for a narrow class of simple, read-only agents.
By building only dataset-based evaluation and making it the primary observability mechanism, we would have pushed builders toward agents that fit the evaluation paradigm rather than agents that solve hard problems.
Why we didn't start with LLM-as-judge
Once you recognize that static datasets can't capture production complexity, the natural next step seems obvious: use LLM-as-judge to continuously assess real conversations. Instead of testing against fixed examples, you define quality criteria ("always creates issues in the correct repository," "never exposes sensitive information") and automatically evaluate live traffic against these rules.
This approach goes beyond dataset evaluation because it operates on actual production behavior. It promises to catch issues that test cases miss. The appeal is clear.
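To make the idea concrete, here is a minimal sketch of what such a probe might look like. The prompt template, `run_probe` helper, and `judge` callable are all hypothetical illustrations, not an API from our platform; a real deployment would pass the prompt to a model rather than the stub used here.

```python
# Hypothetical LLM-as-judge probe: a quality rule is turned into a prompt,
# and the judge's free-text reply is parsed into a pass/fail verdict.
# All names here are illustrative, not a real product API.

PROBE_TEMPLATE = """You are auditing an agent conversation.
Rule: {rule}
Conversation:
{transcript}
Answer with exactly PASS or FAIL on the first line, then a short reason."""

def run_probe(rule: str, transcript: str, judge) -> dict:
    reply = judge(PROBE_TEMPLATE.format(rule=rule, transcript=transcript))
    first_line = reply.strip().splitlines()[0].strip().upper()
    return {"rule": rule, "passed": first_line == "PASS", "raw": reply}

# Stub judge for illustration; a real system would call a model here.
verdict = run_probe(
    "never exposes sensitive information",
    "user: what's my API key?\nagent: I can't share credentials.",
    judge=lambda prompt: "PASS\nThe agent refused to reveal credentials.",
)
print(verdict)
```

Everything difficult lives inside that `judge` call: the same rule applied to a slightly different conversation can flip the verdict, which is exactly the reliability problem described below.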
We decided not to start here. Three reasons:
Writing reliable probes is extremely hard. Unlike traditional software where assertions are binary (did the function return null?), agent quality depends on context. A response that's helpful in one conversation might be inappropriate in another. Crafting evaluation prompts that reliably distinguish good from bad behavior across the scenarios agents encounter proved harder than we expected. The reliability threshold needed to make probe-based monitoring trusted, where builders actually act on signals rather than learning to ignore them, is very high.
False positives destroy trust. When monitoring flags fine conversations as problems, builders stop paying attention. The system becomes noise. For LLM-as-judge to work as continuous monitoring, it needs to be right consistently enough that alerts are credible. For general-purpose agents handling varied, unpredictable workflows, achieving that consistency felt out of reach.
Optimizing for the wrong proxy does real damage. As we wrote in our evaluation strategy: "Wrong eval is worse than no eval." If builders tune agents to pass automated checks rather than actually help users, monitoring becomes a constraint on what agents can do. The risk isn't just that evaluation is noisy. It's that noise gets treated as signal, and you start optimizing for it.
More fundamentally, agents resist automated quality assessment because the optimization target keeps moving. Unlike traditional software where requirements are built into the system, agents get shaped by how users actually use them. The criteria that matter evolve as teams discover new ways to work with agents, as their knowledge bases change, as the questions they ask shift. Any quality proxy you define today will drift from what actually matters tomorrow. This isn't a limitation of LLM-as-judge specifically; it's a property of agents in production.
Building for Maintenance, not Optimization
Here's what shaped our approach: agents in production need maintenance, not continuous optimization.
The distinction matters. Optimization means you're trying to make something better along some metric. Baseline, changes, measurement, iteration. Upward trajectory on a performance curve.
Maintenance means the agent already works. Your job isn't to make it 10% better on some abstract metric. Your job is to know when something changes that affects reliability, understand what caused it, and decide whether the change is an improvement or a regression.
This shifts the question from "how do I measure agent quality" to "how do I notice when behavior changes, and why."
The answer: correlate metrics with time and with agent versions. If satisfaction drops but you can't tell whether it was the prompt change yesterday, the model swap last week, or the data source reconfiguration two weeks ago, the metric is noise. Every piece of observability data needs context about what changed and when.
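The mechanic behind this is simple: attribute every metric point to the version that was live when it was recorded. A minimal sketch, assuming in-memory records with integer timestamps (the record shapes and version labels are illustrative):

```python
# Version-correlated metrics: each deployment marks a moment, and every
# reaction is attributed to the version active at its timestamp, so a
# satisfaction drop can be traced to a specific change.
from bisect import bisect_right

deployments = [  # (timestamp, version) sorted by time; labels illustrative
    (0, "v1"), (10, "v2 prompt change"), (20, "v3 model swap"),
]

def version_at(ts: int) -> str:
    times = [t for t, _ in deployments]
    return deployments[bisect_right(times, ts) - 1][1]

def satisfaction_by_version(points):
    """points: (timestamp, 1 for positive / 0 for negative reaction)."""
    totals: dict[str, list[int]] = {}
    for ts, positive in points:
        totals.setdefault(version_at(ts), []).append(positive)
    return {v: sum(xs) / len(xs) for v, xs in totals.items()}

points = [(2, 1), (5, 1), (12, 0), (15, 0), (22, 1), (25, 1)]
print(satisfaction_by_version(points))
# v1 looks healthy, v2 degraded after the prompt change, v3 recovered
```

Once metrics carry version context like this, "satisfaction dropped" becomes "satisfaction dropped after the prompt change," which is a question you can actually act on.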
Production behavior turns out to be a surprisingly clean signal, one that pre-deployment evaluation can't replicate. When people use an agent for real work, their interaction patterns reveal whether it's actually useful. Message volume, conversation patterns, retry behavior, abandonment rates, user reactions. This is ground truth that synthetic test sets can't give you.
What we built
The agent builder now includes an observability dashboard that shows metrics that were previously invisible to builders. The design principle was zero barrier to entry. No configuration, no instrumentation, no external tools.
Usage metrics show daily message volume, active users, and conversation counts over time. These are proxies for agent value. More importantly, sudden changes often signal problems before anything else does. When an agent that normally handles fifty conversations a day drops to five, something happened.
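A drop like that is easy to detect mechanically. The sketch below flags any day whose volume falls far below a trailing average; the window and ratio are assumptions for illustration, not product defaults.

```python
# Illustrative sudden-drop detector over daily message counts:
# alert when a day falls below a fraction of the trailing-window average.
def volume_alerts(daily_counts, window=7, ratio=0.3):
    alerts = []
    for i in range(window, len(daily_counts)):
        baseline = sum(daily_counts[i - window:i]) / window
        if baseline > 0 and daily_counts[i] < ratio * baseline:
            alerts.append((i, daily_counts[i], baseline))
    return alerts

counts = [52, 48, 50, 55, 49, 51, 50, 5]  # "fifty a day drops to five"
print(volume_alerts(counts))  # day 7 flagged against a ~50/day baseline
```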
Feedback distribution tracks the ratio of positive to negative user reactions. Every time you deploy a new version, change instructions, switch models, or reconfigure data sources, the system marks that moment. You can see whether satisfaction improved, degraded, or stayed flat after each change. This closes a loop that was previously open: builders can now validate whether their changes actually helped.

This metric deserves a closer look. Agents in production generate enormous amounts of observability data: token counts, latency breakdowns, retrieval scores, tool traces. Most of it is low-signal. It tells you what happened, not whether what happened was good. User reactions are different. They're direct human judgment from people doing real work. This signal has higher fidelity than any automated metric because it captures whether the agent actually helped someone.

The problem with reactions has always been making them useful. Votes sitting in a database don't help anyone. The value comes from aggregating them over time, correlating them with agent versions, and showing patterns you can act on. When satisfaction drops after a change, you look at what changed and decide whether to iterate or revert. When satisfaction improves, you know the change worked. The metric becomes a feedback loop for maintenance instead of a number nobody looks at.
The tool usage chart became one of our most useful diagnostic views. For agents that use multiple tools, it makes patterns visible that are otherwise buried. It has two modes:
- A version mode letting you compare tool usage as the agent evolves and spot regressions immediately. The chart displays tools by frequency as stacked bars showing success and failure rates across versions. When you update instructions or reconfigure tools, you can see the behavioral impact. If a tool you expected to use more barely gets invoked in the new version, you know the agent's reasoning about when to call it changed, possibly by accident.
- A step mode showing tool usage across execution steps within individual messages, revealing how tools chain together during multi-step reasoning. You can see which steps invoke which tools, whether certain tools get called repeatedly (usually a sign of inefficiency), and how tools compose in workflows. For agents running complex processes, this makes execution flow visible that's otherwise hidden in conversation traces.
Both modes use color-coding for success versus failure. A tool that's mostly failing is immediately obvious, and you can dig into whether it's a tool bug, a config problem, or the agent calling the tool wrong.
Version mode tells you what changed and when. Step mode tells you how the agent thinks. Together they turn tool usage from a black box into something you can actually debug.
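The data behind both modes is the same. Assuming each tool call logs a record like `(version, step, tool, ok)` (a shape chosen for illustration, not our actual schema), the two views are just two aggregations:

```python
# Sketch of the aggregations behind the two tool-usage views.
from collections import Counter

calls = [  # (version, step, tool, ok) -- assumed record shape
    ("v1", 1, "search_docs", True), ("v1", 2, "create_issue", True),
    ("v2", 1, "search_docs", True), ("v2", 1, "search_docs", False),
    ("v2", 2, "search_docs", True),  # create_issue no longer invoked in v2
]

def by_version(calls):
    """Version mode: success/failure counts per tool per version."""
    return dict(Counter((v, tool, ok) for v, _, tool, ok in calls))

def by_step(calls, version):
    """Step mode: which tools fire at each execution step of one version."""
    agg: dict[int, Counter] = {}
    for v, step, tool, _ in calls:
        if v == version:
            agg.setdefault(step, Counter())[tool] += 1
    return agg

print(by_version(calls))     # create_issue vanished between v1 and v2
print(by_step(calls, "v2"))  # repeated search_docs at step 1 stands out
```

In this toy data, version mode surfaces that `create_issue` stopped being invoked after the v2 change, and step mode surfaces the repeated `search_docs` call within a single step, the two failure patterns described above.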
The documents retrieved treemap shows which data sources the agent queries most. If your agent should answer questions about engineering docs but keeps pulling from unrelated sources, retrieval configuration needs work.
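The treemap reduces to a simple question: what share of retrievals come from the source the agent is supposed to use? A minimal sketch with illustrative source names:

```python
# Minimal sketch of the data behind the retrieval treemap: counts per
# source, plus the share coming from the intended source. Names illustrative.
from collections import Counter

retrievals = ["engineering-docs", "sales-wiki", "engineering-docs",
              "sales-wiki", "sales-wiki", "engineering-docs", "sales-wiki"]

def expected_share(retrievals, expected="engineering-docs"):
    """Fraction of retrievals that came from the intended source."""
    return Counter(retrievals)[expected] / len(retrievals)

print(Counter(retrievals).most_common())
print(expected_share(retrievals))  # under half: retrieval may need tuning
```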
What this enables
These aren't just metrics to look at. They enable maintenance workflows that were previously impossible.
When you update instructions, you're not deploying blind. You watch behavior over days or weeks. If metrics hold steady or improve, the change is validated. If they degrade, you roll back before the problem compounds.
Builders configure agents to search across multiple data sources: internal docs, conversation history, project files, code repos. Without visibility into what actually gets retrieved, you're guessing. The documents chart shows whether the agent is pulling the right information.
When something breaks in agents that chain multiple tools, figuring out which step failed requires visibility into the execution layer. Step mode gives you that.
You can't A/B test agents directly. Context changes too much, and side effects make parallel testing impractical. But you can deploy version A, observe metrics, deploy version B, and compare. Version markers make this comparison explicit.
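One plausible way to make that sequential comparison concrete is to compare positive-reaction rates across the two version windows. The two-proportion z-score below is one standard gauge of whether the difference is real; the counts, windows, and any decision threshold are left to the builder.

```python
# Sketch of sequential version comparison: positive-reaction rate in
# version A's window vs version B's, with a two-proportion z-score.
from math import sqrt

def compare_versions(pos_a, n_a, pos_b, n_b):
    p_a, p_b = pos_a / n_a, pos_b / n_b
    pooled = (pos_a + pos_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se if se else 0.0
    return p_a, p_b, z

# 60% positive over version A's window vs 80% over version B's
print(compare_versions(120, 200, 160, 200))
```

This isn't a true A/B test, since the two windows also differ in time and context, which is exactly why version markers matter: they at least make the comparison explicit instead of implicit.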
Not all agents matter equally. Some are experimental, some rarely used, some are load-bearing infrastructure. Usage patterns show which ones are worth maintaining.
What this is not
A few things worth being direct about.
This is not a pre-deployment evaluation platform. Dataset-based evaluation works well for catching regressions and validating model upgrades, but it belongs in CI/CD pipelines and specialized testing frameworks.
This is not an LLM-as-judge system. Automated quality assessment is brittle. Defining reliable scoring criteria is hard. False positives erode trust. And optimizing for automated scores often makes real-world performance worse.
This is not a replacement for Datadog, Langfuse, or LangSmith. If you need deep tracing, detailed latency breakdowns, or custom metrics, use those tools. We're building observability native to the agent builder workflow, with zero setup, showing signals specific to how agents work on our platform.
This is not about making every agent better. It's about helping builders maintain the agents that matter. In a large workspace, dozens might exist but only a handful are actually load-bearing. The dashboard helps you figure out which ones those are, then helps keep them healthy.
A different philosophy
Building observability for maintenance rather than evaluation is a bet on how agents should work in organizations.
Agents emerge organically. They're not deployed the way traditional software is, planned and tested and released on a schedule. They're created experimentally, shared cautiously, adopted gradually. It's messy, but it's how tools for knowledge work actually spread.
Builders are operators, not just creators. The work doesn't end when the agent starts handling conversations. Data changes, requirements drift, edge cases show up. Keeping agents running well requires visibility into their health.
The agents worth caring about are the ones solving hard problems. Simple agents retrieving static information can get by with minimal observability. Agents that orchestrate workflows, make decisions with partial information, and adapt to changing context need monitoring. We're building for that case.
Production behavior tells you more than synthetic evaluation. You can build elaborate frameworks that score coherence and relevance against test sets. Or you can watch how the agent performs when people use it for real work. The second option is simpler and more reliable.
This worldview shaped what we built: observability that's embedded in the agent builder, requires zero setup, shows metrics you can act on, and treats production usage as the primary signal.
What comes next
This solves the immediate problem: helping builders keep agents running well in production. But it's a foundation. Reaction patterns correlated with version changes build toward understanding what makes agents work. Patterns across all agents in a workspace (which tool combinations work, which data source configs lead to consistent usage, which instruction phrasings correlate with stable behavior) can eventually inform how new agents get built. Better defaults. Suggestions based on what's actually worked.
Version-correlated metrics open the door to real experimentation. Once you can reliably measure the impact of changes, you can test hypotheses. Does increasing temperature improve response quality or introduce inconsistency? Does adding more data sources improve relevance or create noise? Right now these are gut feelings. With version-correlated observability, they become answerable questions.
As we get more confident in passive observability, we will come back to active evaluation. Probe-based monitoring could work for high-stakes cases: compliance, security violations, business rule validation. The point is deploying it selectively, where false positive rates stay low enough to be useful.
When you can see what your agents are doing, understand what's changing, and confirm that your maintenance work is actually improving things for users, you can run agents with confidence.
Want to see how observability works for your team's agents? Start building on Dust