AI ROI: Beyond the Adoption KPI

The ROI question comes up in almost every conversation we have with customers. Not because the value is unclear to them, but because they can't articulate it in a way that holds up internally. They know something has changed. A task that used to take an afternoon now takes twenty minutes. Decisions that required three Slack threads and a meeting now have an answer in seconds. But when leadership asks for the number, teams reach for usage stats and call it a day.
Usage data is actually a meaningful signal. Strong adoption numbers tell a real story about change management, about whether AI has become part of how people work rather than a tool they tried once. That is not nothing, especially in the early phases of a deployment. But it is the beginning of the answer, not the whole thing. At some point, the question shifts from "are people using it" to "what is it actually giving back."
That shift is where most measurement frameworks run out of road. And it is what we have spent a lot of time thinking about at Dust, for our customers and for ourselves.
The question beneath the question
When a leader asks "what is the ROI of our AI investment?", they are rarely asking a pure accounting question. They are asking something harder: is this real? Is it compounding? Can I defend this decision to my board, or is it a line item waiting to be cut?
The pressure is legitimate. The tooling to answer it has not kept pace.
Most organizations reach for what is familiar. Surveys. Rough per-query estimates. Usage dashboards. And occasionally, the gold standard: before-and-after workflow studies that take months to run and cover a fraction of actual usage.
Each of these methods has genuine strengths. Each has a fundamental flaw that makes it insufficient on its own.
Four approaches, four problems
- Self-report surveys are the easiest to deploy. Ask users how much time Dust saves them each week. Aggregate the answers. Present the number. The problem is not the math. It is the inputs. Recall bias is significant. Social desirability inflates responses. And a 20% response rate cannot tell you what 80% of your users actually experience.
- Flat per-query estimates feel more rigorous. You pick a number, say five minutes per conversation, multiply it by total usage, and arrive at a clean total. But a quick clarification question and a three-hour research synthesis are not the same conversation. Treating them identically does not just produce the wrong number. It produces a misleading story.
- Before-and-after workflow benchmarks are the most credible methodology available. They are also the most painful to execute. Someone in ops interviews users department by department, establishes baselines for specific tasks, then measures again after deployment. Done well, with a large enough sample, the numbers hold up under scrutiny. But you only measure the workflows you chose to study. You miss everything that emerged organically. And you have to do it all over again next quarter.
- Usage volume counting is fully automated and completely objective. It also tells you almost nothing about value. One thousand conversations is not one thousand productive outcomes. Power users skew averages. And volume metrics can quietly reward engagement over impact.
The honest conclusion: each of these methods is useful as one signal among many. None of them, alone or combined, gives you a defensible, repeatable, and scalable measure of impact.
A different starting point
We built our own methodology at Dust, not because we thought we could solve the problem entirely, but because we believed there was a better question to ask.
Instead of asking users how much time an interaction saved, we looked at what the agent actually did.
Instead of assigning a flat value to every conversation, we classified each output by the tools it used.
Instead of running expensive studies on selected workflows, we automated the analysis across every response, every month, across every workspace we manage.
The insight behind all of this is simple: the complexity of an agent's output is a reliable proxy for the work it replaced. An agent that answers a general question from its base model saves you a different amount of time than one that synthesizes Slack threads, queries a Notion knowledge base, and updates a HubSpot record. Treating them the same misses the entire point.
The four categories
We classify every agent response into one of four categories, based on the tools actually used. A minimal sketch of the classification follows the list.
- Basic Interaction: A simple query answered from the agent's base model and custom instructions. No external tools. We estimate three minutes saved per response.
- Personal Productivity: The agent uses personal tools: Gmail, Outlook, Google Calendar, web search, file generation. Around ten minutes saved.
- Company Data Retrieval: The agent draws from shared knowledge sources: Notion, Confluence, Google Drive, Slack, GitHub. This is where Dust tends to deliver the most concentrated value for most teams. Fifteen minutes saved per response.
- Advanced Workflow Automation: Complex queries involving structured data or external system writes: Snowflake, BigQuery, HubSpot, Salesforce, Jira, Linear. These responses often replace thirty minutes or more of manual work.
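In practice, this classification reduces to a lookup from the tools a response used to a category. Here is one way such a classifier might look; the tool identifiers are illustrative, and the rule that a response takes the highest-value category among its tools is a simplifying assumption, not a description of Dust's internals.

```python
PERSONAL_TOOLS = {"gmail", "outlook", "google_calendar", "web_search", "file_generation"}
COMPANY_TOOLS = {"notion", "confluence", "google_drive", "slack", "github"}
ADVANCED_TOOLS = {"snowflake", "bigquery", "hubspot", "salesforce", "jira", "linear"}

def classify_response(tools_used: set[str]) -> str:
    # Assumption: a response takes the highest-value category
    # that any of its tools reaches.
    if tools_used & ADVANCED_TOOLS:
        return "advanced_workflow"
    if tools_used & COMPANY_TOOLS:
        return "company_data"
    if tools_used & PERSONAL_TOOLS:
        return "personal_productivity"
    return "basic"  # no external tools: base model and instructions only
```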
The calculation follows directly:
1. Count outputs by category over a rolling 28-day window.
2. Multiply each count by its category's time estimate.
3. Sum the hours.
4. Divide by active users to get a per-person metric.
5. Optionally, apply an hourly rate and compare against license cost to produce an ROI multiple.
Steps one through three give you hours saved. That is often the most powerful number to share internally. It is concrete, it is actionable, and it does not require agreeing on an hourly rate, a number that can feel subjective and arbitrary. Steps four and five produce the ROI multiple for executive reporting. Both are valid endpoints. Know your audience.
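Here is what those five steps look like in code. A minimal sketch, with made-up counts purely to show the shape of the output; the per-category minutes are the defaults above, and the active users, hourly rate, and per-seat license cost are inputs you supply.

```python
MINUTES_SAVED = {  # default per-response estimates from the categories above
    "basic": 3,
    "personal_productivity": 10,
    "company_data": 15,
    "advanced_workflow": 30,
}

def hours_saved(counts_28d: dict[str, int]) -> float:
    # Steps 1-3: per-category counts over the window, weighted, summed.
    return sum(MINUTES_SAVED[cat] * n for cat, n in counts_28d.items()) / 60

def roi_per_user(counts_28d, active_users, hourly_rate, license_cost_per_user):
    # Steps 4-5: per-person hours, then dollar value against per-seat cost.
    per_user_hours = hours_saved(counts_28d) / active_users
    return per_user_hours, per_user_hours * hourly_rate / license_cost_per_user

# Made-up numbers: 2,100 responses in the window across 50 active users,
# at $75/hour against a $100-per-seat license cost for the period.
counts = {"basic": 1000, "personal_productivity": 400,
          "company_data": 600, "advanced_workflow": 100}
per_user, multiple = roi_per_user(counts, active_users=50,
                                  hourly_rate=75, license_cost_per_user=100)
print(f"{per_user:.1f} hours per user, {multiple:.1f}x ROI")
# -> 6.3 hours per user, 4.8x ROI
```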
What the data tells you that ROI doesn't
One of the less obvious uses of this framework is what it reveals about organizational maturity.
At the beginning of a deployment, you tend to see heavy concentration in Basic Interaction. That is expected. Teams are learning, running simple queries, building the habit. But the distribution should shift over time, and that migration toward higher-value categories is a leading indicator. If the distribution looks the same at month six as it did at month one, something hasn't worked. A workspace where 70% of interactions are still Basic after six months has an adoption problem, and probably a use case definition problem. A workspace where higher-value categories are growing month over month is building something durable.
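If you can export each month's classified responses, watching that migration takes only a few lines of analysis. A minimal sketch, building on the classifier above; the 70% threshold is the rule of thumb from this section, not a universal constant.

```python
from collections import Counter

def category_shares(labels):
    # `labels` is one month of category labels from the classifier above.
    counts = Counter(labels)
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()}

def still_mostly_basic(labels, threshold=0.70):
    # True when Basic Interaction still dominates after the ramp-up period.
    return category_shares(labels).get("basic", 0.0) > threshold
```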
Beyond category mix, we track three other signals that tend to predict long-term value extraction.
- Agent concentration: Are five agents handling 80% of all usage? Or is value spreading horizontally across teams and functions? High concentration is not necessarily bad, but it often signals untapped opportunity elsewhere. A quick way to compute this follows the list.
- Human versus API split: Automated workflows generate very different value than human interactions. They replace processes entirely, not just augment individuals. They deserve their own ROI lens.
- Cross-functional agent usage: An agent used by three or more departments is a signal of organizational integration. At Dust, our own handoff agent is shared between the sales team and the customer success team. Handoffs between these two functions have always been difficult to execute well. An agent that makes them reliable does not just save time. It changes how two teams collaborate. That does not show up in any time-saved calculation.
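The concentration signal is the easiest of the three to quantify. A sketch, assuming you can export a usage count per agent:

```python
def top_n_share(usage_by_agent: dict[str, int], n: int = 5) -> float:
    # Fraction of all usage handled by the n busiest agents. A result
    # near 0.8 with n=5 is the concentration pattern described above.
    totals = sorted(usage_by_agent.values(), reverse=True)
    return sum(totals[:n]) / sum(totals)
```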
What we don't measure, and why we say so
No methodology is complete. We believe in naming the gaps clearly.
Not every message in a conversation carries the same value. A five-exchange back-and-forth might produce one genuinely useful output. To account for this, we work with approximately 75% of total outputs when calculating time saved. We take the floor, not the ceiling.
Time estimates are fixed by default, not calibrated to your organization. A company where employees are compensated at $200 per hour should apply different weights than one where the average is $50. The framework is built to be adjusted. The defaults are a starting point.
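Both adjustments plug into the earlier calculation as parameters rather than new machinery. The 0.75 ratio and the example rates below are the assumptions discussed here, not measurements.

```python
USEFUL_OUTPUT_RATIO = 0.75  # the floor: count roughly 75% of outputs as useful

def adjusted_hours_saved(counts_28d, ratio=USEFUL_OUTPUT_RATIO):
    # Reuses hours_saved and counts from the calculation sketch above.
    return hours_saved(counts_28d) * ratio

# Calibrate the dollar weight to your organization:
value_at_200 = adjusted_hours_saved(counts) * 200  # $200/hour organization
value_at_50 = adjusted_hours_saved(counts) * 50    # $50/hour organization
```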
API-driven usage can inflate volume significantly. An automated workflow processing thousands of records per month is real value, but it is a different kind of value than individual human interactions. We analyze them separately to keep the numbers honest.
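Separating the two is mechanical once each conversation is tagged with its origin. A sketch reusing hours_saved from earlier; the origin and category fields are illustrative, not an actual export schema.

```python
from collections import Counter

def hours_by_origin(conversations):
    # Report hours saved separately for human and API traffic.
    buckets = {"human": Counter(), "api": Counter()}
    for convo in conversations:
        buckets[convo["origin"]][convo["category"]] += 1
    return {origin: hours_saved(c) for origin, c in buckets.items()}
```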
And quality improvements are not captured at all. A better-drafted proposal, a faster and more informed decision, a meeting where the CSM walked in with full context rather than scrambling five minutes before: these create real value that a time-saved framework will never fully represent.
Our numbers are a floor, not a ceiling. We say that to customers explicitly. Even at half our estimates, the ROI signal holds.
The question isn't whether the ROI is real
It is whether you're building the systems to keep improving it.
AI impact does not plateau on its own. It compounds for teams that measure carefully, use those measurements to shift their category distribution upward, and continuously expand the use cases that generate the highest value. It stagnates for teams that deploy, check adoption numbers, and declare victory.
The measurement framework is not the point. The habit of measurement is.
The question isn't whether you can prove ROI. It's whether you're building the discipline to grow it.
This framework was developed by Dust's Customer Success team. If you'd like to apply this methodology in your organization, reach out to your CSM.