Which AI model should you choose for a Dust agent? Our 2026 guide

Davis Christenhuis
March 6, 2026
Comparing AI Models (2026)
💡 This is the updated version of our original 2024 model comparison guide.
When we first wrote about this in May 2024, the options available in Dust felt manageable. Today, the landscape looks very different.
New models, new players, and some real surprises along the way. We ran the same tests again on Claude, GPT, Gemini, and Mistral. Same tasks and same approach, so you can see exactly how much has changed. Here's what we found.

What has changed since 2024

A lot, honestly. Here's the short version:
  • More choice at every price point: The model landscape has expanded significantly since 2024. There are now strong options at every tier, from lightweight models built for speed and cost efficiency to frontier models pushing the limits of what's possible. The trade-off isn't gone; it's just more nuanced. More capable models still come at a higher cost, but teams now have genuine options depending on what the task actually requires.
  • Multimodal is now standard: Most flagship models today handle text, images, and documents natively. In 2024 this was a differentiator. Now it's expected.
  • The pace accelerated: Claude went from version 3 to 4.6 in under two years. GPT from 4o to 5.2. Gemini from 1.5 to 3.0. Keeping up has become a challenge in itself.
  • Models are now built for agents: In 2024, tool use felt like an add-on. Today, agentic behavior is central to how these models are designed. They handle multi-step tasks, sustain long workflows, and manage tool calls much more reliably.
  • New players entered the space: In 2024, Dust supported four providers. Today there are several more. The market got a lot more competitive.

Claude, GPT, Gemini and Mistral compared

Context window size

Think of the context window as the model's working memory. Everything it can "see" at once (your instructions, the conversation, the documents you've shared) needs to fit inside it. The bigger it is, the more it can hold.
In 2024, Claude's 200K window felt like a lot. Today, Gemini 3 Pro handles 1M tokens, GPT-5.2 offers 400K, and Claude 4.6 (Sonnet & Opus) are hitting 1M in beta. Document analysis tasks that used to require chunking and workarounds now just work.
The trade-off is still the same. Larger context can mean slower responses, and not every token in a big window gets equal attention. But the gap has closed significantly, and for most use cases it's no longer a limiting factor.
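To make the "working memory" idea concrete, here is a minimal back-of-envelope sketch. It uses the common rule of thumb of roughly 4 characters per token for English text; the 200K window, the strings, and the helper function are all illustrative assumptions, not a real tokenizer or a specific model's limit.

```python
# Back-of-envelope check: will instructions + conversation + documents
# fit inside a model's context window? Assumes ~4 characters per token,
# a rough heuristic for English text, not an exact tokenizer.

def estimated_tokens(text: str) -> int:
    return max(1, len(text) // 4)

CONTEXT_WINDOW = 200_000  # illustrative: a 200K-token model

instructions = "You are a contract-review assistant. Flag risky clauses."
conversation = "Please summarize this document in 5 bullet points."
document = "x" * 600_000  # stand-in for a ~600K-character document

total = sum(estimated_tokens(t) for t in (instructions, conversation, document))
print(total, total <= CONTEXT_WINDOW)
```

With a 1M-token window, the same check passes for documents several times this size, which is why chunking workarounds are disappearing.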

Reasoning and analysis

The biggest shift since 2024 is that models can now "think" before they answer. OpenAI, Mistral, Anthropic, and Google all have their own version of this, but the idea is the same: for complex or multi-step problems, the model takes extra time to reason through the answer before giving it to you. In Dust, you can actually see this reasoning process in real time — when you summon an agent, the chain of thought is visible as it works through the problem.
This is no longer a niche capability. A quick email draft doesn't need it. A detailed analysis, a tricky logic problem, or a research-heavy task might benefit from it a lot. Knowing when to lean on it is part of getting the most out of these models in 2026.
As with raw context window numbers, some AI experts point out that reasoning benchmarks can be misleading. The real test is still how a model performs on your actual work, which is why we ran our own comparisons below.

Tone and personality

This is still somewhat subjective, but the differences are real. These are default behaviors you can adjust with good prompting, but it's worth knowing what you're starting with.
  • Claude Opus 4.6 writes in the most human-like way of the four. It's warm but not soft — Anthropic has continued to reduce people-pleasing in recent versions, and you feel it. It's more direct than earlier Claude models, sometimes blunt. The writing quality is excellent, and it genuinely adapts its tone to the audience. It just won't always tell you what you want to hear.
  • GPT-5.2 is the most formal of the group. It's precise and structured, but the default tone can feel cold if you're not used to it. Give it clear instructions on tone and it adapts well. Left to its own devices, it feels more like a tool than a conversation partner.
  • Gemini 3 Pro is more focused than earlier versions, though it can still overshoot on length when it sees room to add value. The flip side is that it tends to sound more confident than it should be, presenting uncertain information with the same authority as established facts. If you need accuracy over comprehensiveness, keep a close eye on what it tells you.
    Gemini 3 Pro is being replaced by Gemini 3.1 Pro in March 2026, so some of these characteristics may shift with the update.
  • Mistral Large 3 is the most direct of the four. It treats your prompt like an instruction and gets on with it. It is efficient with words, producing dense, readable outputs without unnecessary padding. If you want a model that stays close to the brief and does not over-explain, this is your model.
Note: we tested Claude Opus 4.6 using the Dust agent builder, as the model is not available as a standalone option. We didn't change any agent settings, so it's still a raw model test. See our product update 22 for more details.

Putting the AI models to the test

We tested Claude Opus 4.6, GPT-5.2, Gemini 3 Pro, and Mistral Large 3 on the same tasks: brainstorming, writing, proofreading, factual questions, solving riddles, and document analysis. Same criteria as the original article, with one new addition: a reasoning test.
We added it because reasoning is the biggest shift in how these models work since 2024, and it felt wrong not to put it to the test. We ran each model separately, so you're seeing the raw model performance.

Brainstorming

Task given to the models: Give me three ideas for a company retreat agenda. We're a remote team of 25, spread across different time zones. Keep each idea under 50 words.
Winner: GPT-5.2
Planning a retreat for a distributed team is a real challenge. We wanted to see how each model would handle both the creative and practical sides of the brief. GPT-5.2 came out on top as the only model that genuinely engaged with the time zone constraint. Concepts like "Async First Retreat Week" were smart, relevant, and felt like something a real team would actually want to do.
  • Claude Opus 4.6 had solid ideas with clear themes and good names, but missed the 50-word limit on two out of three. Hard to compare fairly when the brief isn't followed.
  • Gemini 3 Pro had decent ideas but also ignored the 50-word limit entirely. The suggestions also leaned generic.
  • Mistral Large 3 went furthest off brief. It structured each idea as a full five-day schedule and the suggestions felt the most generic of the four.

Writing

Task given to the models: A client is asking for a discount we can't offer. Draft a short, professional email declining their request while keeping the relationship warm.
Winner: Claude Opus 4.6
Writing a good decline email is harder than it sounds. You need to be clear without being cold, and keep the door open without overpromising. Claude Opus 4.6 delivered the most balanced result.
Professional, warm, and structured without feeling templated. It declined clearly, offered concrete alternatives, and closed in a way that genuinely kept the relationship intact. The kind of email you'd actually send without editing.
  • GPT-5.2 was a close second. Short and direct with no preamble, and it offered smart alternatives like adjusting scope or phasing the rollout. Slightly cold in tone compared to Claude.
  • Gemini 3 Pro did well this time. Short, polished, and followed the brief. The placeholder text in the alternative options made it feel slightly templated, but the structure was clean.
  • Mistral Large 3 missed the brief entirely. The email was long, included a bullet point list of options, and read more like a sales pitch than a polite decline. The subject line "Appreciation for Your Business" was an odd choice.

Proofreading

Task given to the models: Please correct the factual and spelling errors in the following text. List each error and your correction.
The Eiffel Tower is located in Berlin and was built in 1901 by Gustave Eiffel for the 1900 World's Fair. It stands at 330 meters tall and was originally intended to be a permenant structure. It is now the most visited paid monument in the wrold, attracting around 7 million visitors per year.
Winner: GPT-5.2
All four models caught all six errors, which says a lot about where proofreading capabilities stand in 2026 compared to 2024. But GPT-5.2 went a step further. It flagged a seventh issue we didn't plant: the visitor number. It noted that "around 7 million visitors per year" is imprecise and that recent figures are closer to 6 million.
  • Claude Opus 4.6 found all six errors with a clean, well-structured breakdown. Clear and accurate, nothing missing.
  • Gemini 3 Pro also found all six and added a nice detail — noting that construction ran from 1887 to 1889, not just the completion year. Good extra context.
  • Mistral Large 3 got all the facts right but presented the results in a table format and slightly conflated two separate errors into one, merging the spelling error on "permenant" with the factual correction to "temporary." A minor but noticeable imprecision.

Factual questions

Task given to the models: Explain what AI agents are and how they work, as if you're talking to an 80-year-old with no technical background.
Winner: Claude Opus 4.6
This test is less about accuracy and more about judgment — knowing your audience and adjusting accordingly. Claude Opus 4.6 nailed it. The travel agency analogy was perfectly calibrated for the demographic, and the librarian vs. Google search comparison was clever.
The three-step structure was simple without being condescending, and the closing line — "you don't need to understand how the engine works to ride a car" — was exactly the kind of reassuring, relatable framing an older reader needs.
  • Mistral Large 3 had some genuinely great analogies, like "a librarian who's read every book" and "a friendly voice on the other end of a phone." But it opened with "Got it! Here's a gentle explanation in 10 sentences or fewer" which broke the tone immediately.
  • Gemini 3 Pro used a "digital butler" and a vacation planning analogy, which were decent. But the response ran longer and denser than necessary for the audience.
  • GPT-5.2 was accurate and easy to read but felt more like a technical summary than an explanation to an 80-year-old. Short sentences, no warmth, no analogies tailored to the demographic.

Solving riddles

Task given to the models: I have cities, but no houses live there. I have mountains, but no trees grow there. I have water, but no fish swim there. I have roads, but no cars drive there. What am I?
Winner: Draw
All four got it right, which is itself worth noting. In 2024, this kind of riddle would occasionally trip models up. Today it didn't. Where they differed was in how they answered.
  • Claude Opus 4.6 answered with "A map!" and a clean one-sentence explanation.
  • Mistral Large 3 got it right but opened with "This is a classic riddle!" before breaking it down line by line.
  • Gemini 3 Pro answered with just "You are a map." Nothing more. Correct but no explanation at all.
  • GPT-5.2 got it right but the response appeared duplicated, with the same explanation printed twice. A minor formatting glitch, but noticeable.

Document analysis

Task given to the models: Please summarize this document in 5 bullet points. Highlight any risks or important clauses worth flagging.
Document used: A Professional Services Agreement between Santa Cruz County Regional Transportation Commission and a consultant, found publicly online.
Winner: Gemini 3 Pro
Gemini delivered the most complete response. It split the output into an executive summary and a separate risks section, referenced specific section numbers from the contract, and flagged five distinct risks with clear explanations.
  • Claude Opus 4.6 was a close second. It gave five clear bullet points and flagged risks as it went. It stayed within the brief and was easy to skim. A strong pick if you want something concise.
  • Mistral Large 3 got the content right but went far beyond five bullet points, producing a full multi-section breakdown with dozens of sub-bullets. Useful in a different context, but not what was asked.
  • GPT-5.2 didn't complete the task at all. Instead of summarizing, it asked clarifying questions about what we wanted done with the contract. Clarifying questions can be a strength when the goal is ambiguous, but when the instruction is explicit, they read as hesitation.

New: Reasoning

Task given to the models: A company has 3 candidates for 2 open positions. Alice is more experienced than Bob. Charlie is less experienced than Bob. The most experienced candidate always gets the first position. The second position goes to whoever scores highest in the interview. Alice scored lower in the interview than Charlie. Who gets which position?
Winner: GPT-5.2
This one had a twist built in. Position 1 was straightforward — Alice is the most experienced, so she gets it. But position 2 couldn't actually be fully determined. We know Charlie scored higher than Alice in the interview, but we have no information about Bob's interview score. GPT-5.2 was the only model that caught this, correctly stating that position 2 "cannot be determined" and could go to either Bob or Charlie. That's the right answer.
  • Gemini 3 Pro flagged that Bob’s score was not fully clear with a quick “(Worth noting)” aside, but still landed on Charlie. Halfway there.
  • Claude Opus 4.6 gave Charlie as the answer and flagged that there was no information about Bob, but still committed to Charlie without enough justification. Close, but not exact enough.
  • Mistral Large 3 gave Charlie and simply assumed Bob wasn't the highest scorer because no data was provided. That's a logical leap, not a reasoning step.
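The underdetermination in this puzzle is easy to verify by brute force. A minimal Python sketch (the structure is ours, purely illustrative) enumerates every interview-score ordering consistent with the single clue we have, that Alice scored below Charlie, and checks who could take position 2:

```python
from itertools import permutations

candidates = ["Alice", "Bob", "Charlie"]

# Position 1 is fixed by experience: Alice > Bob > Charlie.
position_1 = "Alice"

# Enumerate every possible interview-score ranking consistent with
# the only score clue given: Alice scored lower than Charlie.
possible_winners = set()
for ranking in permutations(candidates):  # ranking[0] = highest score
    if ranking.index("Alice") > ranking.index("Charlie"):
        remaining = ["Bob", "Charlie"]
        # Position 2 goes to the higher scorer among the remaining two.
        position_2 = min(remaining, key=ranking.index)
        possible_winners.add(position_2)

print(sorted(possible_winners))  # -> ['Bob', 'Charlie']
```

Because both outcomes survive the constraints, "cannot be determined" is the only fully correct answer, which is exactly what GPT-5.2 said.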

Combining LLMs with Dust

No model won every category. Every time you run these tests, you might get a slightly different result. Models get updated, prompts get interpreted differently, and context changes everything.
What matters more than which model "wins" is how you use them. A raw model responding to a one-off prompt is only part of the picture. The real value comes when you combine these models with the right setup, the right context, and the right tools.
The right one depends on what you're trying to do, and the only way to find out is to try them yourself. Dust lets you run the same task across Claude, GPT, Gemini, and Mistral side by side, so you can see the differences firsthand and build from there.
Start with the tasks that matter most to your work. Over time, you'll develop a feel for which model fits which job. That instinct is worth building.
💡 See the differences side by side, then turn the best workflow into an agent with your tools. Try Dust free for 14 days →