Claude 4 Arrives and Multimodal AI Hits Its Stride

July 2025 saw Anthropic release Claude Sonnet 4 and Opus 4, while multimodal AI capabilities — processing text, images, documents, and audio together — became the new baseline expectation.


July 2025 saw two threads converge: Anthropic's long-awaited Claude 4 family arrived, and the broader AI landscape reached the point where multimodal capability — processing text, images, code, and documents in combination — is table stakes rather than a differentiator.

Let's look at both.

Claude Sonnet 4 and Opus 4: What Changed

Anthropic released Claude Sonnet 4 and Claude Opus 4 — the next generation of their flagship models.

Claude Sonnet 4

Sonnet 4 is the everyday workhorse of the Claude 4 family: faster and cheaper than Opus, but a meaningful step up from Claude 3.5 Sonnet in reasoning quality and instruction-following.

In our early testing, the improvements that stand out most are:

Better long-document understanding. Sonnet 4 handles nuance across lengthy documents more reliably than its predecessors — it's less likely to "forget" or conflate information from earlier in the context window.

More predictable instruction following. If you've built prompts with complex formatting requirements, structured output expectations, or multi-part instructions, Sonnet 4 is notably more consistent. This matters enormously in production where format adherence affects downstream processing.
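Format adherence is also something you can enforce rather than hope for. A minimal sketch of the validation step we put between any model and downstream processing, regardless of how consistent the model is (the required keys are a hypothetical invoice schema, not anything Anthropic specifies):

```python
import json

REQUIRED_KEYS = {"vendor", "total", "currency"}  # hypothetical schema

def parse_structured_reply(raw: str) -> dict:
    """Parse a model reply that should be a single JSON object.

    Models sometimes wrap JSON in markdown fences, so strip those
    first; then fail loudly if required keys are missing, so the
    caller can retry with a corrective prompt instead of passing
    malformed data downstream.
    """
    text = raw.strip()
    if text.startswith("```"):
        # Drop an opening fence like ```json and the closing fence.
        text = text.split("\n", 1)[1].rsplit("```", 1)[0]
    data = json.loads(text)
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"reply missing keys: {sorted(missing)}")
    return data
```

The more consistent the model, the less often the retry path fires, but the check itself should never be optional.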

Reduced sycophancy. Anthropic has done significant work on making Claude less likely to simply agree with the user. If you push back on a correct response, Sonnet 4 is more likely to hold its ground. For analytical tasks, this reliability is critical.

Claude Opus 4

Opus 4 sits at the top of the capability curve — Anthropic's most capable model to date. It introduces parallel tool use (executing multiple tool calls simultaneously rather than sequentially), which dramatically reduces latency in agentic workflows.
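The latency win comes from the client side as much as the model: when one model turn returns several tool-use blocks, the caller can run them concurrently and send back one result per call. A sketch of that pattern, with block shapes following the Anthropic Messages API but entirely hypothetical tool implementations:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical local tools; real ones would hit databases or
# external APIs, which is exactly where concurrency pays off.
TOOLS = {
    "get_weather": lambda args: f"18C in {args['city']}",
    "get_flights": lambda args: f"3 flights to {args['city']}",
}

def run_tool_calls(tool_use_blocks: list[dict]) -> list[dict]:
    """Execute every tool_use block from one model turn concurrently,
    returning tool_result blocks in the order the model requested."""
    def run_one(block: dict) -> dict:
        result = TOOLS[block["name"]](block["input"])
        return {
            "type": "tool_result",
            "tool_use_id": block["id"],
            "content": result,
        }

    with ThreadPoolExecutor() as pool:
        return list(pool.map(run_one, tool_use_blocks))
```

If two tool calls each take a second against a remote API, running them sequentially doubles the turn latency; a thread pool keeps it close to the slowest single call.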

Opus 4 also introduces what Anthropic calls "extended agentic capabilities" — the model can manage longer task horizons more reliably, maintain consistent behaviour across many steps, and handle complex multi-agent coordination scenarios.

Pricing is higher than Sonnet 4's, so Opus is best reserved for complex, high-value tasks where the capability premium is justified.

Claude Haiku 4

The speed tier of the Claude 4 family — fast, cheap, and capable enough for high-volume simpler tasks like classification, routing, and first-response generation.

Multimodal AI: The New Baseline

While Claude 4's release was the headline, the more significant shift in July 2025 was that multimodal AI stopped being special and became expected.

A year ago, "it can process images too" was a notable feature. Now, every frontier model handles text, images, documents, and increasingly audio as standard. The question has shifted from "can it process images?" to "how well does it reason across modalities?"

What multimodal unlocks in practice

Document processing pipelines. Invoices, contracts, forms, and reports that mix text with tables, charts, and diagrams can now be processed reliably by a single model call — no more separate OCR + parsing pipelines. For clients in finance, legal, and logistics, this is transformative.
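Concretely, "a single model call" means the document and the question travel in the same request. A sketch of building that request, following the Anthropic Messages API content-block shape (the model ID is a placeholder; check the current docs for exact identifiers):

```python
import base64

def build_invoice_request(pdf_bytes: bytes, question: str) -> dict:
    """Build one Messages API request sending a PDF and a question
    together -- replacing the old separate OCR + parsing pipeline."""
    return {
        "model": "claude-sonnet-4",  # placeholder ID
        "max_tokens": 1024,
        "messages": [{
            "role": "user",
            "content": [
                {
                    "type": "document",
                    "source": {
                        "type": "base64",
                        "media_type": "application/pdf",
                        "data": base64.b64encode(pdf_bytes).decode(),
                    },
                },
                {"type": "text", "text": question},
            ],
        }],
    }
```

The same content-list structure extends to image blocks, so mixed invoice-plus-photo cases need no special handling.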

Visual QA on operational data. A manufacturing client can photograph a dashboard or piece of equipment and ask "what's the anomaly?" A facilities team can photograph an installation and ask "does this comply with our specifications?" These workflows are now practical.

Richer customer support. When a customer uploads a screenshot of an error, a product photo, or a document they don't understand, a multimodal assistant can address the actual problem rather than asking them to describe it in words.

Meeting and audio intelligence. Audio transcription combined with reasoning — summarise this meeting, extract action items, identify where consensus was reached versus where there's still open disagreement — is now a seamless single-model capability.

The gap that remains

Processing multiple modalities together doesn't automatically mean reasoning well across them. A model can describe an image accurately but struggle to connect what it sees to information in the accompanying text. Evaluation of multimodal reasoning for your specific use case is still essential — don't assume "multimodal model" means "understands your documents."
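That evaluation doesn't need to be elaborate to be useful. A minimal sketch of a harness over your own labelled cases, where `model_fn` is whatever callable wraps your model (hypothetical here) and each case pairs an input with a fact the answer must contain:

```python
def score_multimodal_cases(model_fn, cases: list[dict]) -> float:
    """Score a model on labelled multimodal test cases.

    Substring matching is crude -- swap in a proper grader for
    production -- but even this catches the classic failure of
    describing the image correctly while missing what the
    accompanying text says about it.
    """
    hits = 0
    for case in cases:
        answer = model_fn(case["input"])
        if case["must_contain"].lower() in answer.lower():
            hits += 1
    return hits / len(cases)
```

A few dozen cases drawn from your real documents will tell you more than any published benchmark about whether a model reasons across your modalities.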

Choosing Your Model Stack in Mid-2025

With Claude 4, GPT-4.1, Gemini 2.5 Pro, and Llama 4 all available, here's a simplified framework:

| Use case | Model recommendation |
| --- | --- |
| High-volume customer chat | Claude Haiku 4 / GPT-4.1 mini |
| Document analysis and extraction | Claude Sonnet 4 / GPT-4.1 |
| Complex reasoning / analysis | Claude Opus 4 / Gemini 2.5 Pro |
| Private / on-premise deployment | Llama 4 Maverick |
| Agentic workflows | Claude Opus 4 with parallel tools |
| Long-document (1M+ tokens) | GPT-4.1 / Gemini 2.5 Pro |

The most important principle remains the same as it was six months ago: build model-agnostic. The right answer today won't be the right answer in six months. Your architecture should make switching models a configuration change, not a rewrite.
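In practice, model-agnostic means the model ID lives in configuration keyed by task, never hard-coded at call sites. A minimal sketch (the task names and model IDs are illustrative placeholders):

```python
# Model choice lives in config, not code: swapping a provider is a
# one-line change here rather than a hunt through call sites.
MODEL_CONFIG = {
    "chat": "claude-haiku-4",
    "extraction": "claude-sonnet-4",
    "reasoning": "claude-opus-4",
    "long_document": "gemini-2.5-pro",
}

def model_for(task: str) -> str:
    """Resolve a task category to whichever model the config names."""
    try:
        return MODEL_CONFIG[task]
    except KeyError:
        raise ValueError(f"no model configured for task {task!r}") from None
```

Call sites ask for `model_for("extraction")`, and next quarter's model swap touches one dictionary entry.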

Looking Ahead

The second half of 2025 looks set to be defined by two themes: the maturation of agent systems from prototype to production infrastructure, and the question of how organisations actually measure and govern AI at scale.

Both of those are topics we'll be writing about more in the months ahead.