AI Agents in Production: What Works, What Doesn't, and What's Next

The word "agent" has become one of the most overused terms in AI. Every demo shows a model autonomously browsing the web, writing code, sending emails, and managing entire workflows. The reality of production deployments is more nuanced, and more interesting, than the demos suggest.

Having spent the first half of 2025 building and deploying agents for a range of clients, I want to share an honest picture of the current state: what's working well, where you'll hit walls, and how to structure agent systems that hold up in the real world.

What "Agent" Actually Means

An AI agent, in the current practical sense, is a system where a language model:

Receives a goal (not just a question)
Decides which tools or actions to take to achieve that goal
Executes those actions
Observes the results
Decides what to do next, repeating until the goal is achieved or it gets stuck

The key difference from a regular chatbot: it takes actions, not just produces text. It might search a database, call an API, run code, write to a file, or send a message, all as part of completing a task.

What's Working Well in 2025

Single-domain agents with well-defined tools

The highest-reliability deployments we've seen are agents with a focused scope and a curated set of tools. A support agent that can query a knowledge base, look up a customer's order history, and update a ticket status, and nothing else, works reliably.

Reliability degrades as you increase: the number of tools, the complexity of the goal, and the number of steps required. The sweet spot right now is 3-7 tools, goals achievable in 5-10 steps.

RAG-augmented agents

Retrieval-Augmented Generation (RAG) as a tool for agents, where the agent decides when to retrieve context and what to retrieve, works extremely well. The model is good at formulating search queries and synthesising retrieved information. This pattern underlies most of the most useful agents we've deployed.

Code interpreter / sandbox execution

Agents that can write and execute code to solve data problems are genuinely impressive. Give an agent access to Python with pandas and matplotlib, a CSV of sales data, and the question "what drove the revenue increase in Q3?", it will write the analysis, run it, check the output, and revise if needed. This kind of "analyst in a box" workflow is delivering real value for teams without deep analytics expertise.

Human-in-the-loop with explicit escalation

Agents that know their limits, and route to a human when confidence is low or an action is high-stakes, are far more reliable in production than fully autonomous agents. Building explicit escalation paths is not a failure of the technology; it's good system design.

Where You'll Hit Walls

Long horizon tasks

Current models struggle to maintain coherent state over very long task chains (20+ steps). Errors compound. Plans drift from the original goal. Token costs escalate. For tasks that require many steps, breaking them into shorter sub-agents with explicit handoffs works better than a single long-running agent.

Unstructured action spaces

If an agent can "do anything on the web" or "do anything in your file system," reliability drops off a cliff. The combinatorial space of decisions becomes overwhelming. Be opinionated about what tools your agent has access to.

Adversarial inputs (prompt injection)

This is an underappreciated risk. If your agent processes content from external sources, web pages, user-submitted documents, emails, malicious content can attempt to hijack the agent's instructions. The field of "prompt injection" attacks on agents is real and deserves serious attention in your security model.

Cost

Agentic workflows are expensive. A task that requires 10 tool calls, each preceded by an LLM reasoning step, can cost 20-50x more than a single RAG query. Instrument your workflows early and set cost budgets. Surprises in the AI infrastructure bill are not fun.

The Emerging Architecture That Works

After iterating on several production deployments, the architecture we return to most often looks like this:

User request
     │
     ▼
Orchestrator (lightweight model, cheap)
     │  Plans the task, routes to specialists
     ▼
Specialist agents (task-specific, curated tools)
  ├── RAG agent (knowledge retrieval)
  ├── Data agent (analysis/calculation)
  ├── Action agent (write/update/send)
  └── Human escalation
     │
     ▼
Response synthesis + confidence check
     │
     ▼
User response (or human handoff)

The orchestrator doesn't do the heavy lifting, it just plans and routes. Specialist agents are scoped narrowly enough to be reliable. Everything is instrumented and logged.

What Azure AI Agent Service Gets Right

Microsoft's GA of Azure AI Agent Service (announced at Build 2025) aligns well with this architecture. The managed infrastructure handles:

Thread state persistence, maintaining context across multi-turn interactions
Tool definition and execution, structured JSON schema for tools, managed execution with retry logic
Built-in observability, Azure Monitor integration with full trace logs
File storage, agents can read from and write to managed file stores

For production deployments, not having to build the state management and observability infrastructure from scratch is a meaningful time saving. We're standardising on it for new client projects.

The Honest Forecast

Agents are not yet fully autonomous. The "AI does everything while you sleep" narrative is ahead of the technology. But "AI agent that handles a specific, well-defined workflow reliably, with humans reviewing outputs and handling exceptions", that is real, deployable, and valuable today.

The organisations winning with agents right now are the ones who started with a narrow scope, instrumented everything, and iterated. If you're waiting for the perfect autonomous agent before starting, you'll be waiting while your competitors are already learning.

Start narrow. Instrument well. Iterate. The rest will follow.