The Demo-to-Production Gap
I’ve watched dozens of agentic AI demos. They’re compelling: an LLM autonomously plans a sequence of tool calls, retrieves information, synthesizes an answer, and delivers it with apparent reasoning.
Then the demo ends and reality arrives.
Real APIs time out. Real LLMs hallucinate tool arguments. Real users ask questions the agent wasn’t designed for. Real tasks loop indefinitely. Real costs accumulate.
Here’s what I’ve learned building agentic systems that run in production environments, not just in demos.
Pattern 1: Bounded Autonomy
The most important architectural decision isn’t which orchestration framework to use — it’s how much latitude you give the agent.
Fully autonomous agents (the kind that spin up subagents, write code, and execute shell commands) are almost never the right answer outside of constrained sandboxed environments. The failure radius is too large.
What works: human-in-the-loop gates at high-stakes transitions.
For the Mail Invite Agent I built on Copilot Studio, the agent handles all the information gathering and scheduling logic autonomously. But before it sends any calendar invite or modifies anything external, it presents a summary to the user for confirmation. The critical action — mutation of shared state — requires explicit approval.
This pattern lets you advertise “autonomous” functionality while maintaining safety in practice. The user’s confirmation step is a natural UX checkpoint that catches edge cases you didn’t anticipate.
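The gate itself can be very small. Below is a minimal sketch in plain Python, not the Copilot Studio implementation: `ProposedAction`, `run_with_gate`, and the `confirm` callback are hypothetical names invented for illustration. The idea it shows is that read-only actions run directly, while anything flagged as mutating external state waits for approval.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProposedAction:
    """What the agent wants to do, described before anything runs."""
    name: str
    summary: str                   # shown to the user for approval
    mutates_external_state: bool   # True for sends, writes, deletes


def run_with_gate(
    action: ProposedAction,
    execute: Callable[[], str],
    confirm: Callable[[str], bool],
) -> str:
    """Run read-only actions directly; require approval before mutations."""
    if action.mutates_external_state and not confirm(action.summary):
        return f"cancelled: {action.name}"
    return execute()


# Usage: the invite is only "sent" when the confirm callback approves it.
outbox: list[str] = []


def send_invite() -> str:
    outbox.append("invite to a@example.com")
    return "sent"


invite = ProposedAction(
    name="send_invite",
    summary="Invite a@example.com, Tue 10:00, 30 min",
    mutates_external_state=True,
)
denied = run_with_gate(invite, send_invite, confirm=lambda summary: False)
approved = run_with_gate(invite, send_invite, confirm=lambda summary: True)
```

In a real agent, `confirm` would render the summary in the chat UI and wait for the user's answer. Keeping it as a callback also makes the gate easy to test.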
Pattern 2: Deterministic Tool Contracts
LLMs are probabilistic. Tool calls are not. The boundary between these two worlds is where most agentic failures live.
Define your tools with strict, validated contracts. Don’t let the LLM guess:
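from pydantic import BaseModel, Field
from typing import Literal


class CalendarQuery(BaseModel):
    """Query available calendar slots for a given user and time range."""

    user_email: str = Field(description="The user's email address")
    # Note: a plain str is not format-checked; the description only guides the model.
    # Annotating these as datetime.date would make pydantic reject strings that are not valid dates.
    start_date: str = Field(description="ISO 8601 date: YYYY-MM-DD")
    end_date: str = Field(description="ISO 8601 date: YYYY-MM-DD")
    # Literal rejects any duration outside the four allowed values.
    duration_minutes: Literal[30, 60, 90, 120] = Field(
        description="Meeting duration. Must be exactly 30, 60, 90, or 120."
    )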
Strict enums, validated formats, explicit type constraints. When the LLM produces a malformed tool call, you want validation to catch it immediately with a useful error message that can be fed back to the model for self-correction.
Permissive schemas that accept anything are a trap. They let the LLM produce plausible-looking but subtly wrong arguments that only fail downstream, far from where you can correct them.
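To make the "feed it back for self-correction" step concrete, here is a rough sketch of the validate-then-report loop. It is deliberately stdlib-only so it runs without dependencies; with pydantic you would catch `ValidationError` and format its `errors()` instead. The function names `validate_calendar_query` and `feedback_for_model` are hypothetical, and the email check is a crude placeholder rather than real address validation.

```python
from datetime import date

ALLOWED_DURATIONS = (30, 60, 90, 120)


def validate_calendar_query(args: dict) -> list[str]:
    """Return a list of readable problems; an empty list means the call is valid."""
    errors = []
    email = args.get("user_email")
    if not isinstance(email, str) or "@" not in email:
        errors.append("user_email: expected an email address string")
    for field in ("start_date", "end_date"):
        value = args.get(field)
        try:
            date.fromisoformat(value)
        except (TypeError, ValueError):
            errors.append(f"{field}: expected ISO 8601 date YYYY-MM-DD, got {value!r}")
    duration = args.get("duration_minutes")
    if duration not in ALLOWED_DURATIONS:
        errors.append(
            f"duration_minutes: must be one of {ALLOWED_DURATIONS}, got {duration!r}"
        )
    return errors


def feedback_for_model(tool_name: str, errors: list[str]) -> str:
    """Format validation errors as a message the model can act on."""
    lines = [f"Tool call '{tool_name}' was rejected. Fix these arguments and retry:"]
    lines += [f"- {error}" for error in errors]
    return "\n".join(lines)


# Usage: a plausible-looking but wrong call, as an LLM might produce it.
bad_call = {
    "user_email": "a@example.com",
    "start_date": "next Tuesday",
    "end_date": "2025-03-07",
    "duration_minutes": 45,
}
problems = validate_calendar_query(bad_call)
message = feedback_for_model("CalendarQuery", problems)
```

The message names the field, the expected format, and the value received. That is usually enough for the model to repair the call on the next turn.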
Pattern 3: Explicit State Machines
The appeal of LangGraph is exactly this: it forces you to model your agent as an explicit graph of states and transitions, rather than a vague “chain of thought.”
Compare these two approaches:
Implicit (ReAct loop, single-agent):
- LLM decides what to do next at each step
- State is implicit in the conversation history
- Easy to demo, hard to debug, unpredictable under distribution shift
Explicit (LangGraph state machine):
- States: GATHER_INFO, PLAN, EXECUTE, VERIFY, COMPLETE, FAILED
- Transitions: deterministic based on tool results and conditional logic
- State: typed Pydantic object, explicitly updated at each step
The explicit model costs more upfront. It pays for itself the first time you need to debug a production failure and can replay every state transition from a log.
For complex multi-step workflows (anything with > 3 tool calls or > 2 decision branches), I default to LangGraph. For simple question-answer-with-tool patterns, a simpler ReAct loop is fine.
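The explicit model does not depend on any particular framework. The sketch below uses plain Python rather than LangGraph's API, and a dataclass rather than a Pydantic object, to show the core idea: a fixed set of states, a table of allowed transitions, and a log of every transition taken. The specific edges in `TRANSITIONS` (for example, VERIFY looping back to EXECUTE) are my assumptions for illustration, and `AgentRun` is a hypothetical name.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    GATHER_INFO = "GATHER_INFO"
    PLAN = "PLAN"
    EXECUTE = "EXECUTE"
    VERIFY = "VERIFY"
    COMPLETE = "COMPLETE"
    FAILED = "FAILED"


# Allowed transitions (assumed for this sketch). Terminal states have none.
TRANSITIONS = {
    State.GATHER_INFO: {State.PLAN, State.FAILED},
    State.PLAN: {State.EXECUTE, State.FAILED},
    State.EXECUTE: {State.VERIFY, State.FAILED},
    State.VERIFY: {State.COMPLETE, State.EXECUTE, State.FAILED},
    State.COMPLETE: set(),
    State.FAILED: set(),
}


@dataclass
class AgentRun:
    state: State = State.GATHER_INFO
    log: list = field(default_factory=list)

    def transition(self, new_state: State, reason: str) -> None:
        """Move to new_state if the table allows it, recording the step."""
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(
                f"illegal transition {self.state.name} -> {new_state.name}"
            )
        self.log.append((self.state.name, new_state.name, reason))
        self.state = new_state


# Usage: a successful run leaves a replayable record of each step.
run = AgentRun()
run.transition(State.PLAN, "all required fields collected")
run.transition(State.EXECUTE, "plan has 2 tool calls")
run.transition(State.VERIFY, "tool calls returned")
run.transition(State.COMPLETE, "result matches request")
```

Because illegal transitions raise immediately, a model that tries to skip from GATHER_INFO straight to EXECUTE fails loudly at the boundary instead of producing a confusing result later.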
Pattern 4: Retry Logic With Exponential Backoff
Your agent will call external APIs. Those APIs will fail. This is not an exceptional case — it’s a baseline assumption for production systems.
Structure every tool call with:
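import httpx
from tenacity import retry, stop_after_attempt, wait_exponential


# Up to 3 attempts, waiting exponentially longer between them (1 to 10 seconds).
# As written, this retries on any exception, including 4xx responses.
@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=1, max=10)
)
async def call_graph_api(endpoint: str, payload: dict) -> dict:
    async with httpx.AsyncClient() as client:
        response = await client.post(endpoint, json=payload, timeout=10.0)
        response.raise_for_status()  # raise on 4xx/5xx so the retry decorator sees the failure
        return response.json()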
Distinguish between retryable failures (rate limit, timeout, 503) and non-retryable failures (401 unauthorized, 400 bad request). Retrying a bad request wastes time and inflates costs.
When tool calls fail after retries, feed the structured error back to the agent. A good error message enables self-correction. A generic exception traceback does not.
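Here is one way the retryable/non-retryable split and the structured error could look. This is a dependency-free sketch, not tenacity's API: `ToolHTTPError`, `is_retryable`, and `call_with_retry` are hypothetical names, and the exact set of retryable status codes is an assumption that extends the examples above (rate limit, timeout, 503).

```python
import time
from typing import Callable

# Assumed set: request timeout, rate limit, service unavailable, gateway timeout.
RETRYABLE_STATUS = {408, 429, 503, 504}


class ToolHTTPError(Exception):
    """Hypothetical error type carrying the HTTP status of a failed tool call."""

    def __init__(self, status: int, message: str):
        super().__init__(message)
        self.status = status


def is_retryable(exc: Exception) -> bool:
    """Timeouts and selected statuses are worth retrying; 400/401 are not."""
    if isinstance(exc, TimeoutError):
        return True
    return isinstance(exc, ToolHTTPError) and exc.status in RETRYABLE_STATUS


def call_with_retry(
    fn: Callable[[], object],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Return {"ok": True, ...} on success, or a structured error for the agent."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return {"ok": True, "result": fn()}
        except Exception as exc:
            retryable = is_retryable(exc)
            if not retryable or attempt == attempts:
                return {
                    "ok": False,
                    "error": str(exc),
                    "status": getattr(exc, "status", None),
                    "retryable": retryable,
                    "attempts": attempt,
                }
            # Exponential backoff: base_delay, 2x, 4x, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

The returned dict, rather than a traceback, is what goes back into the agent's context. A 400 comes back after a single attempt with `retryable: False`, which tells the model to fix its arguments instead of trying the same call again.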
Pattern 5: Token Budget Management
Agentic loops accumulate context. Each iteration appends tool results, observations, and the agent’s reasoning. Unchecked, a long-running agent can consume 100K+ tokens — slow, expensive, and increasingly incoherent.
Strategies:
Rolling summarization. After every N steps, compress the earlier context into a structured summary and replace the raw history. Keep the full trace in a persistent store, but pass the model only the summary.
Tool result truncation. Long API responses (search results, document contents, calendar dumps) get summarized or truncated before being added to context. The agent rarely needs the full raw response.
Hard token budget. Set a maximum context size. When the budget is exhausted, the agent must either complete or fail cleanly — not spiral further.
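The three strategies fit in a few small helpers. This is a sketch under loud assumptions: `estimate_tokens` is a crude four-characters-per-token stand-in for a real tokenizer, the `summarize` callback stands in for an LLM summarization call, and all function names are hypothetical.

```python
from typing import Callable


def estimate_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: roughly four characters per token."""
    return max(1, len(text) // 4)


def truncate_tool_result(text: str, max_chars: int = 2000) -> str:
    """Cut long tool output and say how much was dropped."""
    if len(text) <= max_chars:
        return text
    return text[:max_chars] + f"\n[truncated {len(text) - max_chars} chars]"


def compact_history(
    history: list[str],
    keep_last: int,
    summarize: Callable[[list[str]], str],
) -> list[str]:
    """Replace everything except the last keep_last entries with one summary."""
    if len(history) <= keep_last:
        return history
    split = len(history) - keep_last
    older, recent = history[:split], history[split:]
    return [summarize(older)] + recent


def within_budget(history: list[str], budget_tokens: int) -> bool:
    """Hard budget check: when this is False, the agent should finish or fail."""
    return sum(estimate_tokens(entry) for entry in history) <= budget_tokens


# Usage: six steps of history compacted to a summary plus the last two steps.
history = [f"step {i}: tool output" for i in range(1, 7)]
compacted = compact_history(
    history,
    keep_last=2,
    summarize=lambda older: f"summary of {len(older)} earlier steps",
)
```

The uncompacted `history` should still be written to persistent storage before compaction, so the full trace survives even though the model only sees the summary.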
Pattern 6: Observability First
You cannot debug what you cannot observe. Instrument everything:
- Every tool call: inputs, outputs, latency, success/failure
- Every LLM call: prompt length, completion length, model, latency, cost
- Agent state at each step: serialized, stored, queryable
- End-to-end trace per user request
This sounds like overhead. It’s survival equipment.
In production, you will encounter agents that loop indefinitely. Agents that call the wrong tool with plausible-looking arguments. Agents that produce correct outputs in 97% of cases and subtly wrong outputs in 3% of cases. Without traces, you’re debugging blind.
The tools that make this tractable: LangSmith for LangChain/LangGraph, or a simple structured logging setup feeding any observability platform.
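For the "simple structured logging" route, a decorator around each tool covers the first bullet above (inputs, outputs, latency, success/failure). This is a minimal sketch: `traced_tool` and the in-memory `TRACE` list are hypothetical stand-ins for whatever log sink you actually use, and `find_slots` is a toy tool invented for the example.

```python
import functools
import json
import time

TRACE: list[str] = []  # in-memory stand-in for a real log sink


def traced_tool(fn):
    """Record inputs, output or error, and latency for every call to fn."""

    @functools.wraps(fn)
    def wrapper(**kwargs):
        record = {"tool": fn.__name__, "inputs": kwargs}
        start = time.perf_counter()
        try:
            result = fn(**kwargs)
            record.update(success=True, output=result)
            return result
        except Exception as exc:
            record.update(success=False, error=f"{type(exc).__name__}: {exc}")
            raise
        finally:
            # Runs on both paths, so failures are logged with their latency too.
            record["latency_ms"] = round((time.perf_counter() - start) * 1000, 3)
            TRACE.append(json.dumps(record, default=str))

    return wrapper


@traced_tool
def find_slots(user_email: str, duration_minutes: int) -> list:
    """Toy tool that returns a fixed slot."""
    return [f"{user_email}: 10:00 for {duration_minutes} min"]


slots = find_slots(user_email="a@example.com", duration_minutes=30)
last_record = json.loads(TRACE[-1])
```

Emitting one JSON object per call keeps the records queryable later. Adding a request ID to each record would tie them into an end-to-end trace per user request.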
The Honest Summary
Agentic AI is genuinely powerful for automating multi-step workflows that previously required human judgment and manual orchestration.
It is not magic. It requires careful architecture, defensive engineering, and realistic expectations about reliability. The agents that survive production are not the most autonomous — they’re the most predictable. Autonomy is the goal; predictability is the constraint.
Build the predictable part first. The autonomy can grow from there.