A Customer Service Email Example
• Agent workflows inherit distributed systems problems — retries, failures, human approvals, long-running steps
• The Process Manager pattern from enterprise integration provides the mental model for stable agents
• Stability comes from explicit state, deterministic routing, idempotent execution, and schema guards — not better prompts
Agent workflows feel new, but the problems they face are not.
Retries, partial failures, human approvals, long-running steps, and external dependencies have existed for decades in enterprise systems. What has changed is that we now place probabilistic systems (LLMs) inside these workflows — which makes stability a first-order design concern.
To understand how to design stable agent workflows, it helps to revisit an older but extremely relevant integration pattern: the Process Manager.
What problem does the Process Manager solve?
The Process Manager pattern exists to coordinate multi-step business processes where:
- Steps may not be strictly sequential
- Routing decisions depend on intermediate results
- Failures and retries are expected
- The process must survive restarts
Rather than embedding orchestration logic inside each processing unit, a Process Manager:
- Maintains explicit process state
- Decides what happens next
- Routes work to independent processing units
- Resumes from where it left off
Crucially, it treats the process as a long-lived thing, not a request/response interaction.
This turns out to be exactly the mental model required for production agent systems.
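Stripped of messaging infrastructure, the pattern can be sketched in a few lines of plain Python. This is an illustrative sketch, not any framework's API: the step names, handlers, and state shape are all made up for the example.

```python
from typing import Optional

# Minimal Process Manager sketch: explicit state, one routing function,
# and independent step handlers. All names here are illustrative.
def route(state: dict) -> Optional[str]:
    """The coordinator alone decides what happens next."""
    if "classified" not in state:
        return "classify"
    if "drafted" not in state:
        return "draft"
    return None  # process complete

HANDLERS = {
    "classify": lambda s: {**s, "classified": True},
    "draft": lambda s: {**s, "drafted": True},
}

def run(state: dict) -> dict:
    # Because state is the only memory, the loop can stop after any
    # step and resume later from the persisted state.
    while (step := route(state)) is not None:
        state = HANDLERS[step](state)
    return state

final = run({"email": "My order has not arrived."})
```

The essential property: the loop owns no hidden memory, so killing it after any step and re-running it from persisted state produces the same result.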
From message orchestration to agent orchestration
Agent frameworks replace message handlers with nodes, and message headers with explicit state — but the underlying architectural challenge is the same:
How do we coordinate multiple steps safely when some steps involve unreliable systems, humans, and retries?
The answer is not “better prompts”. The answer is stability patterns.
Stability patterns for agent workflows
These patterns are not about intelligence — they are about survivability.
1. Explicit, durable state
All workflow progress lives in a persisted state object:
- Inputs
- Intermediate results
- Decisions made
- Completion flags
State is the contract. If the system restarts, state tells you exactly where you are.
This mirrors the Process Manager’s responsibility for tracking the sequence of steps.
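As a minimal sketch, assuming a JSON file as the persistence layer (any database would do; the field names are illustrative):

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass
from typing import Optional

# Illustrative state object: everything needed to resume after a restart.
@dataclass
class EmailWorkflowState:
    email_id: str
    raw_message: str
    intent: Optional[str] = None       # intermediate result
    reply_draft: Optional[str] = None  # intermediate result
    send_status: str = "pending"       # completion flag

def save(state: EmailWorkflowState, path: str) -> None:
    with open(path, "w") as f:
        json.dump(asdict(state), f)

def load(path: str) -> EmailWorkflowState:
    with open(path) as f:
        return EmailWorkflowState(**json.load(f))

path = os.path.join(tempfile.mkdtemp(), "state.json")
save(EmailWorkflowState(email_id="e-1", raw_message="Where is my order?"), path)
restored = load(path)  # after a restart, state says exactly where we are
```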
2. Deterministic routing
Control flow is decided by explicit rules, not free-form model output.
Examples:
- “refund”, “complaint”, or “legal” → approval required
- Everything else → automated path
LLMs may inform decisions, but they do not own control flow.
This prevents non-reproducible execution paths under retries and load.
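A routing table like this can be ordinary code. The category names below are illustrative:

```python
# Deterministic routing: an LLM may propose the category, but an
# explicit rule table owns control flow.
APPROVAL_REQUIRED = {"refund", "complaint", "legal"}

def route_after_classification(classification: dict) -> str:
    # Same input, same path -- under retries and under load.
    if classification.get("category") in APPROVAL_REQUIRED:
        return "human_approval"
    return "automated_reply"

next_step = route_after_classification({"category": "refund"})
```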
3. Single-responsibility nodes
Each step does one thing only:
- Classify
- Fetch data
- Draft response
- Send email
No step secretly coordinates others.
This makes:
- Retries safe
- Failures isolated
- Reasoning about behavior tractable
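A sketch of what single responsibility looks like in practice, with each node writing only its own state keys (all names and values are illustrative):

```python
# Single-responsibility nodes: each writes only its own state keys,
# so any one of them can be retried without disturbing the others.
def classify_node(state: dict) -> dict:
    return {"intent": "delivery_issue"}                # classify only

def fetch_node(state: dict) -> dict:
    return {"order": {"id": "A-1", "eta": "Tuesday"}}  # fetch only

def draft_node(state: dict) -> dict:
    return {"draft": f"Sorry, order {state['order']['id']} is delayed."}

# No node calls another node; a coordinator merges their results.
state = {"email": "Where is my order?"}
for node in (classify_node, fetch_node, draft_node):
    state = {**state, **node(state)}
```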
4. Idempotent execution
Every step must be safe to re-run.
Techniques:
- Upserts instead of inserts
- “Already completed” flags in state
- External calls guarded by recorded outcomes
This is how you avoid double-sending emails when a workflow resumes.
Classic Process Manager systems rely on correlation IDs; agent systems rely on state checks.
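A minimal guard, with a list standing in for the real email provider:

```python
# Idempotent send: once the recorded outcome exists, re-running the
# step is a no-op.
outbox = []

def send_email(state: dict) -> dict:
    if state.get("send_status") == "sent":
        return state                    # already done: safe to skip
    outbox.append(state["email_id"])    # the real side effect
    return {**state, "send_status": "sent"}

state = {"email_id": "e-1", "send_status": "pending"}
state = send_email(state)
state = send_email(state)  # retry after a resume: nothing sent twice
```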
5. Human-in-the-loop as a first-class pause
Human approval is not a special case — it is a pause in execution.
The workflow:
- Records what it is waiting for
- Stops executing
- Resumes when input arrives
No polling loops. No blocked threads. No fragile callbacks.
This is Process Manager thinking applied to human interaction.
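Framework-free, the pause/resume shape looks roughly like this (agent frameworks expose this as a first-class primitive; the names and state shape below are illustrative):

```python
# Pause/resume sketch: the workflow records what it is waiting for and
# returns; a later call carrying the human's input resumes from state.
def step(state: dict) -> dict:
    if "draft" not in state:
        # Produce the proposed reply, record the pause, and stop.
        return {**state, "draft": "Proposed reply text",
                "status": "awaiting_approval"}
    if state.get("approval") is None:
        return state             # still paused: no thread is blocked
    if state["approval"]:
        return {**state, "status": "approved"}
    return {**state, "status": "rejected"}

state = step({"email_id": "e-1"})          # pauses
state = step(state)                        # nothing happens while waiting
state = step({**state, "approval": True})  # human input arrives; resumes
```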
6. Schema guards
LLM outputs are validated before they affect state or trigger actions.
For a customer email reply, that might mean:
- Required apology present
- Tone constrained to approved values
- No promises outside policy
If validation fails:
- Repair
- Retry
- Or escalate
Schema guards are the agent-era equivalent of canonical message models.
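A schema guard can be plain validation code that runs before any state write or send. The policy rules below are illustrative:

```python
# Validate a draft before it can touch state or trigger an action.
ALLOWED_TONES = {"apologetic", "neutral"}
BANNED_PHRASES = ("we guarantee", "unlimited refund")

def validate_draft(draft: dict) -> list:
    errors = []
    body = draft.get("body", "").lower()
    if "sorry" not in body:
        errors.append("required apology missing")
    if draft.get("tone") not in ALLOWED_TONES:
        errors.append("tone not in approved values")
    if any(phrase in body for phrase in BANNED_PHRASES):
        errors.append("promise outside policy")
    return errors  # empty means valid; otherwise repair, retry, or escalate

errors = validate_draft({"body": "We are sorry for the delay.",
                         "tone": "apologetic"})
```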
7. Backpressure and concurrency limits
Stability requires saying “no” under load.
Examples:
- Limit concurrent calls to order systems
- Queue excess work
- Slow intake rather than cascading failure
This prevents agent workflows from overwhelming downstream systems.
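A bounded semaphore is the simplest form of this. The limit and the stand-in downstream call below are illustrative:

```python
import threading
import time

# Backpressure sketch: cap concurrent calls to the order system;
# excess callers wait at the semaphore instead of piling on downstream.
MAX_CONCURRENT = 2
slots = threading.BoundedSemaphore(MAX_CONCURRENT)
lock = threading.Lock()
active = peak = 0

def call_order_system(order_id: int) -> None:
    global active, peak
    with slots:  # blocks here when MAX_CONCURRENT calls are in flight
        with lock:
            active += 1
            peak = max(peak, active)
        time.sleep(0.01)  # stand-in for the real downstream call
        with lock:
            active -= 1

threads = [threading.Thread(target=call_order_system, args=(i,))
           for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# peak never exceeds MAX_CONCURRENT, however much work arrives
```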
8. Observable execution
Every step emits:
- Start
- Success
- Failure
- Decision made
When something goes wrong, you should be able to answer:
- Which step failed?
- With what state?
- What would happen if we resumed now?
Observability is not optional when workflows span minutes, hours, or days.
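A small decorator can emit these events around every step. The event shape here is illustrative; a production system would send them to structured logs or traces:

```python
import functools

events = []  # stand-in for a structured log sink

def observed(step_name: str):
    """Wrap a step so it emits start / success / failure events."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(state: dict) -> dict:
            events.append({"step": step_name, "event": "start"})
            try:
                result = fn(state)
            except Exception as exc:
                # Failure events capture the state needed to answer
                # "what would happen if we resumed now?"
                events.append({"step": step_name, "event": "failure",
                               "error": repr(exc), "state": dict(state)})
                raise
            events.append({"step": step_name, "event": "success"})
            return result
        return wrapper
    return decorator

@observed("classify")
def classify(state: dict) -> dict:
    return {**state, "intent": "delivery_issue"}

state = classify({"email_id": "e-1"})
```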
Example: a stable customer service email workflow
Input:
“My order hasn’t arrived and I need it for Monday. Can you refund shipping?”
Step 1 — Ingest
State initialized:
- email_id
- customer_id
- raw_message
- status = "received"
Step 2 — Intent & risk classification
- intent = "delivery issue"
- risk = "refund request"
Routing rule: refund → approval required
Step 3 — Fetch order details
External system call:
- Guarded by retries
- Results stored in state
Step 4 — Draft response
LLM generates a reply draft.
Step 5 — Schema validation
Validate:
- Apology included
- Refund language compliant
- Tone acceptable
If invalid → repair or escalate.
Step 6 — Human approval (pause)
Workflow records:
- Awaiting approval
- Proposed reply
Execution stops.
Step 7 — Resume after approval
Human approves. Workflow resumes exactly where it left off.
Step 8 — Send email (idempotent)
Before sending, check send_status == "pending". Then:
- Send the email
- Record the provider message ID
- Set send_status = "sent"
If the workflow retries later, nothing is sent twice.
Why this is still a Process Manager — just evolved
If you strip away the LLMs, this workflow looks extremely familiar:
- A coordinator
- Explicit state
- Conditional routing
- Resumability
- Human interaction
- Idempotent side effects
The difference is that agent frameworks make this explicit and programmable, rather than implicit and bespoke.
The Process Manager taught us that orchestration is a distributed systems problem. Agent workflows simply inherit that truth — with new failure modes.
Parallel execution: fan-out/fan-in
When steps are independent, they can run in parallel: for example, fetching order details and the refund policy simultaneously. In code, this fan-out/fan-in pattern looks like:
```python
from langgraph.func import entrypoint, task
from langgraph.checkpoint.memory import MemorySaver

# classify_email, fetch_order_details, fetch_refund_policy, and
# draft_reply are assumed to be @task-decorated functions defined
# elsewhere; calling a task returns a future, and .result() blocks
# until it completes.
checkpointer = MemorySaver()

@entrypoint(checkpointer=checkpointer)
def handle_email(email: dict) -> dict:
    # Sequential: must classify first
    classification = classify_email(email).result()

    # Fan-out: these are independent, run in parallel
    order_future = fetch_order_details(email["order_id"])
    policy_future = fetch_refund_policy(classification["category"])

    # Fan-in: wait for both to complete
    order_details = order_future.result()
    refund_policy = policy_future.result()

    # Continue sequentially with merged context
    context = {**order_details, **refund_policy}
    draft = draft_reply(email, context).result()
    return draft
```
The key stability considerations for parallel execution:
- Independent failures: If one parallel task fails, decide whether to fail the whole workflow or continue with partial results
- Timeout handling: Set timeouts on parallel tasks to prevent indefinite waits
- State consistency: Each parallel branch should write to isolated state keys to avoid conflicts
Closing thought
Stable agent systems are not built by chaining prompts.
They are built by applying decades of integration architecture — deliberately, explicitly, and with humility about failure.
Related Posts
- Agents Are Still Just Software — why agent systems are fundamentally distributed systems problems
- Agents, Routing, Patterns, and Actors — message routing patterns and actor models for agent coordination
- Keeping State Consistent: Database Transactions in LangGraph — patterns for synchronizing workflow state with external databases
References
- Process Manager Pattern — Gregor Hohpe’s canonical description of the Process Manager pattern from Enterprise Integration Patterns
- Thinking in LangGraph — LangChain’s guide to designing workflows with explicit state and conditional routing