A Customer Service Email Example
• Agent workflows inherit distributed systems problems — retries, failures, human approvals, long-running steps
• The Process Manager pattern from enterprise integration provides the mental model for stable agents
• Stability comes from explicit state, deterministic routing, idempotent execution, and schema guards — not better prompts
Agent workflows feel new, but the problems they face are not.
Retries, partial failures, human approvals, long-running steps, and external dependencies have existed for decades in enterprise systems. What has changed is that we now place probabilistic systems (LLMs) inside these workflows — which makes stability a first-order design concern.
To understand how to design stable agent workflows, it helps to revisit an older but extremely relevant integration pattern: the Process Manager.
What problem does the Process Manager solve?
The Process Manager pattern exists to coordinate multi-step business processes where:
- Steps may not be strictly sequential
- Routing decisions depend on intermediate results
- Failures and retries are expected
- The process must survive restarts
Rather than embedding orchestration logic inside each processing unit, a Process Manager:
- Maintains explicit process state
- Decides what happens next
- Routes work to independent processing units
- Resumes from where it left off
Crucially, it treats the process as a long-lived thing, not a request/response interaction.
This turns out to be exactly the mental model required for production agent systems.
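Stripped of messaging infrastructure, the pattern can be sketched in a few lines of plain Python. This is an illustrative sketch, not any framework's API: the step names, handlers, and state shape are all made up for the example.

```python
from typing import Optional

# Minimal Process Manager sketch: explicit state, one routing function,
# and independent step handlers. All names here are illustrative.
def route(state: dict) -> Optional[str]:
    """The coordinator alone decides what happens next."""
    if "classified" not in state:
        return "classify"
    if "drafted" not in state:
        return "draft"
    return None  # process complete

HANDLERS = {
    "classify": lambda s: {**s, "classified": True},
    "draft": lambda s: {**s, "drafted": True},
}

def run(state: dict) -> dict:
    # Because state is the only memory, the loop can stop after any
    # step and resume later from the persisted state.
    while (step := route(state)) is not None:
        state = HANDLERS[step](state)
    return state

final = run({"email": "My order has not arrived."})
```

The essential property: the loop owns no hidden memory, so killing it after any step and re-running it from persisted state produces the same result.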
From message orchestration to agent orchestration
Agent frameworks replace message handlers with nodes, and message headers with explicit state — but the underlying architectural challenge is the same:
How do we coordinate multiple steps safely when some steps involve unreliable systems, humans, and retries?
The answer is not “better prompts”. The answer is stability patterns.
Stability patterns for agent workflows
These patterns are not about intelligence — they are about survivability.
1. Explicit, durable state
All workflow progress lives in a persisted state object:
- Inputs
- Intermediate results
- Decisions made
- Completion flags
State is the contract. If the system restarts, state tells you exactly where you are.
This mirrors the Process Manager’s responsibility for tracking the sequence of steps.
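As a minimal sketch, assuming a JSON file as the persistence layer (any database would do; the field names are illustrative):

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass
from typing import Optional

# Illustrative state object: everything needed to resume after a restart.
@dataclass
class EmailWorkflowState:
    email_id: str
    raw_message: str
    intent: Optional[str] = None       # intermediate result
    reply_draft: Optional[str] = None  # intermediate result
    send_status: str = "pending"       # completion flag

def save(state: EmailWorkflowState, path: str) -> None:
    with open(path, "w") as f:
        json.dump(asdict(state), f)

def load(path: str) -> EmailWorkflowState:
    with open(path) as f:
        return EmailWorkflowState(**json.load(f))

path = os.path.join(tempfile.mkdtemp(), "state.json")
save(EmailWorkflowState(email_id="e-1", raw_message="Where is my order?"), path)
restored = load(path)  # after a restart, state says exactly where we are
```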
2. Deterministic routing
Control flow is decided by explicit rules, not free-form model output.
Examples:
- “refund”, “complaint”, or “legal” → approval required
- Everything else → automated path
LLMs may inform decisions, but they do not own control flow.
This prevents non-reproducible execution paths under retries and load.
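A routing table like this can be ordinary code. The category names below are illustrative:

```python
# Deterministic routing: an LLM may propose the category, but an
# explicit rule table owns control flow.
APPROVAL_REQUIRED = {"refund", "complaint", "legal"}

def route_after_classification(classification: dict) -> str:
    # Same input, same path -- under retries and under load.
    if classification.get("category") in APPROVAL_REQUIRED:
        return "human_approval"
    return "automated_reply"

next_step = route_after_classification({"category": "refund"})
```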
3. Single-responsibility nodes
Each step does one thing only:
- Classify
- Fetch data
- Draft response
- Send email
No step secretly coordinates others.
This makes:
- Retries safe
- Failures isolated
- Reasoning about behavior tractable
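A sketch of what single responsibility looks like in practice, with each node writing only its own state keys (all names and values are illustrative):

```python
# Single-responsibility nodes: each writes only its own state keys,
# so any one of them can be retried without disturbing the others.
def classify_node(state: dict) -> dict:
    return {"intent": "delivery_issue"}                # classify only

def fetch_node(state: dict) -> dict:
    return {"order": {"id": "A-1", "eta": "Tuesday"}}  # fetch only

def draft_node(state: dict) -> dict:
    return {"draft": f"Sorry, order {state['order']['id']} is delayed."}

# No node calls another node; a coordinator merges their results.
state = {"email": "Where is my order?"}
for node in (classify_node, fetch_node, draft_node):
    state = {**state, **node(state)}
```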
4. Idempotent execution
Every step must be safe to re-run.
Techniques:
- Upserts instead of inserts
- “Already completed” flags in state
- External calls guarded by recorded outcomes
This is how you avoid double-sending emails when a workflow resumes.
Classic Process Manager systems rely on correlation IDs; agent systems rely on state checks.
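A minimal guard, with a list standing in for the real email provider:

```python
# Idempotent send: once the recorded outcome exists, re-running the
# step is a no-op.
outbox = []

def send_email(state: dict) -> dict:
    if state.get("send_status") == "sent":
        return state                    # already done: safe to skip
    outbox.append(state["email_id"])    # the real side effect
    return {**state, "send_status": "sent"}

state = {"email_id": "e-1", "send_status": "pending"}
state = send_email(state)
state = send_email(state)  # retry after a resume: nothing sent twice
```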
5. Human-in-the-loop as a first-class pause
Human approval is not a special case — it is a pause in execution.
The workflow:
- Records what it is waiting for
- Stops executing
- Resumes when input arrives
No polling loops. No blocked threads. No fragile callbacks.
This is Process Manager thinking applied to human interaction.
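Framework-free, the pause/resume shape looks roughly like this (agent frameworks expose this as a first-class primitive; the names and state shape below are illustrative):

```python
# Pause/resume sketch: the workflow records what it is waiting for and
# returns; a later call carrying the human's input resumes from state.
def step(state: dict) -> dict:
    if "draft" not in state:
        # Produce the proposed reply, record the pause, and stop.
        return {**state, "draft": "Proposed reply text",
                "status": "awaiting_approval"}
    if state.get("approval") is None:
        return state             # still paused: no thread is blocked
    if state["approval"]:
        return {**state, "status": "approved"}
    return {**state, "status": "rejected"}

state = step({"email_id": "e-1"})          # pauses
state = step(state)                        # nothing happens while waiting
state = step({**state, "approval": True})  # human input arrives; resumes
```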
6. Schema guards
LLM outputs are validated before they affect state or trigger actions.
For a customer email reply, that might mean:
- Required apology present
- Tone constrained to approved values
- No promises outside policy
If validation fails:
- Repair
- Retry
- Or escalate
Schema guards are the agent-era equivalent of canonical message models.
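A schema guard can be plain validation code that runs before any state write or send. The policy rules below are illustrative:

```python
# Validate a draft before it can touch state or trigger an action.
ALLOWED_TONES = {"apologetic", "neutral"}
BANNED_PHRASES = ("we guarantee", "unlimited refund")

def validate_draft(draft: dict) -> list:
    errors = []
    body = draft.get("body", "").lower()
    if "sorry" not in body:
        errors.append("required apology missing")
    if draft.get("tone") not in ALLOWED_TONES:
        errors.append("tone not in approved values")
    if any(phrase in body for phrase in BANNED_PHRASES):
        errors.append("promise outside policy")
    return errors  # empty means valid; otherwise repair, retry, or escalate

errors = validate_draft({"body": "We are sorry for the delay.",
                         "tone": "apologetic"})
```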
7. Backpressure and concurrency limits
Stability requires saying “no” under load.
Examples:
- Limit concurrent calls to order systems
- Queue excess work
- Slow intake rather than cascading failure
This prevents agent workflows from overwhelming downstream systems.
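A bounded semaphore is the simplest form of this. The limit and the stand-in downstream call below are illustrative:

```python
import threading
import time

# Backpressure sketch: cap concurrent calls to the order system;
# excess callers wait at the semaphore instead of piling on downstream.
MAX_CONCURRENT = 2
slots = threading.BoundedSemaphore(MAX_CONCURRENT)
lock = threading.Lock()
active = peak = 0

def call_order_system(order_id: int) -> None:
    global active, peak
    with slots:  # blocks here when MAX_CONCURRENT calls are in flight
        with lock:
            active += 1
            peak = max(peak, active)
        time.sleep(0.01)  # stand-in for the real downstream call
        with lock:
            active -= 1

threads = [threading.Thread(target=call_order_system, args=(i,))
           for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# peak never exceeds MAX_CONCURRENT, however much work arrives
```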
8. Observable execution
Every step emits:
- Start
- Success
- Failure
- Decision made
When something goes wrong, you should be able to answer:
- Which step failed?
- With what state?
- What would happen if we resumed now?
Observability is not optional when workflows span minutes, hours, or days.
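A small decorator can emit these events around every step. The event shape here is illustrative; a production system would send them to structured logs or traces:

```python
import functools

events = []  # stand-in for a structured log sink

def observed(step_name: str):
    """Wrap a step so it emits start / success / failure events."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(state: dict) -> dict:
            events.append({"step": step_name, "event": "start"})
            try:
                result = fn(state)
            except Exception as exc:
                # Failure events capture the state needed to answer
                # "what would happen if we resumed now?"
                events.append({"step": step_name, "event": "failure",
                               "error": repr(exc), "state": dict(state)})
                raise
            events.append({"step": step_name, "event": "success"})
            return result
        return wrapper
    return decorator

@observed("classify")
def classify(state: dict) -> dict:
    return {**state, "intent": "delivery_issue"}

state = classify({"email_id": "e-1"})
```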
Example: a stable customer service email workflow
Input:
“My order hasn’t arrived and I need it for Monday. Can you refund shipping?”
Step 1 — Ingest
State initialized:
- email_id
- customer_id
- raw_message
- status = "received"
Step 2 — Intent & risk classification
- intent = "delivery issue"
- risk = "refund request"
Routing rule: refund → approval required
Step 3 — Fetch order details
External system call:
- Guarded by retries
- Results stored in state
Step 4 — Draft response
LLM generates a reply draft.
Step 5 — Schema validation
Validate:
- Apology included
- Refund language compliant
- Tone acceptable
If invalid → repair or escalate.
Step 6 — Human approval (pause)
Workflow records:
- Awaiting approval
- Proposed reply
Execution stops.
Step 7 — Resume after approval
Human approves. Workflow resumes exactly where it left off.
Step 8 — Send email (idempotent)
Before sending, check send_status == "pending". Then:
- Send the email
- Record the provider message ID
- Set send_status = "sent"
If the workflow retries later, nothing is sent twice.
Why this is still a Process Manager — just evolved
If you strip away the LLMs, this workflow looks extremely familiar:
- A coordinator
- Explicit state
- Conditional routing
- Resumability
- Human interaction
- Idempotent side effects
The difference is that agent frameworks make this explicit and programmable, rather than implicit and bespoke.
The Process Manager taught us that orchestration is a distributed systems problem. Agent workflows simply inherit that truth — with new failure modes.
Parallel execution: fan-out/fan-in
When steps are independent, they can run in parallel: for example, fetching order details and the refund policy simultaneously. In code, this fan-out/fan-in pattern looks like:
```python
from langgraph.func import entrypoint, task
from langgraph.checkpoint.memory import MemorySaver

# classify_email, fetch_order_details, fetch_refund_policy, and
# draft_reply are assumed to be @task-decorated functions defined
# elsewhere; calling a task returns a future, and .result() blocks
# until it completes.
checkpointer = MemorySaver()

@entrypoint(checkpointer=checkpointer)
def handle_email(email: dict) -> dict:
    # Sequential: must classify first
    classification = classify_email(email).result()

    # Fan-out: these are independent, run in parallel
    order_future = fetch_order_details(email["order_id"])
    policy_future = fetch_refund_policy(classification["category"])

    # Fan-in: wait for both to complete
    order_details = order_future.result()
    refund_policy = policy_future.result()

    # Continue sequentially with merged context
    context = {**order_details, **refund_policy}
    draft = draft_reply(email, context).result()
    return draft
```
The key stability considerations for parallel execution:
- Independent failures: If one parallel task fails, decide whether to fail the whole workflow or continue with partial results
- Timeout handling: Set timeouts on parallel tasks to prevent indefinite waits
- State consistency: Each parallel branch should write to isolated state keys to avoid conflicts
Closing thought
Stable agent systems are not built by chaining prompts.
They are built by applying decades of integration architecture — deliberately, explicitly, and with humility about failure.
Related Posts
- Agents Are Still Just Software — why agent systems are fundamentally distributed systems problems
- Agents, Routing, Patterns, and Actors — message routing patterns and actor models for agent coordination
- Keeping State Consistent: Database Transactions in LangGraph — patterns for synchronizing workflow state with external databases
References
- Process Manager Pattern — Gregor Hohpe’s canonical description of the Process Manager pattern from Enterprise Integration Patterns
- Thinking in LangGraph — LangChain’s guide to designing workflows with explicit state and conditional routing