
From Process Managers to Stable Agent Workflows

Published:  at  10:00 AM

A Customer Service Email Example

TL;DR

• Agent workflows inherit distributed systems problems — retries, failures, human approvals, long-running steps

• The Process Manager pattern from enterprise integration provides the mental model for stable agents

• Stability comes from explicit state, deterministic routing, idempotent execution, and schema guards — not better prompts

Agent workflows feel new, but the problems they face are not.

Retries, partial failures, human approvals, long-running steps, and external dependencies have existed for decades in enterprise systems. What has changed is that we now place probabilistic systems (LLMs) inside these workflows — which makes stability a first-order design concern.

To understand how to design stable agent workflows, it helps to revisit an older but extremely relevant integration pattern: the Process Manager.


What problem does the Process Manager solve?

The Process Manager pattern exists to coordinate multi-step business processes where:

  • Steps may not be strictly sequential
  • Routing decisions depend on intermediate results
  • Failures and retries are expected
  • The process must survive restarts

Rather than embedding orchestration logic inside each processing unit, a Process Manager:

  • Maintains explicit process state
  • Decides what happens next
  • Routes work to independent processing units
  • Resumes from where it left off

Crucially, it treats the process as a long-lived entity, not a single request/response interaction.

This turns out to be exactly the mental model required for production agent systems.


From message orchestration to agent orchestration

Agent frameworks replace message handlers with nodes, and message headers with explicit state — but the underlying architectural challenge is the same:

How do we coordinate multiple steps safely when some steps involve unreliable systems, humans, and retries?

The answer is not “better prompts”. The answer is stability patterns.


Stability patterns for agent workflows

These patterns are not about intelligence — they are about survivability.

1. Explicit, durable state

All workflow progress lives in a persisted state object:

  • Inputs
  • Intermediate results
  • Decisions made
  • Completion flags

State is the contract. If the system restarts, state tells you exactly where you are.

This mirrors the Process Manager’s responsibility for tracking the sequence of steps.

2. Deterministic routing

Control flow is decided by explicit rules, not free-form model output.

Examples:

  • “Refund”, “complaint”, or “legal” → approval required
  • Everything else → automated path

LLMs may inform decisions, but they do not own control flow.

This prevents non-reproducible execution paths under retries and load.
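Deterministic routing can be as simple as a pure function of state (the category names below are illustrative):

```python
# Categories that always require a human in the loop.
APPROVAL_CATEGORIES = {"refund", "complaint", "legal"}

def route(classification: dict) -> str:
    """Pure function of state: same input, same route, on every retry."""
    if classification.get("category") in APPROVAL_CATEGORIES:
        return "approval_required"
    return "automated_path"
```

The LLM may produce the `category` value, but the branch itself is decided by this rule, so replaying the workflow always takes the same path.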

3. Single-responsibility nodes

Each step does one thing only:

  • Classify
  • Fetch data
  • Draft response
  • Send email

No step secretly coordinates others.

This makes:

  • Retries safe
  • Failures isolated
  • Reasoning about behavior tractable
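A sketch of single-responsibility nodes: each takes state in and returns new state out, with no knowledge of what runs next (the stubbed logic is illustrative):

```python
def classify(state: dict) -> dict:
    # One job: attach a category. No routing, no side effects.
    category = "refund" if "refund" in state["raw_message"].lower() else "other"
    return {**state, "category": category}

def fetch_data(state: dict) -> dict:
    # One job: attach order data (stubbed here; a real node would
    # call the order system).
    return {**state, "order": {"id": state["email_id"], "status": "shipped"}}
```

The coordinator alone decides the sequence; retrying either node in isolation is safe because neither one triggers the other.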

4. Idempotent execution

Every step must be safe to re-run.

Techniques:

  • Upserts instead of inserts
  • “Already completed” flags in state
  • External calls guarded by recorded outcomes

This is how you avoid double-sending emails when a workflow resumes.

Classic Process Manager systems rely on correlation IDs; agent systems rely on state checks.
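Two of these techniques sketched in plain Python (the `_done`/`_result` key naming is an illustrative convention):

```python
def upsert(store: dict, key: str, record: dict) -> None:
    # Upsert instead of insert: re-running the same write leaves
    # one record, not a duplicate row.
    store[key] = {**store.get(key, {}), **record}

def run_once(state: dict, step: str, action) -> dict:
    # "Already completed" flag: skip the side effect on re-runs.
    done_key = f"{step}_done"
    if state.get(done_key):
        return state
    result = action()
    return {**state, done_key: True, f"{step}_result": result}
```

The pattern in both cases is the same: the outcome is recorded in state, and the state check runs before the side effect.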

5. Human-in-the-loop as a first-class pause

Human approval is not a special case — it is a pause in execution.

The workflow:

  • Records what it is waiting for
  • Stops executing
  • Resumes when input arrives

No polling loops. No blocked threads. No fragile callbacks.

This is Process Manager thinking applied to human interaction.
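A pause is just a state transition, not a blocked thread. A minimal sketch (the `status` values are illustrative):

```python
def advance(state: dict) -> dict:
    """Advance the workflow one step; stop (not block) when approval is needed."""
    if state["status"] == "drafted" and state.get("approval") is None:
        # Record what we are waiting for, then simply return.
        return {**state, "status": "awaiting_approval"}
    if state["status"] == "awaiting_approval" and state.get("approval") is True:
        return {**state, "status": "approved"}
    return state  # nothing to do until input arrives
```

Resuming is just: load the persisted state, merge in the human's decision, and call `advance` again.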

6. Schema guards

LLM outputs are validated before they affect state or trigger actions.

For a customer email reply, that might mean:

  • Required apology present
  • Tone constrained to approved values
  • No promises outside policy

If validation fails:

  • Repair
  • Retry
  • Or escalate

Schema guards are the agent-era equivalent of canonical message models.
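A schema guard for the email-reply case might look like this (the specific tone values and banned phrases are illustrative policy, not a real rule set):

```python
ALLOWED_TONES = {"apologetic", "neutral", "friendly"}
BANNED_PHRASES = ("guarantee",)  # promises outside policy

def validate_reply(reply: dict) -> list[str]:
    """Return a list of violations; an empty list means the draft may proceed."""
    body = reply.get("body", "").lower()
    errors = []
    if "sorry" not in body:
        errors.append("missing apology")
    if reply.get("tone") not in ALLOWED_TONES:
        errors.append("tone not in approved set")
    if any(phrase in body for phrase in BANNED_PHRASES):
        errors.append("off-policy promise")
    return errors
```

The guard runs before the draft touches state or triggers a send; a non-empty result routes to repair, retry, or escalation.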

7. Backpressure and concurrency limits

Stability requires saying “no” under load.

Examples:

  • Limit concurrent calls to order systems
  • Queue excess work
  • Slow intake rather than cascading failure

This prevents agent workflows from overwhelming downstream systems.
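A minimal concurrency limit, sketched with a bounded semaphore (rejecting excess here; a real system might queue it instead):

```python
import threading

class Limiter:
    """Cap concurrent calls to a downstream system; reject excess work."""

    def __init__(self, max_concurrent: int):
        self._sem = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, *args):
        # Saying "no" under load: fail fast instead of piling on.
        if not self._sem.acquire(blocking=False):
            raise RuntimeError("over capacity: queue or retry later")
        try:
            return fn(*args)
        finally:
            self._sem.release()
```

Wrapping every order-system call in a `Limiter` means a burst of emails slows intake rather than cascading into downstream failure.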

8. Observable execution

Every step emits:

  • Start
  • Success
  • Failure
  • Decision made

When something goes wrong, you should be able to answer:

  • Which step failed?
  • With what state?
  • What would happen if we resumed now?

Observability is not optional when workflows span minutes, hours, or days.
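Structured step events can be sketched as a thin wrapper around each node (the event names mirror the list above; shipping to a real log sink is out of scope here):

```python
import time

def emit(log: list, step: str, event: str, **fields) -> None:
    # Append a structured event; production code would ship this
    # to a log sink or tracing backend.
    log.append({"ts": time.time(), "step": step, "event": event, **fields})

def run_step(log: list, step: str, fn, state: dict) -> dict:
    emit(log, step, "start")
    try:
        new_state = fn(state)
    except Exception as exc:
        emit(log, step, "failure", error=str(exc))
        raise
    emit(log, step, "success")
    return new_state
```

With the failing step, its error, and the state at that point all recorded, "what would happen if we resumed now?" becomes answerable from the log plus the persisted state.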


Example: a stable customer service email workflow

Input:

“My order hasn’t arrived and I need it for Monday. Can you refund shipping?”

Step 1 — Ingest

State initialized:

  • email_id
  • customer_id
  • raw_message
  • status = "received"

Step 2 — Intent & risk classification

  • intent = "delivery issue"
  • risk = "refund request"

Routing rule: refund → approval required

Step 3 — Fetch order details

External system call:

  • Guarded by retries
  • Results stored in state

Step 4 — Draft response

LLM generates a reply draft.

Step 5 — Schema validation

Validate:

  • Apology included
  • Refund language compliant
  • Tone acceptable

If invalid → repair or escalate.

Step 6 — Human approval (pause)

Workflow records:

  • Awaiting approval
  • Proposed reply

Execution stops.

Step 7 — Resume after approval

Human approves. Workflow resumes exactly where it left off.

Step 8 — Send email (idempotent)

Before sending:

  • Check send_status == pending

Send email. Record provider message ID. Set send_status = sent.

If the workflow retries later, nothing is sent twice.
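Step 8 as a guarded call, sketched in plain Python (`provider_send` stands in for whatever email API is in use):

```python
def send_if_pending(state: dict, provider_send) -> dict:
    """Send at most once: check the flag, do the side effect, record the outcome."""
    if state.get("send_status") != "pending":
        return state  # resumed or retried: nothing to do
    message_id = provider_send(state["draft"])
    return {**state, "send_status": "sent", "provider_message_id": message_id}
```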


Why this is still a Process Manager — just evolved

If you strip away the LLMs, this workflow looks extremely familiar:

  • A coordinator
  • Explicit state
  • Conditional routing
  • Resumability
  • Human interaction
  • Idempotent side effects

The difference is that agent frameworks make this explicit and programmable, rather than implicit and bespoke.

The Process Manager taught us that orchestration is a distributed systems problem. Agent workflows simply inherit that truth — with new failure modes.


Parallel execution: fan-out/fan-in

When steps are independent, they can run in parallel. In code, this fan-out/fan-in pattern looks like the following, where we fetch order details and refund policy simultaneously:

# Assumes LangGraph's functional API; classify_email, fetch_order_details,
# fetch_refund_policy, and draft_reply are @task-decorated functions
# that return futures.
from langgraph.func import entrypoint, task
from langgraph.checkpoint.memory import MemorySaver

checkpointer = MemorySaver()

@entrypoint(checkpointer=checkpointer)
def handle_email(email: dict) -> dict:
    # Sequential: must classify first
    classification = classify_email(email).result()

    # Fan-out: these are independent, run in parallel
    order_future = fetch_order_details(email["order_id"])
    policy_future = fetch_refund_policy(classification["category"])

    # Fan-in: wait for both to complete
    order_details = order_future.result()
    refund_policy = policy_future.result()

    # Continue sequentially with merged context
    context = {**order_details, **refund_policy}
    draft = draft_reply(email, context).result()

    return draft

The key stability considerations for parallel execution:

  • Independent failures: If one parallel task fails, decide whether to fail the whole workflow or continue with partial results
  • Timeout handling: Set timeouts on parallel tasks to prevent indefinite waits
  • State consistency: Each parallel branch should write to isolated state keys to avoid conflicts
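These three considerations can be sketched with stdlib futures (the fetch functions are stubs; the timeout value and the choice to continue with partial results are illustrative policy):

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError

def fetch_order(order_id: str) -> dict:
    return {"order_id": order_id, "status": "shipped"}   # stubbed external call

def fetch_policy(category: str) -> dict:
    return {"refund_window_days": 30}                    # stubbed external call

def fan_out_with_timeout(timeout: float = 5.0) -> dict:
    with ThreadPoolExecutor(max_workers=2) as pool:
        order_f = pool.submit(fetch_order, "o-123")
        policy_f = pool.submit(fetch_policy, "refund")
        context: dict = {}
        for name, fut in (("order", order_f), ("policy", policy_f)):
            try:
                # Isolated state keys: each branch writes under its own name.
                context[name] = fut.result(timeout=timeout)
            except TimeoutError:
                context[name] = None  # continue with partial results
        return context
```

Whether a `None` branch fails the workflow or degrades it gracefully is a routing decision, which keeps it deterministic like every other branch.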

Closing thought

Stable agent systems are not built by chaining prompts.

They are built by applying decades of integration architecture — deliberately, explicitly, and with humility about failure.

References

  • Process Manager Pattern — Gregor Hohpe’s canonical description of the Process Manager pattern from Enterprise Integration Patterns
  • Thinking in LangGraph — LangChain’s guide to designing workflows with explicit state and conditional routing