
From Process Managers to Stable Agent Workflows

Published at 10:00 AM

A Customer Service Email Example

TL;DR

• Agent workflows inherit distributed systems problems — retries, failures, human approvals, long-running steps

• The Process Manager pattern from enterprise integration provides the mental model for stable agents

• Stability comes from explicit state, deterministic routing, idempotent execution, and schema guards — not better prompts

Agent workflows feel new, but the problems they face are not.

Retries, partial failures, human approvals, long-running steps, and external dependencies have existed for decades in enterprise systems. What has changed is that we now place probabilistic systems (LLMs) inside these workflows — which makes stability a first-order design concern.

To understand how to design stable agent workflows, it helps to revisit an older but extremely relevant integration pattern: the Process Manager.


What problem does the Process Manager solve?

The Process Manager pattern exists to coordinate multi-step business processes where:

• steps span multiple systems and can fail independently

• the order of steps may depend on intermediate results

• some steps are long-running or require human input

Rather than embedding orchestration logic inside each processing unit, a Process Manager:

• maintains the state of each process instance

• decides the next step based on explicit rules and intermediate results

• routes work to processing units and correlates their responses back to the right instance

Crucially, it treats the process as a long-lived, stateful entity, not a request/response interaction.

This turns out to be exactly the mental model required for production agent systems.


From message orchestration to agent orchestration

Agent frameworks replace message handlers with nodes, and message headers with explicit state — but the underlying architectural challenge is the same:

How do we coordinate multiple steps safely when some steps involve unreliable systems, humans, and retries?

The answer is not “better prompts”. The answer is stability patterns.


Stability patterns for agent workflows

These patterns are not about intelligence — they are about survivability.

1. Explicit, durable state

All workflow progress lives in a persisted state object:

• the current step and its status

• intermediate results (classification, order details, drafts)

• retry counts and approval status

State is the contract. If the system restarts, state tells you exactly where you are.

This mirrors the Process Manager’s responsibility for tracking the sequence of steps.
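As a minimal sketch of this idea, here file-based JSON persistence stands in for a real checkpointer, and the field names (current_step, retry_count, approval_status) are illustrative:

```python
import json
import tempfile
from pathlib import Path

# File-based JSON persistence stands in for a real checkpointer;
# the field names below are illustrative, not a fixed schema.
def save_state(path: Path, state: dict) -> None:
    path.write_text(json.dumps(state))

def load_state(path: Path) -> dict:
    return json.loads(path.read_text())

state_file = Path(tempfile.mkdtemp()) / "workflow_state.json"
save_state(state_file, {
    "workflow_id": "wf-123",
    "current_step": "draft_reply",
    "retry_count": 1,
    "approval_status": "pending",
})

# After a restart, the persisted state tells us exactly where we are.
resumed = load_state(state_file)
assert resumed["current_step"] == "draft_reply"
```

Everything the workflow needs to resume lives in that one object; nothing important is held only in memory.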

2. Deterministic routing

Control flow is decided by explicit rules, not free-form model output.

Examples:

• a refund request always routes to human approval

• a low-confidence classification routes to human review

• a failed validation routes to a repair step or escalation

LLMs may inform decisions, but they do not own control flow.

This prevents non-reproducible execution paths under retries and load.
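A hedged sketch of such rules (the intent labels, confidence threshold, and step names are hypothetical):

```python
# Illustrative routing rules: the categories, threshold, and step names
# are assumptions, not from any specific framework.
def next_step(classification: dict) -> str:
    """Explicit rules own control flow; the LLM only supplies the inputs."""
    if classification["intent"] == "refund":
        return "human_approval"        # refunds always require approval
    if classification["confidence"] < 0.7:
        return "human_review"          # low confidence goes to a person
    return "draft_reply"

assert next_step({"intent": "refund", "confidence": 0.99}) == "human_approval"
assert next_step({"intent": "status", "confidence": 0.5}) == "human_review"
assert next_step({"intent": "status", "confidence": 0.9}) == "draft_reply"
```

Because the branch is a pure function of the classification, the same inputs always produce the same execution path, even under retries.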

3. Single-responsibility nodes

Each step does one thing only: classify the email, fetch order details, draft a reply, validate it, or send it.

No step secretly coordinates others.

This makes:

• retries safe to scope to a single step

• each step independently testable

• failures easy to isolate

4. Idempotent execution

Every step must be safe to re-run.

Techniques:

• check state before performing side effects (e.g., send_status)

• use idempotency keys on external calls

• record external identifiers (like the provider message ID) in state

This is how you avoid double-sending emails when a workflow resumes.

Classic Process Manager systems rely on correlation IDs; agent systems rely on state checks.
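A sketch of the state-check approach, with an in-memory dict standing in for durable storage and a fabricated message ID standing in for a real provider call:

```python
sent_log = {}  # stand-in for durable storage keyed by workflow ID

def send_email_idempotent(state: dict) -> dict:
    """Safe to re-run: the state check makes the second call a no-op."""
    if state.get("send_status") == "sent":
        return state                        # already sent; skip
    key = state["workflow_id"]
    message_id = f"provider-msg-{key}"      # stand-in for a real send
    sent_log[key] = message_id
    state["provider_message_id"] = message_id
    state["send_status"] = "sent"
    return state

state = {"workflow_id": "wf-123"}
state = send_email_idempotent(state)
state = send_email_idempotent(state)        # retry after resume: no second send
assert len(sent_log) == 1
assert state["send_status"] == "sent"
```

The guard lives in persisted state, so it survives restarts, which is exactly when duplicate sends would otherwise happen.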

5. Human-in-the-loop as a first-class pause

Human approval is not a special case — it is a pause in execution.

The workflow:

• records that approval is pending

• persists its state

• stops executing until the decision arrives

No polling loops. No blocked threads. No fragile callbacks.

This is Process Manager thinking applied to human interaction.
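A pure-Python sketch of pause-as-state (the step names and flags are illustrative; checkpointing frameworks provide equivalents of this mechanic):

```python
def run_until_blocked(state: dict) -> dict:
    """Advance the workflow; if approval is needed, record it and stop."""
    if state["step"] == "await_approval" and not state.get("approved"):
        state["status"] = "paused"   # persisted; no thread sits blocked
        return state
    state["step"] = "send_email"
    state["status"] = "running"
    return state

def deliver_approval(state: dict) -> dict:
    """Called whenever the human decision arrives; resumes from stored state."""
    state["approved"] = True
    return run_until_blocked(state)

state = {"step": "await_approval"}
state = run_until_blocked(state)
assert state["status"] == "paused"

state = deliver_approval(state)      # hours or days later
assert state["step"] == "send_email"
```

The pause is just a value in state, so approval can arrive minutes or days later without any process waiting in the meantime.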

6. Schema guards

LLM outputs are validated before they affect state or trigger actions.

For a customer email reply, that might mean:

• required fields (subject, body) are present and non-empty

• the reply stays within length limits

• any promised refund is consistent with policy

If validation fails:

• the draft is repaired (e.g., re-prompt with the violations), or

• the case is escalated to a human

Schema guards are the agent-era equivalent of canonical message models.
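A minimal guard might look like this; the field names and the length limit are assumptions for illustration:

```python
def validate_reply(draft: dict) -> list[str]:
    """Return a list of violations; empty means the draft may affect state."""
    errors = []
    for field in ("subject", "body"):
        if not draft.get(field):
            errors.append(f"missing {field}")
    if len(draft.get("body", "")) > 2000:
        errors.append("body too long")
    return errors

assert validate_reply({"subject": "Re: order", "body": "On its way."}) == []
assert "missing body" in validate_reply({"subject": "Re: order"})
```

The returned violation list doubles as repair input: it can be fed back to the model, or attached to the escalation.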

7. Backpressure and concurrency limits

Stability requires saying “no” under load.

Examples:

• cap the number of concurrently running workflow instances

• rate-limit calls to the LLM and the email provider

• queue new work instead of starting it immediately

This prevents agent workflows from overwhelming downstream systems.
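One way to sketch an admission limit with the standard library (the class name and limit are illustrative):

```python
import threading

class AdmissionGate:
    """Refuse new workflow starts beyond a fixed concurrency limit."""
    def __init__(self, limit: int):
        self._sem = threading.Semaphore(limit)

    def try_start(self) -> bool:
        # Non-blocking acquire: under load, the answer is simply "no".
        return self._sem.acquire(blocking=False)

    def finish(self) -> None:
        self._sem.release()

gate = AdmissionGate(limit=2)
assert gate.try_start()
assert gate.try_start()
assert not gate.try_start()   # third concurrent workflow is refused
gate.finish()
assert gate.try_start()       # capacity freed, admission resumes
```

Refused work can be queued or retried later; the point is that the decision is explicit rather than discovered as a downstream timeout.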

8. Observable execution

Every step emits:

• the workflow ID, step name, and status

• timestamps and durations

• enough input/output detail to reconstruct what happened

When something goes wrong, you should be able to answer: which step failed, with what input, and what the state looked like at that moment.

Observability is not optional when workflows span minutes, hours, or days.
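As a sketch, with an in-memory list standing in for a real log pipeline or tracing backend (the event fields are illustrative):

```python
import time

events = []   # stand-in for a log pipeline or tracing backend

def emit(workflow_id: str, step: str, status: str, **detail) -> None:
    """Record one structured event per step transition."""
    events.append({
        "ts": time.time(),
        "workflow_id": workflow_id,
        "step": step,
        "status": status,
        **detail,
    })

emit("wf-123", "classify", "ok", intent="refund")
emit("wf-123", "fetch_order", "error", error="timeout")

# Answering "which step failed?" is now a query, not an archaeology dig.
failures = [e for e in events if e["status"] == "error"]
assert failures[0]["step"] == "fetch_order"
```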


Example: a stable customer service email workflow

Input:

“My order hasn’t arrived and I need it for Monday. Can you refund shipping?”

Step 1 — Ingest

State initialized: the raw email, a workflow ID, and an initial status.

Step 2 — Intent & risk classification

Routing rule: refund → approval required

Step 3 — Fetch order details

External system call: look up the order in the order system and record the result in state, retrying on transient failure.

Step 4 — Draft response

LLM generates a reply draft.

Step 5 — Schema validation

Validate:

If invalid → repair or escalate.

Step 6 — Human approval (pause)

Workflow records: that approval is pending, and the state is checkpointed.

Execution stops.

Step 7 — Resume after approval

Human approves. Workflow resumes exactly where it left off.

Step 8 — Send email (idempotent)

Before sending: check send_status in state; if already sent, skip.

Send email. Record provider message ID. Set send_status = sent.

If the workflow retries later, nothing is sent twice.


Why this is still a Process Manager — just evolved

If you strip away the LLMs, this workflow looks extremely familiar: durable state, deterministic routing, a pause for external input, correlation of the human decision back to the right instance, and idempotent side effects.

The difference is that agent frameworks make this explicit and programmable, rather than implicit and bespoke.

The Process Manager taught us that orchestration is a distributed systems problem. Agent workflows simply inherit that truth — with new failure modes.


Parallel execution: fan-out/fan-in

When steps are independent, they can run in parallel: for example, fetching order details and refund policy simultaneously. In code, this fan-out/fan-in pattern looks like:

# Assumes classify_email, fetch_order_details, fetch_refund_policy, and
# draft_reply are @task-decorated functions, and that `checkpointer` is a
# configured LangGraph checkpointer (e.g. an in-memory or database-backed one).
from langgraph.func import entrypoint

@entrypoint(checkpointer=checkpointer)
def handle_email(email: dict) -> dict:
    # Sequential: must classify first
    classification = classify_email(email).result()

    # Fan-out: these are independent, run in parallel
    order_future = fetch_order_details(email["order_id"])
    policy_future = fetch_refund_policy(classification["category"])

    # Fan-in: wait for both to complete
    order_details = order_future.result()
    refund_policy = policy_future.result()

    # Continue sequentially with merged context
    context = {**order_details, **refund_policy}
    draft = draft_reply(email, context).result()

    return draft

The key stability considerations for parallel execution:

• each branch must be independently idempotent and retryable

• the fan-in must handle one branch failing without losing the other's result

• merged state must not contain conflicting writes from different branches
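The failure-handling side of fan-in can be sketched with Python's standard-library concurrent.futures; the fetch functions here are stand-ins, with one branch deliberately failing:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_order(order_id: str) -> dict:
    raise TimeoutError("order service timed out")   # simulated branch failure

def fetch_policy(category: str) -> dict:
    return {"category": category, "refundable": True}

with ThreadPoolExecutor() as pool:
    futures = {
        "order": pool.submit(fetch_order, "o-42"),       # fan-out
        "policy": pool.submit(fetch_policy, "shipping"),
    }
    results, errors = {}, {}
    for name, fut in futures.items():                    # fan-in
        try:
            results[name] = fut.result(timeout=5)
        except Exception as exc:
            # One failed branch must not discard the surviving branch's result.
            errors[name] = str(exc)

assert "policy" in results and "order" in errors
```

With the partial result and the error both captured in state, the workflow can retry just the failed branch instead of redoing both.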


Closing thought

Stable agent systems are not built by chaining prompts.

They are built by applying decades of integration architecture — deliberately, explicitly, and with humility about failure.
