
How a run executes and recovers

Understand what a run is, the seven statuses it moves through, and how replay and resume let you recover from a failure.

A run is one execution of a published workflow version. Something fires the trigger, the platform walks the graph one step at a time, and the run ends in a terminal state. Every run is a record of what happened: which version executed, what the trigger sent in, what each step produced, and where it stopped.

Two facts shape everything else on this page. A run is pinned to the exact workflow version it started on, and a finished run is immutable. You never edit a run after the fact. When something goes wrong, you create a new run from the old one (replay) or you continue a run that is still parked or paused (resume). Understanding which of those two paths applies is the whole job of reading the run model.

What starts a run

A run begins one of two ways.

A live event fires the trigger of the workflow's active version. Only the active version receives live events, so a workflow with no active version produces no runs. The event can be a webhook delivery, a polling cycle, a schedule firing, an inbound email, an RSS fetch, and so on, depending on the trigger you configured. See triggers and actions for how each trigger kind shapes the payload.

A manual start runs the active version on demand, optionally with a payload you supply. This is how you test a published workflow without waiting for a real event. If nothing is published, a manual start returns an error telling you to publish first.

One run is created per activation. For batch-shaped triggers (webhook, RSS, polling) this matters: a poll that returns 50 items fires one run with all 50 items in the trigger payload, not 50 separate runs. To process those items one at a time you add a Loop node downstream. A digest-style workflow that handles the whole batch at once skips the Loop. This is a deliberate design choice so that you decide whether work fans out, rather than the platform guessing.
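The one-run-per-activation rule can be sketched in a few lines of Python. This is only an illustration: `start_run`, `loop_node`, and the payload shape are hypothetical names invented for the sketch, not the platform's API.

```python
from typing import Any

def start_run(trigger_payload: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical: one activation always yields exactly one run record."""
    return {"status": "pending", "trigger": trigger_payload}

def loop_node(items: list[Any]) -> list[dict[str, Any]]:
    """Hypothetical stand-in for a Loop node: one iteration per item, inside the same run."""
    return [{"iteration": i, "item": item} for i, item in enumerate(items)]

# A poll that returns 50 items activates the trigger once.
poll_result = {"items": [f"item-{n}" for n in range(50)]}
runs = [start_run(poll_result)]                       # one run, not 50
iterations = loop_node(runs[0]["trigger"]["items"])   # fan-out happens only if you add a Loop

print(len(runs), len(iterations))  # prints "1 50"
```

A digest-style workflow would simply read `runs[0]["trigger"]["items"]` as a whole and never call the loop.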

The seven run statuses

A run reports exactly one status at a time. There are seven.

| Status | Meaning | Terminal |
| --- | --- | --- |
| pending | The run was accepted but no step has started executing yet. | No |
| running | Steps are executing now. | No |
| waiting | The run is parked, either on a timer (a delay step) or on something external (an approval, a form submission, a wait-for-event, a sub-workflow barrier, or a client connection it needs). | No |
| paused | An operator paused the run, or run control halted it. | No |
| completed | Every step finished successfully. | Yes |
| failed | A step errored and the run stopped. | Yes |
| cancelled | An operator cancelled the run. | Yes |

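The three terminal states (completed, failed, cancelled) are final. A run in one of them will never change status on its own. The four non-terminal states can still move: pending becomes running (or paused, if it is paused before it starts), running can move to waiting, paused, or any terminal state, a waiting run returns to running once its condition is met, and a paused run resumes back to running. Any run that has not reached a terminal state can also be cancelled.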
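One way to hold the status model in your head is as a small transition table. The sketch below is an assumption read off the prose on this page (including the pause and cancel rules described further down), not an official specification; `TRANSITIONS`, `is_terminal`, and `can_move` are hypothetical names.

```python
TERMINAL = frozenset({"completed", "failed", "cancelled"})

# Assumed transition table, inferred from this page; the platform may allow
# or forbid moves that are not spelled out here.
TRANSITIONS = {
    "pending": {"running", "paused", "cancelled"},
    "running": {"waiting", "paused", "completed", "failed", "cancelled"},
    "waiting": {"running", "cancelled"},
    "paused": {"running", "cancelled"},
    "completed": set(),
    "failed": set(),
    "cancelled": set(),
}

def is_terminal(status: str) -> bool:
    """A terminal run never changes status again."""
    return status in TERMINAL

def can_move(current: str, target: str) -> bool:
    """True if the assumed table allows current -> target."""
    return target in TRANSITIONS[current]

print(can_move("paused", "running"))     # prints "True"
print(can_move("completed", "running"))  # prints "False"
```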

pending is usually brief. If a run sits in pending longer than you expect, the most common cause is that the workflow is not actually published and active, so confirm that first.

completed is the only success word

Steps inside a run share these seven statuses and add one more: skipped, for a step on a branch path that was not taken. A skipped step is expected, not an error. There is no separate "succeeded" status anywhere you will see it; the success state for both runs and steps is completed.

Why a run waits versus pauses

waiting and paused look similar in a list but mean different things, and they clear differently.

A run is waiting because the workflow itself asked it to. A delay step holds it until a timer fires. An approval step holds it until someone decides. A form step holds it until the form is submitted. A connection request holds it until a client authorizes the app the next step needs. In each case the run resumes automatically once the condition is met. You do not resume a waiting run by hand: a timer wait clears on its own, and for every other wait you resolve the underlying thing (approve the task, submit the form, get the client to authorize).

A run is paused because you (or a control action) stopped it. You resume it manually from the run detail page.

There is a third stop that is easy to confuse with a paused run: the safety breaker can set the workflow itself to Paused (safety) after a problem such as a runaway loop or a spend cap being reached. That is a workflow status, not a run status, and it stops new runs from starting rather than freezing one in progress. See why a run paused for the full set of reasons and exactly how each one clears.

Reading a failed run

A failed run stopped because one step errored. The recovery decision starts with reading that step.

Open the run, find the first step with status failed in the step timeline (the run detail view links you straight to it and expands it), and read its error. Each failed step carries a structured error you can act on:

  • A code, a short machine-readable string such as AUTH_MISSING or TIMEOUT.
  • A message in plain language.
  • A provider status, shown as "Provider status: 429" when the failure came from an upstream app's HTTP response.
  • A retryable flag, shown as "(retryable)" next to the provider status when re-running the step has a reasonable chance of succeeding.
  • An optional technical detail block holding the code and any extra context for support.

The retryable signal is the fork in the road. A retryable failure (a timeout, a rate limit, a transient 5xx from a provider) usually clears on its own; replaying the run as-is is the right move. A non-retryable failure (a bad credential, a malformed mapping, a missing required field) will fail again identically until you change the workflow, so fix the configuration first, then replay.
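The decision above can be written down as a small function. This is a minimal sketch: the `StepError` class and its field names are hypothetical and only mirror the error shape described here (code, message, provider status, retryable), and the retryable values assigned to the two example codes are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class StepError:
    """Hypothetical container mirroring the structured error on a failed step."""
    code: str
    message: str
    provider_status: Optional[int] = None  # HTTP status from the upstream app, if any
    retryable: bool = False

def next_action(error: StepError) -> str:
    """Retryable failures are replayed as-is; everything else needs a fix first."""
    if error.retryable:
        return "replay as-is"
    return "fix the configuration, then replay"

# Assumed flags: a timeout is treated as transient, a missing credential is not.
timeout = StepError("TIMEOUT", "The provider did not respond in time.", retryable=True)
auth = StepError("AUTH_MISSING", "No credential is connected for this app.")

print(next_action(timeout))  # prints "replay as-is"
print(next_action(auth))     # prints "fix the configuration, then replay"
```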

There is no fixed catalog of error codes. Codes come from the platform and from the apps your steps call, so treat the shape (code, message, provider status, retryable) as the thing to read, not a closed list to memorize. Read a failed run walks a full example.

Recovery path one: replay

Replay creates a brand-new run from an existing one. It does not re-execute the old run in place; the original record stays exactly as it was, and you get a fresh newRunId to watch. This is why replay is safe to reach for: you are never destroying evidence of what went wrong.

Replay is offered only for terminal runs, those in completed, failed, or cancelled. You cannot replay a run that is still in flight (pending, running, waiting, or paused); let it finish, or cancel it, first.

There are three replay modes:

  • Replay as-is re-runs the workflow unchanged with the same trigger input. This is the mode you want after fixing a configuration problem or recovering from a transient failure.
  • Replay with a new prompt and replay with a new model swap the AI prompt or model used by an agent step. These are operator-grade and ask you to paste internal identifiers by hand, so reach for them only when you are deliberately re-running an AI step against a different prompt or model.

If you replay against a different workflow version than the one the run started on, and that version renamed or restructured nodes, you resolve the renamed nodes before the replay can run. Replay a run covers the modes and node mapping in detail.

Replay has limits worth knowing before you lean on it:

Replay limits

Replays are capped at 20 per minute per workspace. A replay chain (replaying a replay of a replay) is capped at a depth of 5. You cannot replay an in-flight run, and you cannot replay across workspaces. A workflow in the Paused (safety) state rejects replays until you clear the safety pause.
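To see how these limits combine, here is a rough pre-flight check in Python. It is a sketch under stated assumptions, not how the platform enforces them: the function, the run fields (`status`, `workspace_id`, `replay_depth`), the sliding 60-second window, and the convention that an original run has depth 0 are all hypothetical.

```python
from collections import deque
from typing import Optional

MAX_REPLAYS_PER_MINUTE = 20   # per workspace, from the limits above
MAX_CHAIN_DEPTH = 5           # replay of a replay of a replay ...
TERMINAL = {"completed", "failed", "cancelled"}

def check_replay(run: dict, workspace_id: str, safety_paused: bool,
                 recent: deque, now: float) -> Optional[str]:
    """Return a rejection reason, or None if the replay may proceed.

    `recent` holds timestamps (seconds) of this workspace's recent replays.
    """
    if run["status"] not in TERMINAL:
        return "run is still in flight"
    if run["workspace_id"] != workspace_id:
        return "cannot replay across workspaces"
    if safety_paused:
        return "workflow is in Paused (safety)"
    # Assumption: an original run has depth 0 and each replay adds 1.
    if run["replay_depth"] + 1 > MAX_CHAIN_DEPTH:
        return "replay chain too deep"
    # Assumption: a sliding 60-second window; the real window may differ.
    while recent and now - recent[0] >= 60:
        recent.popleft()
    if len(recent) >= MAX_REPLAYS_PER_MINUTE:
        return "rate limit reached"
    recent.append(now)
    return None

recent: deque = deque()
failed_run = {"status": "failed", "workspace_id": "ws1", "replay_depth": 0}
results = [check_replay(failed_run, "ws1", False, recent, now=float(i)) for i in range(21)]
print(results[19], results[20])  # prints "None rate limit reached"
```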

Recovery path two: pause, resume, and cancel

The second path is controlling a run that is still alive. Three controls exist, and each is valid only for certain statuses:

  • Pause stops a running or pending run. The run moves to paused.
  • Resume continues a paused run. It moves back to running.
  • Cancel stops any run that has not already reached a terminal state. Cancelling is permanent: it ends all pending and running steps and the run becomes cancelled. You can record an optional reason; if you do not, it is recorded as "Cancelled by user".

These controls enforce valid transitions on the server, not just in the interface. If your view is stale (say someone else cancelled the run while your page was open) and you fire an action that no longer applies, the action is rejected rather than silently doing the wrong thing. Refresh and look at the current status.
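The stale-view case is easier to picture with a small model of server-side validation. The following is a sketch only; `apply_control`, `InvalidTransition`, and the dictionary-shaped run are hypothetical, and the allowed statuses are taken from the three controls listed above.

```python
from typing import Optional

class InvalidTransition(Exception):
    """Raised when a control does not apply to the run's current status."""

# Which statuses each control accepts, and where it leaves the run.
ALLOWED_FROM = {
    "pause": {"pending", "running"},
    "resume": {"paused"},
    "cancel": {"pending", "running", "waiting", "paused"},  # any non-terminal status
}
RESULT = {"pause": "paused", "resume": "running", "cancel": "cancelled"}

def apply_control(run: dict, action: str, reason: Optional[str] = None) -> dict:
    """Validate against the run's current status, then apply the control."""
    if run["status"] not in ALLOWED_FROM[action]:
        raise InvalidTransition(f"cannot {action} a run that is {run['status']}")
    run["status"] = RESULT[action]
    if action == "cancel":
        run["cancel_reason"] = reason or "Cancelled by user"
    return run

run = {"status": "running"}
apply_control(run, "cancel")           # someone else cancels the run
try:
    apply_control(run, "pause")        # your stale page still shows "running"
except InvalidTransition as exc:
    print(exc)  # prints "cannot pause a run that is cancelled"
```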

Resuming is not the same as replaying. Resume continues the same run from where it paused. Replay starts a new run from the beginning. Use resume when a run is merely paused or parked; use replay when a run already finished and you want to run it again.

Runs are pinned to their version

Each run carries the workflow version it started on, and that pin never changes for the life of the run. This is what makes editing a live workflow safe.

When you edit an active version, the platform forks your changes into a new draft and leaves the running version untouched. Publishing that draft is what swaps which version is live. Runs already in flight keep executing on the version they started on; new events go to the newly published version. Nothing in progress is disrupted, and a run's record always reflects the exact graph that produced it. Workflows explains the draft-to-published lifecycle behind this.
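A toy model makes the pinning behavior concrete. The `Workflow` and `Run` classes below are hypothetical and drastically simplified; the only point is that a run copies the active version when it starts and publishing never touches runs that already exist.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Run:
    version: int            # pinned at start; never reassigned
    status: str = "running"

@dataclass
class Workflow:
    active_version: Optional[int] = None
    runs: list = field(default_factory=list)

    def publish(self, version: int) -> None:
        """Swap which version is live; existing runs are left alone."""
        self.active_version = version

    def fire(self) -> Optional[Run]:
        """A live event starts a run on the active version, if there is one."""
        if self.active_version is None:
            return None     # no active version, so no run
        run = Run(version=self.active_version)
        self.runs.append(run)
        return run

wf = Workflow()
unpublished = wf.fire()
wf.publish(1)
first = wf.fire()
wf.publish(2)               # first is still in flight on version 1
second = wf.fire()

print(unpublished, first.version, second.version)  # prints "None 1 2"
```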

This pin is also why cross-version replay needs node mapping. If you replay an old run against a newer version whose nodes were renamed, the platform cannot assume which new node replaces which old one, so it asks you.
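The mapping question can be illustrated with a small helper that lists which old nodes still need a decision. It is a sketch with made-up node names; `unresolved_nodes` and the mapping format are hypothetical, not the platform's replay interface.

```python
def unresolved_nodes(old_nodes: list, new_nodes: list, mapping: dict) -> list:
    """Old node names with no same-named node in the new version
    and no operator-supplied mapping to a node that exists there."""
    new = set(new_nodes)
    return [n for n in old_nodes if n not in new and mapping.get(n) not in new]

# Hypothetical graphs: one node was renamed between versions.
old = ["fetch_orders", "summarize", "send_email"]
new = ["fetch_orders", "summarize_orders", "send_email"]

print(unresolved_nodes(old, new, {}))                                 # prints "['summarize']"
print(unresolved_nodes(old, new, {"summarize": "summarize_orders"}))  # prints "[]"
```

An empty result is the condition under which a cross-version replay could proceed in this model.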

What runs do not do

Be honest with yourself about two things the run model does not give you.

A finished run cannot be "fixed" or edited. There is no in-place re-execution and no partial restart from the middle of a failed run. Recovery always means a new run (replay) from the start, or continuing a run that is still alive (resume). If you fixed a bug halfway through a long workflow, the replay still starts at step one.

Run-scoped variables are not available. Steps read each other's output through expressions, but there is no general scratchpad you can write to and read back later in the run. Pass data forward through step output, not through a variable store.

Where to go next

  • Run and step statuses defines every status with its next action.
  • Workflows explains how publishing swaps which version is live without disrupting in-flight runs.