
Workflow Recovery

When the execution of a durable workflow is interrupted (for example, because its executor is restarted, shut down, or crashes), an executor must recover the workflow and resume its execution. To prevent duplicate work, it is important to detect interruptions promptly and to recover each workflow only once. This guide describes how to manage workflow recovery in a production environment.

Managing Recovery

Recovery on a Single Server

If you host an application on a single server without Conductor, then each time your application's process restarts, DBOS recovers all workflows that were executing before the restart (that is, all PENDING workflows).

Recovery in a Distributed Setting

When self-hosting in a distributed setting without Conductor, you must manage workflow recovery yourself so that when an executor crashes, restarts, or is shut down, its workflows are recovered. Assign each executor running a DBOS application an executor ID, either by setting the DBOS__VMID environment variable or by passing executor_id in the DBOS configuration. Each workflow is tagged with the ID of the executor that started it. When an application with an executor ID restarts, it recovers only the PENDING workflows assigned to that executor ID. You can also instruct an executor to recover workflows assigned to other executor IDs through the workflow recovery endpoint of the admin API.
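The selection rule described above can be illustrated with a small, self-contained sketch. It is a toy in-memory model, not DBOS's implementation: the WorkflowRecord class, both function names, the "local" fallback, and the precedence of the configured ID over DBOS__VMID are all assumptions made for illustration.

```python
import os
from dataclasses import dataclass


# Hypothetical in-memory record; this is not the DBOS schema or API.
@dataclass(frozen=True)
class WorkflowRecord:
    workflow_id: str
    status: str       # for example "PENDING" or "SUCCESS"
    executor_id: str  # ID of the executor that started the workflow


def resolve_executor_id(configured_id=None, environ=None, default="local"):
    """Pick an executor ID: explicit config, then DBOS__VMID, then a default.

    The precedence order and the "local" default are assumptions of this sketch.
    """
    env = os.environ if environ is None else environ
    return configured_id or env.get("DBOS__VMID") or default


def workflows_to_recover(records, executor_ids):
    """Return the PENDING workflows tagged with any of the given executor IDs."""
    wanted = set(executor_ids)
    return [r for r in records if r.status == "PENDING" and r.executor_id in wanted]


records = [
    WorkflowRecord("wf-1", "PENDING", "executor-a"),
    WorkflowRecord("wf-2", "SUCCESS", "executor-a"),
    WorkflowRecord("wf-3", "PENDING", "executor-b"),
]

# On restart, an executor recovers only its own pending workflows.
my_id = resolve_executor_id(environ={"DBOS__VMID": "executor-a"})
own = workflows_to_recover(records, [my_id])

# Recovering on behalf of another executor means passing that executor's ID too.
both = workflows_to_recover(records, [my_id, "executor-b"])
```

In this model, recovering another executor's workflows is the same filter applied to a different set of IDs, which mirrors the role the workflow recovery endpoint plays in the text.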

Recovery with Conductor

If your application is connected to DBOS Conductor, workflow recovery is automatic. When Conductor detects that an executor is unhealthy, it automatically signals another executor to recover its workflows.

When an executor disconnects from Conductor, its status is changed to DISCONNECTED while Conductor waits for it to reconnect. If it has not reconnected after a grace period, its status is changed to DEAD and Conductor signals another executor of a compatible application version to recover its workflows. After recovery is confirmed, the executor is deleted.
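The status transitions in the paragraph above can be sketched as a small state machine. This is only an illustrative model of the described behavior, not Conductor's code: the HEALTHY status name, the ExecutorState class, the function names, and the explicit timestamps and grace period values are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ExecutorStatus(Enum):
    HEALTHY = "HEALTHY"            # assumed name for a connected executor
    DISCONNECTED = "DISCONNECTED"
    DEAD = "DEAD"


@dataclass
class ExecutorState:
    executor_id: str
    status: ExecutorStatus = ExecutorStatus.HEALTHY
    disconnected_at: Optional[float] = None


def on_disconnect(state: ExecutorState, now: float) -> None:
    """Record the disconnect time and start waiting for a reconnect."""
    state.status = ExecutorStatus.DISCONNECTED
    state.disconnected_at = now


def on_reconnect(state: ExecutorState) -> None:
    """A reconnect within the grace period restores the executor."""
    if state.status is ExecutorStatus.DISCONNECTED:
        state.status = ExecutorStatus.HEALTHY
        state.disconnected_at = None


def tick(state: ExecutorState, now: float, grace_period: float) -> bool:
    """Mark the executor DEAD once the grace period has elapsed.

    Returns True exactly once, at the transition to DEAD, which is the point
    where another executor would be signaled to recover its workflows.
    """
    if (
        state.status is ExecutorStatus.DISCONNECTED
        and state.disconnected_at is not None
        and now - state.disconnected_at >= grace_period
    ):
        state.status = ExecutorStatus.DEAD
        return True
    return False
```

Returning True only at the transition to DEAD reflects the goal stated at the top of this guide: each interrupted workflow should be recovered only once.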