Metrics

If you are using Conductor, you can scrape metrics about your applications' workflows, steps, and executors from a Prometheus-compatible endpoint. This lets you monitor your DBOS applications in Prometheus, Grafana, or any other tool that understands the OpenMetrics format.

Info: Metrics require at least a DBOS Teams plan.

Info: Metrics require DBOS Python >=2.23.0 or DBOS TypeScript >=4.19.

The Metrics Endpoint

Conductor exposes metrics for all of your applications at a single Prometheus-compatible OpenMetrics scrape endpoint:

https://cloud.dbos.dev/v1/metrics

The endpoint is authenticated with a Conductor API key, passed as a bearer token in the Authorization header. You can generate an API key from the key settings page of the DBOS Console. Make sure to enable the metrics read permission for the key.

A scrape is a simple authenticated GET:

curl https://cloud.dbos.dev/v1/metrics \
  -H "Authorization: Bearer $DBOS_API_KEY" \
  -H "Accept: application/openmetrics-text"
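The same scrape can also be issued from a script. The following is a minimal Python sketch using only the standard library; the helper names build_scrape_request and scrape are illustrative assumptions and are not part of any DBOS SDK.

```python
import urllib.request

METRICS_URL = "https://cloud.dbos.dev/v1/metrics"


def build_scrape_request(api_key: str, url: str = METRICS_URL) -> urllib.request.Request:
    """Build an authenticated GET request for the metrics endpoint."""
    return urllib.request.Request(
        url,
        headers={
            # The Conductor API key is passed as a bearer token.
            "Authorization": f"Bearer {api_key}",
            "Accept": "application/openmetrics-text",
        },
        method="GET",
    )


def scrape(api_key: str) -> str:
    """Perform the scrape and return the raw exposition text."""
    with urllib.request.urlopen(build_scrape_request(api_key), timeout=30) as resp:
        return resp.read().decode("utf-8")
```

Calling scrape(os.environ["DBOS_API_KEY"]) would then return the same text that the curl command prints, assuming the key has the metrics read permission.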

Endpoint Integrations

The endpoint works with any tool that can scrape the OpenMetrics or Prometheus exposition format.

To scrape the endpoint from Prometheus, add a job like the following to your prometheus.yml. Store your API key in a file and reference it with authorization.credentials_file (or set the key inline with authorization.credentials):

scrape_configs:
  - job_name: dbos
    scheme: https
    metrics_path: /v1/metrics
    scrape_interval: 60s
    honor_timestamps: true
    static_configs:
      - targets: ["cloud.dbos.dev"]
    authorization:
      type: Bearer
      credentials_file: /etc/prometheus/dbos_api_key

Set honor_timestamps: true so the window timestamps the endpoint emits are preserved.

Filtering metrics

By default the endpoint returns every metric for every application in your organization. You can narrow a scrape with these repeatable query parameters:

| Parameter | Description |
|---|---|
| applications | Only report metrics for the named application(s). Matched exactly. |
| workflow_names | Only report metrics for the named workflow(s). |
| metrics | Only emit the named metric families (e.g. dbos_conductor_v1_workflow_success_rate). |

Each parameter may be repeated to select multiple values, for example:

https://cloud.dbos.dev/v1/metrics?applications=my-app&applications=my-other-app
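When building such URLs programmatically, each value needs its own copy of the parameter. A small Python sketch of that, where build_metrics_url is a hypothetical helper name:

```python
from urllib.parse import urlencode

METRICS_URL = "https://cloud.dbos.dev/v1/metrics"


def build_metrics_url(applications=(), workflow_names=(), metrics=()):
    """Return the scrape URL with each filter parameter repeated once per value."""
    params = (
        [("applications", a) for a in applications]
        + [("workflow_names", w) for w in workflow_names]
        + [("metrics", m) for m in metrics]
    )
    query = urlencode(params)  # a list of pairs keeps repeated keys and their order
    return f"{METRICS_URL}?{query}" if query else METRICS_URL
```

With applications=["my-app", "my-other-app"] this produces the example URL shown here; with no filters it returns the bare endpoint.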

Available Metrics

Every metric this endpoint emits is an OpenMetrics gauge. All metric names are prefixed with dbos_conductor_v1_, and every series carries an application label.

Aggregation window

Each scrape reports data for the most recently completed clock-aligned minute. For example, a scrape at any time during 12:34 reports data aggregated over the window [12:33:00, 12:34:00).
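As a sketch of that rule, the window for a given scrape time can be computed by truncating to the minute and stepping back one minute. The function name aggregation_window is an assumption made for illustration only.

```python
from datetime import datetime, timedelta, timezone


def aggregation_window(scrape_time: datetime) -> tuple[datetime, datetime]:
    """Return the half-open window [start, end) reported by a scrape at scrape_time."""
    end = scrape_time.replace(second=0, microsecond=0)  # start of the current minute
    start = end - timedelta(minutes=1)  # the most recently completed minute
    return start, end


# A scrape at 12:34:27 (hypothetical date) reports [12:33:00, 12:34:00).
start, end = aggregation_window(datetime(2024, 1, 1, 12, 34, 27, tzinfo=timezone.utc))
print(start.time(), end.time())
```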

Although every metric is a gauge, the value a gauge carries falls into one of three flavors, noted in the Measurement column below:

  • Rate — a per-second average over the window. For example, if 120 workflows succeeded in the window, workflow_success_rate reports 2; multiply by 60 to recover the count over the minute. Because these are already-averaged gauges (not counters), do not wrap them in PromQL rate().
  • Point-in-time — the value at scrape time, not tied to the window (for example, the number of workflows currently enqueued).
  • Windowed — an aggregate, such as a maximum, computed over the window.
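The conversion between a rate gauge and the underlying count is plain arithmetic over the 60-second window. A short Python sketch, with hypothetical helper names:

```python
WINDOW_SECONDS = 60  # each scrape covers one clock-aligned minute


def count_to_rate(count: float, window_seconds: int = WINDOW_SECONDS) -> float:
    """Per-second average that a rate gauge would report for `count` events."""
    return count / window_seconds


def rate_to_count(rate_per_second: float, window_seconds: int = WINDOW_SECONDS) -> float:
    """Recover the number of events in the window from a rate gauge value."""
    return rate_per_second * window_seconds
```

For the example in the list, count_to_rate(120) gives 2.0 and rate_to_count(2.0) gives back 120.0.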

Rate and windowed metrics are stamped with the window's timestamp (so scrapes within the same minute deduplicate); point-in-time metrics carry no explicit timestamp and use the scrape time.

The windowed metrics (workflow_max_queue_wait_seconds, workflow_max_total_latency_seconds, and step_max_duration_seconds) report a maximum per label group. When you combine groups in a query, aggregate them with max() — a maximum of maximums is still a maximum — rather than sum() or avg(), which are not meaningful over these values.
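To see why max() is the right way to combine these values, consider the following Python sketch. The queue names and wait times are made-up illustration data, not output from the endpoint.

```python
# Hypothetical raw queue waits (seconds) for workflows in one window, by queue.
raw_waits = {
    "emails": [0.4, 3.2],
    "reports": [7.5, 2.0],
    "billing": [1.1],
}

# What a windowed metric would carry: one maximum per label group.
per_group_max = {queue: max(waits) for queue, waits in raw_waits.items()}

# Ground truth: the largest wait across every workflow in every queue.
overall_max = max(wait for waits in raw_waits.values() for wait in waits)

# Taking the max of the per-group maxima recovers the true overall maximum.
combined = max(per_group_max.values())

# Summing the maxima yields a number that matches no real workflow's wait.
summed = sum(per_group_max.values())
```

Here combined equals overall_max (7.5), while summed is larger than any wait that actually occurred.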

Workflow metrics

These metrics are labeled by workflow_name and, where noted, queue_name.

| Metric | Measurement | Description |
|---|---|---|
| workflow_started_rate | Rate | Workflows created per second. Labeled by queue. |
| workflow_dequeued_rate | Rate | Enqueued workflows dequeued per second. Workflows that were never enqueued are not counted. Labeled by queue. |
| workflow_success_rate | Rate | Workflows that completed successfully per second. Labeled by queue. |
| workflow_failed_rate | Rate | Workflows that terminated with an error (ERROR or MAX_RECOVERY_ATTEMPTS_EXCEEDED) per second. Labeled by queue. |
| workflow_cancelled_rate | Rate | Workflows that were cancelled per second. Labeled by queue. |
| workflow_enqueued_count | Point-in-time | Workflows currently in the ENQUEUED state. Labeled by queue. |
| workflow_pending_count | Point-in-time | Workflows currently in the PENDING (executing) state. |
| workflow_oldest_enqueued_timestamp_seconds | Point-in-time | Unix timestamp (seconds) of the oldest workflow currently ENQUEUED. Use time() - <metric> to derive its age. No series is emitted when no workflows are enqueued. Labeled by queue. |
| workflow_oldest_pending_timestamp_seconds | Point-in-time | Unix timestamp (seconds) of the oldest workflow currently PENDING. Use time() - <metric> to derive its age. No series is emitted when no workflows are pending. |
| workflow_max_queue_wait_seconds | Windowed | Maximum queue wait (created to first started), in seconds, across workflows that completed successfully in the window. Labeled by queue. |
| workflow_max_total_latency_seconds | Windowed | Maximum end-to-end latency (created to completed), in seconds, across workflows that completed successfully in the window. Labeled by queue. |
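The two oldest-timestamp metrics report a Unix timestamp, not an age, and may be absent altogether. A minimal Python sketch of deriving the age outside of PromQL, where age_seconds is a hypothetical helper and None stands in for "no series was emitted":

```python
import time
from typing import Optional


def age_seconds(oldest_timestamp_seconds: Optional[float],
                now: Optional[float] = None) -> Optional[float]:
    """Age of the oldest workflow, or None when no series was emitted."""
    if oldest_timestamp_seconds is None:
        return None  # nothing is currently enqueued (or pending)
    if now is None:
        now = time.time()  # same role as PromQL's time()
    return now - oldest_timestamp_seconds
```

Treating the missing series as None rather than 0 keeps "queue is empty" distinct from "oldest workflow was enqueued just now".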

Step metrics

These metrics are labeled by step_name.

| Metric | Measurement | Description |
|---|---|---|
| step_success_rate | Rate | Workflow steps that completed successfully per second. |
| step_failed_rate | Rate | Workflow steps that terminated with an error per second. |
| step_max_duration_seconds | Windowed | Maximum single-step duration, in seconds, across steps that completed successfully in the window. |

Executor metrics

| Metric | Measurement | Description |
|---|---|---|
| executor_count | Point-in-time | Number of executors registered for the application, labeled by status and application_version. |

Example Queries

A few example PromQL queries:

# Number of workflows that completed successfully in the past hour, across all workflows and queues.
# workflow_success_rate is a per-second gauge, so average it over the hour and multiply by 3600 seconds.
sum(avg_over_time(dbos_conductor_v1_workflow_success_rate{application="my-app"}[1h])) * 3600

# Age, in seconds, of the oldest currently enqueued workflow
time() - dbos_conductor_v1_workflow_oldest_enqueued_timestamp_seconds

# Number of healthy executors per application
sum by (application) (dbos_conductor_v1_executor_count{status="HEALTHY"})
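The arithmetic behind the first query (average the per-second gauge over the range, then multiply by the range length in seconds) can be checked with a small Python sketch. The sample values are invented, and count_from_rate_samples is a hypothetical helper that mirrors the query for a single series.

```python
def count_from_rate_samples(rates_per_second, range_seconds):
    """Mirror avg_over_time(rate[range]) * range_seconds for one series."""
    if not rates_per_second:
        return 0.0
    average_rate = sum(rates_per_second) / len(rates_per_second)
    return average_rate * range_seconds


# Hypothetical hour of one-minute samples: 30 busy minutes at 2 workflows
# per second, followed by 30 idle minutes.
samples = [2.0] * 30 + [0.0] * 30

estimated = count_from_rate_samples(samples, 3600)

# Cross-check: each sample covers 60 seconds, so sum the per-minute counts.
exact = sum(rate * 60 for rate in samples)
```

Both estimated and exact come out to 3600.0 workflows here. The two agree only when every minute in the range has a sample; missed scrapes would skew the average.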