Runs & debugging
A job run is a single execution of a job. Runs are immutable — re-running a job creates a new run, never mutating an old one. This page covers the run lifecycle, the recovery actions, and how to diagnose a misbehaving run.
For the job definition, see Defining jobs. For scheduling, see Scheduling.
Prerequisites
- A saved job that has been triggered at least once (manually or on schedule) so there's a run to inspect.
Run lifecycle
Runs move through a detailed lifecycle so you can tell exactly where a stuck run is stuck.
| Status | Meaning |
|---|---|
| `pending` | Run accepted, queued behind the scheduler |
| `ready` | Ready to be placed on a node |
| `starting` | Kubernetes pod being created |
| `creating_sail` / `waiting_for_sail` | Sail (the query engine) is being deployed |
| `creating_runner` | Runner pod starting |
| `running` | Your code is executing |
| `succeeded` / `failed` / `cancelled` / `timeout` | Terminal states |
Most runs move from pending to running in under a minute on a warm cluster. The first run on a cold cluster is slower — Karpenter has to provision new compute nodes before anything else happens.
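Because the terminal states are a closed set, waiting on a run is a simple poll loop. A minimal sketch, assuming a hypothetical `client` object whose `describe_job_run` returns a dict with a `status` key (the client and its method name are illustrative, not a documented SDK):

```python
import time

# Terminal statuses from the lifecycle table above.
TERMINAL = {"succeeded", "failed", "cancelled", "timeout"}

def wait_for_run(client, run_id, poll_interval=5.0, timeout=600.0):
    """Poll a run until it reaches a terminal state, or raise after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = client.describe_job_run(run_id)["status"]
        if status in TERMINAL:
            return status
        time.sleep(poll_interval)
    raise TimeoutError(f"run {run_id} still not terminal after {timeout}s")
```

Any status not in the terminal set (including all the intermediate `creating_*` states) is treated as "still in flight", so new intermediate states don't break the loop.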
Inspect a run
Click into a run for:
- Logs — stdout and stderr from Sail and the runner.
- Metrics — CPU, memory, duration.
- Result — any output the job wrote.
- Version link — back to the exact version of the job that dispatched. Useful when the job has changed since this run fired.
Recovery actions
Two actions on non-terminal runs:
- Retry — create a new run of the same job version. Fastest path to recovery when a transient infrastructure failure tripped a pipeline.
- Release — detach the run from its Kubernetes resources without cancelling. Use when something's stuck at the infra layer and you want to clean up without aborting running code.
Retry creates a new run object. Release leaves the existing run in its current state but frees the pod and any cluster-side resources.
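The immutability rule makes the two actions easy to model: retry constructs a new run, release edits only the resource attachment. A sketch of that distinction (the `Run` fields and function names here are illustrative, not platform API):

```python
from dataclasses import dataclass, replace
from typing import Optional

@dataclass(frozen=True)
class Run:
    run_id: str
    job_version: str
    status: str
    pod_name: Optional[str] = None

def retry(run: Run, new_run_id: str) -> Run:
    # Retry never mutates the old run: it yields a fresh run of the same job version.
    return Run(run_id=new_run_id, job_version=run.job_version, status="pending")

def release(run: Run) -> Run:
    # Release keeps the run's status; only the cluster-side resources are detached.
    return replace(run, pod_name=None)
```

The frozen dataclass mirrors the "runs are immutable" guarantee: both actions return new objects and leave the original record intact for history.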
Debugging checklist
When a run misbehaves, work from the layer closest to the failure outward:
1. Run status is failed
Open the run and read the logs. Application errors — SQL syntax, missing columns, Python exceptions — surface here with stack traces.
Common causes:
- Table or column doesn't exist in the catalog.
- Permission denied on S3 / Glue (check the IAM role's workload boundary).
- Python wheel missing a dependency.
2. Run stuck in waiting_for_sail
The Sail pod couldn't reach Ready within ~10 minutes. See the troubleshooting entry for the full diagnosis. Usually: cluster out of compute capacity, image pull failure, or EC2 service quota hit.
3. Run status is timeout
The job exceeded its configured timeout. Either raise the timeout (if the workload legitimately grew) or investigate why it's slower than expected — a common cause is a catalog that returned a much larger result set than before.
4. Scheduled tick didn't fire
- Is the job Active (not paused)?
- Does the cron expression match what you think it does? Test it with a cron parser.
- Was the cluster healthy at the expected time? A `failed` or `destroying` cluster drops ticks on the floor.
- Does the missed-schedule policy match your expectations? `latest` only fires once on recovery.
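For sanity-checking a cron expression, a tiny stdlib-only matcher is enough. This sketch handles only `*`, `*/n`, comma lists, and plain numbers (no ranges or names), which covers the common scheduling patterns:

```python
from datetime import datetime

def _field_matches(field: str, value: int) -> bool:
    # Supports "*", "*/n", comma lists, and plain numbers: a small subset of cron syntax.
    for part in field.split(","):
        if part == "*":
            return True
        if part.startswith("*/") and value % int(part[2:]) == 0:
            return True
        if part.isdigit() and int(part) == value:
            return True
    return False

def cron_matches(expr: str, when: datetime) -> bool:
    """True if a 5-field cron expression fires at the minute `when`."""
    minute, hour, dom, month, dow = expr.split()
    return (_field_matches(minute, when.minute)
            and _field_matches(hour, when.hour)
            and _field_matches(dom, when.day)
            and _field_matches(month, when.month)
            and _field_matches(dow, when.isoweekday() % 7))  # cron convention: 0 = Sunday
```

Feed it the datetimes you expected ticks at; a `False` where you expected `True` usually means the expression doesn't say what you think it does.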
5. Schedule fires but runs overlap unexpectedly
Revisit the concurrency policy. If ticks are piling up, `skip` or `replace` prevents the pileup; `allow` without a sensible `maxConcurrentRuns` is a foot-gun.
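The difference between the three policies comes down to what a tick does when a run is still in flight. An illustrative decision table (the return values are made up; the policy names follow the text above):

```python
def on_tick(policy: str, active_runs: list) -> str:
    """What a scheduler tick does when earlier runs may still be in flight."""
    if not active_runs:
        return "start"
    if policy == "skip":
        return "drop_tick"            # tick discarded; no pileup
    if policy == "replace":
        return "cancel_then_start"    # running work gives way to the new tick
    if policy == "allow":
        return "start_overlapping"    # runs accumulate; bound this with maxConcurrentRuns
    raise ValueError(f"unknown policy: {policy}")
```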
6. "failed to create kubernetes client" at start
Infrastructure-layer failure, not job logic. See troubleshooting.
When to cancel vs. retry vs. release
- Cancel — you don't want the work to complete. Produces a terminal `cancelled` state.
- Retry — you want the work to run again, fresh. Leaves the failed run in history and creates a new one.
- Release — the run's Kubernetes resources are orphaned or stuck, but the run state is fine. Frees the pod without affecting status.
If you're unsure, cancel first. Cancelling is reversible in effect (you can re-trigger a new run); releasing without cancelling leaves a potentially still-running pod behind.
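The guidance above reduces to two questions. A sketch of that decision, not platform behavior:

```python
def choose_action(want_work_to_finish: bool, only_infra_is_stuck: bool) -> str:
    """Pick cancel / retry / release from two yes-no questions."""
    if not want_work_to_finish:
        return "cancel"    # terminal state; re-trigger later if needed
    if only_infra_is_stuck:
        return "release"   # frees the pod, run status untouched
    return "retry"         # fresh run of the same job version
```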
API reference
- Job runs — `CreateJobRun`, `DescribeJobRun`, `RetryJobRun`, `ReleaseJobRun`, `DeleteJobRun`.