Runs & debugging

A job run is a single execution of a job. Runs are immutable — re-running a job creates a new run, never mutating an old one. This page covers the run lifecycle, the recovery actions, and how to diagnose a misbehaving run.

For the job definition, see Defining jobs. For scheduling, see Scheduling.

Prerequisites

  • A saved job that has been triggered at least once (manually or on schedule) so there's a run to inspect.

Run lifecycle

Runs move through a detailed lifecycle so you can tell exactly where a stuck run stopped.

  • pending — Run accepted, queued behind the scheduler
  • ready — Ready to be placed on a node
  • starting — Kubernetes pod being created
  • creating_sail / waiting_for_sail — Sail (the query engine) is being deployed
  • creating_runner — Runner pod starting
  • running — Your code is executing
  • succeeded / failed / cancelled / timeout — Terminal states

Most runs move from pending to running in under a minute on a warm cluster. The first run on a cold cluster is slower — Karpenter has to provision new compute nodes before anything else happens.
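A caller waiting on a run can poll it until it reaches one of the terminal states listed above. A minimal sketch — the client object and its describe_job_run method are illustrative assumptions, not this platform's actual SDK:

```python
import time

# Terminal states from the lifecycle table above.
TERMINAL = {"succeeded", "failed", "cancelled", "timeout"}

def wait_for_run(client, run_id, poll_seconds=5, timeout_seconds=900):
    """Poll a run until it reaches a terminal state, then return that state.

    `client.describe_job_run` is a hypothetical stand-in for whatever
    DescribeJobRun call your SDK exposes.
    """
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        status = client.describe_job_run(run_id)["status"]
        if status in TERMINAL:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"run {run_id} not terminal after {timeout_seconds}s")
```

Anything not in the terminal set (pending, starting, waiting_for_sail, …) is treated as "still in flight", so new intermediate states don't break the loop.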

Inspect a run

Click into a run for:

  • Logs — stdout and stderr from Sail and the runner.
  • Metrics — CPU, memory, duration.
  • Result — any output the job wrote.
  • Version link — back to the exact version of the job that was dispatched. Useful when the job definition has changed since this run fired.

Recovery actions

Two recovery actions are available on non-terminal runs:

  • Retry — create a new run of the same job version. Fastest path to recovery when a transient infrastructure failure tripped a pipeline.
  • Release — detach the run from its Kubernetes resources without cancelling. Use when something's stuck at the infra layer and you want to clean up without aborting running code.

Retry creates a new run object. Release leaves the existing run in its current state but frees the pod and any cluster-side resources.
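The distinction can be sketched as a toy data model — the Run fields and function names here are assumptions for illustration, not the platform's actual schema:

```python
import dataclasses
import uuid
from typing import Optional

@dataclasses.dataclass
class Run:
    run_id: str
    job_version: str
    status: str            # lifecycle status from the table above
    pod: Optional[str]     # attached Kubernetes pod, if any

def retry(run: Run) -> Run:
    """Retry never mutates the old run; it creates a fresh run
    of the same job version, back at the start of the lifecycle."""
    return Run(run_id=str(uuid.uuid4()), job_version=run.job_version,
               status="pending", pod=None)

def release(run: Run) -> Run:
    """Release keeps the run's status as-is but detaches its
    cluster-side resources (here, just the pod reference)."""
    return dataclasses.replace(run, pod=None)
```

Retry changes history (a new run appears); release changes only the infrastructure attachment of the run you already have.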

Debugging checklist

When a run misbehaves, work from the layer closest to the failure outward:

1. Run status is failed

Open the run and read the logs. Application errors — SQL syntax, missing columns, Python exceptions — surface here with stack traces.

Common causes:

  • Table or column doesn't exist in the catalog.
  • Permission denied on S3 / Glue (check the IAM role's workload boundary).
  • Python wheel missing a dependency.

2. Run stuck in waiting_for_sail

The Sail pod couldn't reach Ready within ~10 minutes. See the troubleshooting entry for the full diagnosis. Usually: cluster out of compute capacity, image pull failure, or EC2 service quota hit.

3. Run status is timeout

The job exceeded its configured timeout. Either raise the timeout (if the workload legitimately grew) or investigate why it's slower than expected — a common cause is a catalog query that returned a much larger result set than before.

4. Scheduled tick didn't fire

  • Is the job Active (not paused)?
  • Does the cron expression match what you think it does? Test it with a cron parser.
  • Was the cluster healthy at the expected time? A failed or destroying cluster drops ticks on the floor.
  • Does the missed-schedule policy match your expectations? latest only fires once on recovery.
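For the cron check in the list above, a third-party parser works, but you can also sanity-check an expression with a few lines of stdlib Python. This sketch supports only a subset of cron syntax (`*`, `*/n`, comma lists, plain numbers) and simplifies real cron, which ORs day-of-month and day-of-week when both are restricted:

```python
from datetime import datetime

def _field_matches(field: str, value: int) -> bool:
    """Match one cron field: '*', '*/n', 'a,b,c', or a plain number."""
    for part in field.split(","):
        if part == "*":
            return True
        if part.startswith("*/"):
            if value % int(part[2:]) == 0:
                return True
        elif int(part) == value:
            return True
    return False

def cron_matches(expr: str, when: datetime) -> bool:
    """True if a 5-field cron expression (min hour dom month dow) fires at `when`."""
    minute, hour, dom, month, dow = expr.split()
    cron_dow = (when.weekday() + 1) % 7  # cron uses 0 = Sunday; weekday() uses 0 = Monday
    return (_field_matches(minute, when.minute)
            and _field_matches(hour, when.hour)
            and _field_matches(dom, when.day)
            and _field_matches(month, when.month)
            and _field_matches(dow, cron_dow))
```

Feeding it the datetimes of a few expected ticks quickly shows whether the expression means what you think it does.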

5. Schedule fires but runs overlap unexpectedly

Revisit the concurrency policy. If ticks are piling up, skip or replace prevents the pileup; allow without a sensible maxConcurrentRuns is a foot-gun.
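The policy semantics can be sketched as a decision function — the policy names come from this page, but the function shape and return values are illustrative assumptions:

```python
def handle_tick(policy, active_runs, max_concurrent=None):
    """Decide what a new scheduler tick does given runs already in flight.

    Returns (action, runs_to_cancel), where action is 'start' or 'skip'.
    """
    if policy == "allow":
        # Runs accumulate freely unless max_concurrent bounds them —
        # this is the foot-gun the text warns about.
        if max_concurrent is not None and len(active_runs) >= max_concurrent:
            return "skip", []
        return "start", []
    if not active_runs:
        return "start", []
    if policy == "skip":
        return "skip", []           # drop this tick entirely
    if policy == "replace":
        return "start", list(active_runs)  # cancel in-flight runs, start fresh
    raise ValueError(f"unknown policy: {policy}")
```

With ticks arriving faster than runs finish, skip drops the excess, replace keeps only the newest run, and unbounded allow piles them up.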

6. "failed to create kubernetes client" at start

Infrastructure-layer failure, not job logic. See troubleshooting.

When to cancel vs. retry vs. release

  • Cancel — you don't want the work to complete. Produces a terminal cancelled state.
  • Retry — you want the work to run again, fresh. Leaves the failed run in history and creates a new one.
  • Release — the run's Kubernetes resources are orphaned or stuck, but the run state is fine. Frees the pod without affecting status.

If you're unsure, cancel first. Cancelling is reversible in effect (you can re-trigger a new run); releasing without cancelling leaves a potentially still-running pod behind.

API reference

  • Job runs — CreateJobRun, DescribeJobRun, RetryJobRun, ReleaseJobRun, DeleteJobRun.