Runs & debugging
A job run is a single execution of a job. Runs are immutable — re-running a job creates a new run, never mutating an old one. This page covers the run lifecycle, the recovery actions, and how to diagnose a misbehaving run.
For the job definition, see Defining jobs. For scheduling, see Scheduling.
Prerequisites
- A saved job that has been triggered at least once (manually or on schedule) so there's a run to inspect.
Run lifecycle
Runs move through a detailed lifecycle so you can tell exactly where a stuck run is stuck.
| Status | Meaning |
|---|---|
| `pending` | Run accepted, queued behind the scheduler |
| `ready` | Ready to be placed on a node |
| `starting` | Kubernetes pod being created |
| `creating_sail` / `waiting_for_sail` | Sail (the query engine) is being deployed |
| `creating_runner` | Runner pod starting |
| `running` | Your code is executing |
| `succeeded` / `failed` / `cancelled` / `timeout` | Terminal states |
Most runs move from pending to running in under a minute on a warm cluster. The first run on a cold cluster is slower — Karpenter has to provision new compute nodes before anything else happens.
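Because the terminal states are a closed set, waiting on a run is a simple poll loop. A minimal sketch, assuming a hypothetical `client` object whose `describe_job_run` returns a dict with a `status` key (the client and its method name are illustrative, not a documented SDK):

```python
import time

# Terminal statuses from the lifecycle table above.
TERMINAL = {"succeeded", "failed", "cancelled", "timeout"}

def wait_for_run(client, run_id, poll_interval=5.0, timeout=600.0):
    """Poll a run until it reaches a terminal state, or raise after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = client.describe_job_run(run_id)["status"]
        if status in TERMINAL:
            return status
        time.sleep(poll_interval)
    raise TimeoutError(f"run {run_id} still not terminal after {timeout}s")
```

Any status not in the terminal set (including all the intermediate `creating_*` states) is treated as "still in flight", so new intermediate states don't break the loop.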
Inspect a run
Click into a run for:
- Logs — stdout and stderr from Sail and the runner.
- Metrics — CPU, memory, duration.
- Result — any output the job wrote.
- Version link — back to the exact version of the job that dispatched. Useful when the job has changed since this run fired.
Recovery actions
Two actions on non-terminal runs:
- Retry — create a new run of the same job version. Fastest path to recovery when a transient infrastructure failure tripped a pipeline.
- Release — detach the run from its Kubernetes resources without cancelling. Use when something's stuck at the infra layer and you want to clean up without aborting running code.
Retry creates a new run object. Release leaves the existing run in its current state but frees the pod and any cluster-side resources.
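The immutability rule makes the two actions easy to model: retry constructs a new run, release edits only the resource attachment. A sketch of that distinction (the `Run` fields and function names here are illustrative, not platform API):

```python
from dataclasses import dataclass, replace
from typing import Optional

@dataclass(frozen=True)
class Run:
    run_id: str
    job_version: str
    status: str
    pod_name: Optional[str] = None

def retry(run: Run, new_run_id: str) -> Run:
    # Retry never mutates the old run: it yields a fresh run of the same job version.
    return Run(run_id=new_run_id, job_version=run.job_version, status="pending")

def release(run: Run) -> Run:
    # Release keeps the run's status; only the cluster-side resources are detached.
    return replace(run, pod_name=None)
```

The frozen dataclass mirrors the "runs are immutable" guarantee: both actions return new objects and leave the original record intact for history.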
Debugging checklist
When a run misbehaves, work from the layer closest to the failure outward:
1. Run status is failed
Open the run and read the logs. Application errors — SQL syntax, missing columns, Python exceptions — surface here with stack traces.
Common causes:
- Table or column doesn't exist in the catalog.
- Permission denied on S3 / Glue (check the IAM role's workload boundary).
- Python wheel missing a dependency.
2. Run stuck in waiting_for_sail
The Sail pod couldn't reach Ready within ~10 minutes. See the troubleshooting entry for the full diagnosis. Usually: cluster out of compute capacity, image pull failure, or EC2 service quota hit.
3. Run status is timeout
The job exceeded its configured timeout. Either raise the timeout (if the workload legitimately grew) or investigate why it's slower than expected — a common cause is a catalog that returned a much larger result set than before.
4. Scheduled tick didn't fire
- Is the job Active (not paused)?
- Does the cron expression match what you think it does? Test it with a cron parser.
- Was the cluster healthy at the expected time? A `failed` or `destroying` cluster drops ticks on the floor.
- Does the missed-schedule policy match your expectations? `latest` only fires once on recovery.
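For sanity-checking a cron expression, a tiny stdlib-only matcher is enough. This sketch handles only `*`, `*/n`, comma lists, and plain numbers (no ranges or names), which covers the common scheduling patterns:

```python
from datetime import datetime

def _field_matches(field: str, value: int) -> bool:
    # Supports "*", "*/n", comma lists, and plain numbers: a small subset of cron syntax.
    for part in field.split(","):
        if part == "*":
            return True
        if part.startswith("*/") and value % int(part[2:]) == 0:
            return True
        if part.isdigit() and int(part) == value:
            return True
    return False

def cron_matches(expr: str, when: datetime) -> bool:
    """True if a 5-field cron expression fires at the minute `when`."""
    minute, hour, dom, month, dow = expr.split()
    return (_field_matches(minute, when.minute)
            and _field_matches(hour, when.hour)
            and _field_matches(dom, when.day)
            and _field_matches(month, when.month)
            and _field_matches(dow, when.isoweekday() % 7))  # cron convention: 0 = Sunday
```

Feed it the datetimes you expected ticks at; a `False` where you expected `True` usually means the expression doesn't say what you think it does.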
5. Schedule fires but runs overlap unexpectedly
Revisit the concurrency policy. If ticks are piling up, `skip` or `replace` prevents the pileup; `allow` without a sensible `maxConcurrentRuns` is a foot-gun.
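The difference between the three policies comes down to what a tick does when a run is still in flight. An illustrative decision table (the return values are made up; the policy names follow the text above):

```python
def on_tick(policy: str, active_runs: list) -> str:
    """What a scheduler tick does when earlier runs may still be in flight."""
    if not active_runs:
        return "start"
    if policy == "skip":
        return "drop_tick"            # tick discarded; no pileup
    if policy == "replace":
        return "cancel_then_start"    # running work gives way to the new tick
    if policy == "allow":
        return "start_overlapping"    # runs accumulate; bound this with maxConcurrentRuns
    raise ValueError(f"unknown policy: {policy}")
```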
6. "failed to create kubernetes client" at start
Infrastructure-layer failure, not job logic. See troubleshooting.
When to cancel vs. retry vs. release
- Cancel — you don't want the work to complete. Produces a terminal `cancelled` state.
- Retry — you want the work to run again, fresh. Leaves the failed run in history and creates a new one.
- Release — the run's Kubernetes resources are orphaned or stuck, but the run state is fine. Frees the pod without affecting status.
If you're unsure, cancel first. Cancelling is reversible in effect (you can re-trigger a new run); releasing without cancelling leaves a potentially still-running pod behind.
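The guidance above reduces to two questions. A sketch of that decision, not platform behavior:

```python
def choose_action(want_work_to_finish: bool, only_infra_is_stuck: bool) -> str:
    """Pick cancel / retry / release from two yes-no questions."""
    if not want_work_to_finish:
        return "cancel"    # terminal state; re-trigger later if needed
    if only_infra_is_stuck:
        return "release"   # frees the pod, run status untouched
    return "retry"         # fresh run of the same job version
```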
API reference
- Job runs — `CreateJobRun`, `DescribeJobRun`, `RetryJobRun`, `ReleaseJobRun`, `DeleteJobRun`.