Troubleshooting
Issues on this page are real — they correspond to error states you'll actually see in the UI, not hypothetical edge cases. Each entry lists the message exactly as it appears so search lands you here.
Before anything else, check which layer is failing: cloud account, network, cluster, or workload. A failure at a lower layer (e.g. cloud account) makes everything above it look broken.
Cloud account: "We couldn't verify the connection"
Message: "We couldn't verify the connection. Check the stack status in CloudFormation, then Verify Connection again."
What happened: LakeSail tried to assume the IAM role you pasted and the call failed (failed_to_assume_role). The CloudFormation stack may not have finished, the role ARN may be wrong, or the trust policy may have been edited.
Fix:
- Open the CloudFormation stack in your AWS console. Confirm the status is CREATE_COMPLETE.
- Open the Outputs tab. Copy the Role ARN exactly — stray spaces and extra characters are a common cause of verification failures.
- Paste it into LakeSail and click Verify Connection again.
- If it still fails, open the role in IAM and confirm the trust policy includes the LakeSail principal and external ID shown in the modal. If you modified the template, redeploy the stack unchanged.
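If you'd rather confirm the trust policy programmatically than by eye, the check is a small JSON walk. A minimal sketch (the principal ARN and external ID below are placeholders; use the exact values shown in the modal):

```python
import json

# Placeholder values: substitute the principal and external ID
# shown in the LakeSail connection modal.
EXPECTED_PRINCIPAL = "arn:aws:iam::111111111111:root"
EXPECTED_EXTERNAL_ID = "lakesail-example-external-id"

def trust_policy_ok(policy_json: str) -> bool:
    """Return True if the trust policy grants sts:AssumeRole to the
    expected principal with the expected sts:ExternalId condition."""
    policy = json.loads(policy_json)
    for stmt in policy.get("Statement", []):
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        if "sts:AssumeRole" not in actions:
            continue
        principal = stmt.get("Principal", {}).get("AWS", "")
        external_id = (stmt.get("Condition", {})
                           .get("StringEquals", {})
                           .get("sts:ExternalId"))
        if principal == EXPECTED_PRINCIPAL and external_id == EXPECTED_EXTERNAL_ID:
            return True
    return False
```

Paste in the policy document from the role's Trust relationships tab in IAM; a False result means the template was edited and the stack should be redeployed unchanged.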
Cloud account: "Failed to create connection"
Message: "Failed to create connection. Please try again."
What happened: The ARN verified, but the backend couldn't persist the connection record.
Fix: Retry. If it repeats, capture the timestamp and contact support — this isn't a config issue on your side.
Network: "Failed to load regions"
Message: "Failed to load regions. Please try again."
What happened: LakeSail tried to enumerate AWS regions through your connected cloud account and the assume-role call failed. Usually this means the cloud account credentials are no longer valid — the trust policy was changed, the role was deleted, or the external ID is out of sync.
Fix:
- Go to Settings → Cloud Accounts and check the account's status.
- If it's disconnected or failed, redeploy the CloudFormation stack (same connection ID — do not create a new one) and re-verify.
- If it's active, wait a minute and retry. Transient IAM propagation delays resolve on their own.
Network: CIDR overlap
What happened: The CIDR you entered overlaps the LakeSail platform VPC (or a VPC you're peering with), so provisioning refuses to continue.
Fix: Pick a CIDR that doesn't overlap. 10.0.0.0/16, 10.100.0.0/16, and 172.20.0.0/16 are usually safe, as is any private range not already in use. Avoid 172.16.0.0/16 and 192.168.0.0/16 if you expect to peer with office networks that use them.
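The overlap check is mechanical, so you can pre-validate a candidate CIDR locally before submitting it. A minimal sketch using Python's standard ipaddress module (the in-use ranges below are illustrative; substitute the platform VPC CIDR and any peered ranges):

```python
import ipaddress

def overlapping(candidate: str, existing: list[str]) -> list[str]:
    """Return the networks in `existing` that overlap `candidate`."""
    cand = ipaddress.ip_network(candidate)
    return [cidr for cidr in existing
            if cand.overlaps(ipaddress.ip_network(cidr))]

# Illustrative in-use ranges: an office network and a home-router default.
in_use = ["172.16.0.0/16", "192.168.0.0/16"]
print(overlapping("10.100.0.0/16", in_use))   # → []
print(overlapping("172.16.32.0/20", in_use))  # → ['172.16.0.0/16']
```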
Cluster: "Min/Desired nodes cannot exceed max nodes"
Messages:
- "Min nodes cannot exceed max nodes."
- "Desired nodes cannot exceed max nodes."
What happened: Node group sizes are out of order.
Fix: Adjust the numbers so min ≤ desired ≤ max. The autoscaler needs a valid range before it will accept the config.
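If you're templating cluster configs, the same ordering check can be reproduced locally before submitting. A sketch (the first two messages mirror the UI; the third is implied by min ≤ desired ≤ max but is an assumption about the exact wording):

```python
def validate_node_counts(min_n: int, desired_n: int, max_n: int) -> list[str]:
    """Check the autoscaler's required ordering: min <= desired <= max."""
    errors = []
    if min_n > max_n:
        errors.append("Min nodes cannot exceed max nodes.")
    if desired_n > max_n:
        errors.append("Desired nodes cannot exceed max nodes.")
    if desired_n < min_n:
        # Implied by the min <= desired <= max rule; message wording assumed.
        errors.append("Desired nodes cannot be below min nodes.")
    return errors

print(validate_node_counts(1, 2, 3))  # → []
```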
Cluster: instance too small for system pods
Message pattern: "{instance type} supports only N pods per node, but each system node needs at least M (K DaemonSet pods + L rolling-restart margin). Choose a larger instance type."
What happened: The instance type you picked has a per-node pod limit below what LakeSail's system pods need. This isn't about your workload — it's about the VPC CNI, kube-proxy, EBS CSI, and a margin for rolling restarts.
Fix: Pick a larger management-node instance. m8g.large (the default) fits comfortably; anything smaller usually doesn't.
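The per-node pod limit comes from the AWS VPC CNI's default formula: ENIs × (IPv4 addresses per ENI − 1) + 2. A quick way to sanity-check an instance type before picking it (the ENI and IP counts below are illustrative; look up the real limits for your instance type in the AWS documentation):

```python
def max_pods(enis: int, ipv4_per_eni: int) -> int:
    """AWS VPC CNI default pod limit: one IP per ENI is the primary
    address, plus 2 for host-network pods."""
    return enis * (ipv4_per_eni - 1) + 2

def fits(enis: int, ipv4_per_eni: int,
         daemonset_pods: int, restart_margin: int) -> bool:
    """Does the instance leave room for system pods plus rolling-restart margin?"""
    return max_pods(enis, ipv4_per_eni) >= daemonset_pods + restart_margin

# Illustrative values: a *.large-class instance vs. a very small one.
print(max_pods(3, 10))  # → 29
print(max_pods(2, 4))   # → 8
```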
Cluster: not enough CPU for system controllers
Message pattern: "{instance type} with N minimum node(s) provides ~Xm allocatable CPU, but at least Ym are needed for system controllers. Choose a larger instance type or increase the minimum node count."
What happened: Your min node count × the instance's allocatable CPU isn't enough for Karpenter and other system controllers. Pods will fail to schedule.
Fix: Either raise Min Management Nodes, or pick a larger instance type. The defaults (1 min, m8g.large) are fine for most evaluation setups.
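The check behind this message is plain arithmetic: minimum node count × allocatable millicores per node must cover the controllers' requirement. A sketch with assumed numbers (allocatable CPU varies by instance type and kubelet reservations, and the 500m requirement below is illustrative):

```python
def enough_cpu(min_nodes: int, allocatable_millicores_per_node: int,
               required_millicores: int) -> bool:
    """Does the minimum node count provide enough allocatable CPU
    for the system controllers?"""
    return min_nodes * allocatable_millicores_per_node >= required_millicores

# Illustrative: a 2-vCPU node often exposes roughly 1930m allocatable
# after kubelet and system reservations.
print(enough_cpu(1, 1930, 500))  # → True
print(enough_cpu(1, 250, 500))   # → False
```

Either factor in the inequality works as a fix, which is why the message offers both a larger instance type and a higher minimum node count.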
Job run stuck in waiting_for_sail
Underlying error: "timeout waiting for sail server to be ready"
What happened: The Sail engine pod for this run couldn't reach Ready within ~10 minutes. Usual causes: the cluster is scheduling new compute nodes (cold start), an image pull failed, or the node group can't grow because of AWS capacity or quota.
Fix:
- Check the cluster's compute usage. If the cluster is already at max size and all nodes are busy, either raise the max or wait for other work to finish.
- If no new nodes appear in AWS after 5 minutes, check your EC2 service quota in the region — Karpenter will surface quota errors to the cluster event log.
- If the issue persists on a healthy cluster, cancel the run and retry. Repeated failures with no resource explanation are worth sending to support with the run ID.
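If you script against runs, the readiness wait is an ordinary poll-with-deadline loop. A minimal sketch of that pattern (the 10-minute window and poll interval are assumptions based on the timeout described above, and is_ready stands in for whatever status check your tooling uses):

```python
import time

def wait_for_ready(is_ready, timeout_s: float = 600, poll_s: float = 10) -> bool:
    """Poll `is_ready()` until it returns True or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if is_ready():
            return True
        time.sleep(poll_s)
    raise TimeoutError("timeout waiting for sail server to be ready")
```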
Job run fails at start: "failed to create kubernetes client"
What happened: LakeSail couldn't open the tunnel to your cluster. Either the cluster is unreachable from the control plane (VPC changed, security groups modified, NAT gateway down) or the cluster's credentials rotated out from under it.
Fix:
- Check the cluster status. If it's failed or updating, wait for it to return to active before retrying.
- If you recently modified the network's security groups or route tables outside LakeSail, revert. The VPC is managed — changes made directly in AWS can break connectivity.
- If the cluster looks healthy but every run fails with this error, contact support with the run ID.
Session: token rejected
What happened: Tokens issued from sessions are short-lived. A token that worked an hour ago may no longer be valid.
Fix: Generate a new token from the session detail page. If your BI tool supports it, configure token rotation; otherwise plan to refresh manually.
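If your client can't rotate tokens natively, a small wrapper that refreshes before expiry covers most cases. This is a sketch, not the LakeSail API: fetch stands in for whatever call issues a new token from the session, and the lifetime is an assumption, so check your session settings.

```python
import time

class RotatingToken:
    """Cache a short-lived token and refresh it shortly before expiry."""

    def __init__(self, fetch, ttl_s: float = 3600, margin_s: float = 300):
        self.fetch = fetch          # callable that issues a new token (assumed)
        self.ttl_s = ttl_s          # assumed lifetime; check your session settings
        self.margin_s = margin_s    # refresh this long before expiry
        self._token = None
        self._expires_at = 0.0

    def get(self) -> str:
        now = time.monotonic()
        if self._token is None or now >= self._expires_at - self.margin_s:
            self._token = self.fetch()
            self._expires_at = now + self.ttl_s
        return self._token
```

Point your BI tool's credential hook at get() so every request picks up a fresh token without manual refreshes.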
Still stuck?
- Status pages to cross-check: .
- Getting help: . Include the organization ID, the resource ID that's failing (cloud account / network / cluster / run), and the timestamp.