Kubernetes
INFO
LakeSail offers flexible enterprise support options, including managing Sail on Kubernetes.
Get in touch to learn more.
Sail supports distributed data processing on Kubernetes clusters. This guide demonstrates how to deploy Sail on a Kubernetes cluster and connect to it via Spark Connect.
Building the Docker Image
We first need to build the Docker image for Sail. Please refer to the Docker Images Guide for more information.
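The exact build steps are covered there; as a rough sketch, and assuming a Dockerfile at the root of the Sail repository (an assumption, not a documented path), the build may look like this.
# Build the Sail image locally; the tag must match the image referenced in the manifests below.
docker build -t sail:latest .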
Loading the Docker Image
You will need to make the Docker image available to your Kubernetes cluster. The exact steps depend on your Kubernetes environment. For example, you may want to push the Docker image to a private Docker registry that is accessible from the Kubernetes cluster.
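For instance, pushing to a private registry might look like the following, where registry.example.com is a placeholder for your registry host.
# Tag the local image for a hypothetical private registry and push it.
docker tag sail:latest registry.example.com/sail:latest
docker push registry.example.com/sail:latest
If you go this route, remember to update the image references in the Kubernetes manifests below to point at your registry.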
If you are using a local Kubernetes cluster, you may need to load the Docker image into the cluster. The command varies depending on the Kubernetes distribution you are using. Here are examples for some well-known Kubernetes distributions. You can refer to their documentation for more details.

# kind
kind load docker-image sail:latest

# minikube
minikube image load sail:latest

# k3d
k3d image import sail:latest

# k3s
docker save sail:latest | k3s ctr images import -

# MicroK8s
docker save sail:latest | microk8s images import
The following sections use kind as an example, but you can run Sail on any other Kubernetes distribution of your choice. Run the following command to create a local Kubernetes cluster.
kind create cluster
Then load the Docker image into the cluster.
kind load docker-image sail:latest
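To confirm that the image is available inside the cluster, you can list the images on the kind node. This assumes the default cluster name, whose control-plane node container is named kind-control-plane.
# List the container images known to the kind node and filter for the Sail image.
docker exec kind-control-plane crictl images | grep sail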
Running the Sail Server
INFO
The way to configure Sail applications is not yet stable. Please refer to the Changelog for any configuration changes in future releases.
Create a file named sail.yaml with the following content.
apiVersion: v1
kind: Namespace
metadata:
  name: sail
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sail-spark-server
  namespace: sail
  labels:
    app.kubernetes.io/name: sail
    app.kubernetes.io/component: spark-server
spec:
  # We cannot have more than one replica because each Spark session is tied to a single pod.
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: sail
      app.kubernetes.io/component: spark-server
  template:
    metadata:
      labels:
        app.kubernetes.io/name: sail
        app.kubernetes.io/component: spark-server
    spec:
      serviceAccountName: sail-user
      containers:
        - name: server
          image: sail:latest
          command: [ "sail" ]
          args: [ "spark", "server", "--port", "50051" ]
          ports:
            - containerPort: 50051
          imagePullPolicy: IfNotPresent
          env:
            - name: RUST_LOG
              value: info
            - name: SAIL_MODE
              value: "kubernetes-cluster"
            - name: SAIL_CLUSTER__DRIVER_LISTEN_HOST
              value: "0.0.0.0"
            - name: SAIL_CLUSTER__DRIVER_EXTERNAL_HOST
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
            - name: SAIL_KUBERNETES__IMAGE
              value: sail:latest
            - name: SAIL_KUBERNETES__NAMESPACE
              value: sail
            - name: SAIL_KUBERNETES__DRIVER_POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
---
apiVersion: v1
kind: Service
metadata:
  name: sail-spark-server
  namespace: sail
  labels:
    app.kubernetes.io/name: sail
    app.kubernetes.io/component: spark-server
spec:
  selector:
    app.kubernetes.io/name: sail
    app.kubernetes.io/component: spark-server
  ports:
    - protocol: TCP
      port: 50051
      targetPort: 50051
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: sail-spark-server
  namespace: sail
rules:
  - apiGroups: [ "" ]
    resources: [ "pods" ]
    verbs: [ "*" ]
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: sail-user
  namespace: sail
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: sail-spark-server
  namespace: sail
subjects:
  - kind: ServiceAccount
    name: sail-user
    namespace: sail
roleRef:
  kind: Role
  name: sail-spark-server
  apiGroup: rbac.authorization.k8s.io
Create the Kubernetes resources using the following command. The Sail Spark Connect server runs as a Kubernetes deployment, and the gRPC port is exposed as a Kubernetes service.
kubectl apply -f sail.yaml
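Before connecting, you can verify that the server pod is running and the service is ready.
# Check the resources created in the sail namespace.
kubectl -n sail get deployments,pods,services
The sail-spark-server pod should reach the Running state, and the service should expose port 50051.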
Running Spark Jobs
To connect to the Sail Spark Connect server, you can forward the service port to your local machine.
kubectl -n sail port-forward service/sail-spark-server 50051:50051
Now you can run Spark jobs using the standalone Spark client. Here is an example of running the PySpark shell powered by Sail on Kubernetes. Sail worker pods are launched on demand to execute Spark jobs, and they are terminated after a period of inactivity.
env SPARK_CONNECT_MODE_ENABLED=1 SPARK_REMOTE="sc://localhost:50051" pyspark
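Once the shell starts, any query exercises the full path through the Sail server. As a minimal smoke test, the following sums the integers 0 through 9 and should print 45 (the query itself is arbitrary).
>>> spark.range(10).selectExpr("sum(id) AS total").show()
While the job runs, you can watch worker pods being launched and terminated in a separate terminal.
kubectl -n sail get pods --watch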
Cleaning Up
Run the following command to clean up the Kubernetes resources for the Sail server. All Sail worker pods will be terminated automatically as well.
kubectl delete -f sail.yaml
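Once the deletion completes, you can confirm that the namespace is gone.
kubectl get namespace sail
This should report that the sail namespace is not found.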
You can delete the kind cluster using the following command.
kind delete cluster