HDFS
You can use the hdfs:// URI to read and write data stored in the Hadoop Distributed File System (HDFS).
For example, hdfs://namenode:port/path/to/data is a path in HDFS.
Basic Usage
INFO
In the code below, spark refers to a Spark session connected to the Sail server. See the Getting Started guide for how to set this up.
df = spark.range(1000)
df.write.mode("overwrite").parquet("hdfs://namenode:9000/user/data")
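Reading the data back uses the same hdfs:// URI. For example:
# read the data written above
df = spark.read.parquet("hdfs://namenode:9000/user/data")
df.show()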
Kerberos Authentication
INFO
Kerberos authentication for HDFS is supported and tested with Sail.
Prerequisites
- An HDFS cluster configured with Kerberos authentication (secure mode). See Apache Hadoop Secure Mode for details.
- A valid keytab file for the principal that will access HDFS.
- A krb5.conf file on the Sail server host (a minimal example follows this list). See the MIT Kerberos Documentation for details.
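As a reference, a minimal krb5.conf might look like the following. YOUR.REALM and kdc.your.realm are placeholders for your own realm and KDC host; consult the MIT Kerberos documentation for the full set of options.
[libdefaults]
    default_realm = YOUR.REALM

[realms]
    YOUR.REALM = {
        kdc = kdc.your.realm
    }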
Starting the Sail Server
Authenticate with Kerberos before starting the Sail server.
import subprocess
from pysail.spark import SparkConnectServer
# authenticate with Kerberos
subprocess.run([
"kinit", "-kt",
"/path/to/user.keytab",
"username@YOUR.REALM"
], check=True)
# start the Sail server
server = SparkConnectServer(ip="0.0.0.0", port=50051)
server.start(background=False)
TIP
The Sail server runs in local mode by default, and the server process uses the single Kerberos ticket obtained via kinit.
If you run Sail in cluster mode (e.g. on Kubernetes), each worker instance must perform its own Kerberos authentication via kinit.
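To verify that a ticket was obtained before the server begins serving requests, you can optionally inspect the credential cache. klist is part of the standard Kerberos client tools; the check below is a minimal sketch.
import subprocess

# list the tickets in the current credential cache;
# this raises an error if kinit did not succeed
subprocess.run(["klist"], check=True)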
Client Connection
The client itself does not need Kerberos credentials; the Sail server handles HDFS authentication on the client's behalf.
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.remote("sc://localhost:50051") \
.getOrCreate()
# write to Kerberos-secured HDFS
df = spark.range(1000)
df.write.mode("overwrite") \
.parquet("hdfs://namenode:9000/user/username/data")
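You can then read the data back through the same session; the authenticated HDFS access happens on the server side. For example:
# read back from Kerberos-secured HDFS
df = spark.read.parquet("hdfs://namenode:9000/user/username/data")
print(df.count())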