HDFS
You can use the hdfs:// URI to read and write data stored in the Hadoop Distributed File System (HDFS).
For example, hdfs://namenode:port/path/to/data is a path in HDFS.
Basic Usage
INFO
In the code below, spark refers to a Spark session connected to the Sail server. See the Getting Started guide for how to set this up.
df = spark.range(1000)
df.write.mode("overwrite").parquet("hdfs://namenode:9000/user/data")
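Reading the data back uses the same hdfs:// URI. For example:
# read the data written above
df = spark.read.parquet("hdfs://namenode:9000/user/data")
df.show()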
Kerberos Authentication
INFO
Kerberos authentication for HDFS is supported and tested with Sail.
Prerequisites
- An HDFS cluster configured with Kerberos authentication (secure mode). See Apache Hadoop Secure Mode for details.
- A valid keytab file for the principal that will access HDFS.
- A krb5.conf file on the Sail server host (a minimal example follows this list). See the MIT Kerberos Documentation for details.
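As a reference, a minimal krb5.conf might look like the following. YOUR.REALM and kdc.your.realm are placeholders for your own realm and KDC host; consult the MIT Kerberos documentation for the full set of options.
[libdefaults]
    default_realm = YOUR.REALM

[realms]
    YOUR.REALM = {
        kdc = kdc.your.realm
    }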
Starting the Sail Server
Authenticate with Kerberos before starting the Sail server.
import subprocess
from pysail.spark import SparkConnectServer
# authenticate with Kerberos
subprocess.run([
"kinit", "-kt",
"/path/to/user.keytab",
"username@YOUR.REALM"
], check=True)
# start the Sail server
server = SparkConnectServer(ip="0.0.0.0", port=50051)
server.start(background=False)
TIP
The Sail server runs in local mode by default, and the server process uses the single Kerberos ticket obtained via kinit.
If you run Sail in cluster mode (e.g. on Kubernetes), each worker instance must perform its own Kerberos authentication via kinit.
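To verify that a ticket was obtained before the server begins serving requests, you can optionally inspect the credential cache. klist is part of the standard Kerberos client tools; the check below is a minimal sketch.
import subprocess

# list the tickets in the current credential cache;
# this raises an error if kinit did not succeed
subprocess.run(["klist"], check=True)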
Client Connection
The client itself does not need Kerberos credentials; the Sail server handles HDFS authentication on the client's behalf.
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.remote("sc://localhost:50051") \
.getOrCreate()
# write to Kerberos-secured HDFS
df = spark.range(1000)
df.write.mode("overwrite") \
.parquet("hdfs://namenode:9000/user/username/data")
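You can then read the data back through the same session; the authenticated HDFS access happens on the server side. For example:
# read back from Kerberos-secured HDFS
df = spark.read.parquet("hdfs://namenode:9000/user/username/data")
print(df.count())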