# AWS S3

Sail supports reading and writing data to AWS S3 and S3-compatible object storage services (such as MinIO or Cloudflare R2) using the `s3://`, `s3a://`, or `https://` URI schemes.

For example, `s3://bucket/path/to/data` refers to the `path/to/data` path in the bucket `bucket`. Sail determines whether the path refers to a single object or a key prefix. If the path turns out to be a key prefix, we assume the key prefix is followed by `/` and represents a directory.
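For instance, the same reader call works for both cases. A minimal sketch, assuming a bucket named `my-bucket` with Parquet files under the `path/to/data/` prefix (`spark` is a Spark session connected to Sail; see the Examples section below):

```python
# Read a single object by its full key.
df = spark.read.parquet("s3://my-bucket/path/to/data/part-0.parquet")

# Read a key prefix: Sail treats "path/to/data" as the directory
# "path/to/data/" and reads the objects under it.
df = spark.read.parquet("s3://my-bucket/path/to/data")
```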
## URI Formats

Sail supports the following URI formats for S3 and S3-compatible object storage services:

- `s3` and `s3a` protocols
  - `s3://bucket/key`
  - `s3a://bucket/key`
- HTTPS endpoints for AWS S3 virtual-hosted-style requests
  - `https://bucket.s3.region.amazonaws.com/key`
  - `https://bucket.s3.amazonaws.com/key`
- HTTPS endpoints for AWS S3 path-style requests
  - `https://s3.region.amazonaws.com/bucket/key`
  - `https://s3.amazonaws.com/bucket/key`
- HTTPS endpoint for AWS S3 Express One Zone
  - `https://bucket.s3express-zone-id.region.amazonaws.com/key`
- HTTPS endpoints for AWS S3 Transfer Acceleration
  - `https://bucket.s3-accelerate.amazonaws.com/key`
  - `https://bucket.s3-accelerate.dualstack.amazonaws.com/key`
- `bucket` is the bucket name.
- `key` can refer to a single object or a key prefix.
- `region` is the name of the AWS region where the bucket is located, such as `us-east-1`.
- `s3express-zone-id` is the ID of the availability zone where the S3 Express One Zone bucket is located, such as `use1-az4`.

Note that S3 Express One Zone bucket names follow a specific format. For example, `my-bucket--use1-az4--x-s3` is a valid bucket name in the `use1-az4` availability zone of the `us-east-1` region.
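For instance, the following URIs all refer to the same data and can be used interchangeably, assuming a bucket named `my-bucket` in the `us-east-1` region (the bucket name and region are placeholders):

```python
# Equivalent ways to address the same S3 path.
df = spark.read.parquet("s3://my-bucket/path/to/data")
df = spark.read.parquet("s3a://my-bucket/path/to/data")
df = spark.read.parquet("https://my-bucket.s3.us-east-1.amazonaws.com/path/to/data")
```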
> **INFO**
>
> - The `s3a` URI scheme behaves the same as the `s3` URI scheme in Sail. The `s3a` scheme is provided for compatibility with Spark applications that use the Hadoop S3A connector.
> - The `s3` and `s3a` URI schemes are also applicable to AWS S3 Express One Zone buckets.
> - AWS has discontinued support for path-style requests for new buckets created after September 30, 2020. For more information, see *Amazon S3 Path Deprecation Plan – The Rest of the Story* in the AWS News Blog.
## AWS Credentials

All AWS credential providers work out of the box in Sail. You can authenticate with AWS S3 using any of the supported methods, including the AWS `config` and `credentials` files, EC2 instance profiles, environment variables, and container credentials. Credential rotation happens automatically if you use temporary credentials.
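For example, you can provide static credentials through the standard AWS environment variables. A minimal sketch with placeholder values; `AWS_SESSION_TOKEN` is only needed for temporary credentials:

```bash
# Standard AWS environment variables with placeholder values.
export AWS_ACCESS_KEY_ID="AKIA..."
export AWS_SECRET_ACCESS_KEY="..."
# Only needed when using temporary credentials (e.g. from AWS STS).
export AWS_SESSION_TOKEN="..."
```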
> **INFO**
>
> You can refer to the AWS documentation for more details about the credential providers.
## AWS Region Configuration

You can configure the AWS region via the `AWS_REGION` environment variable. In this case, if the region is not explicitly set in the URI, the S3 bucket must be in the configured region.

For example, if you configure the AWS region using the following command, an error will be returned when accessing an S3 bucket outside the `us-east-1` region.

```bash
export AWS_REGION="us-east-1"
```
If you set the `AWS_REGION` environment variable to an empty string, the region of the S3 bucket will be inferred automatically. This way, you can access S3 data in any region without explicitly specifying the region in the URI.

```bash
export AWS_REGION=""
```
## Public Datasets on AWS

Some datasets on S3 allow public access without an AWS account. You can skip retrieving AWS credentials by setting the environment variable `AWS_SKIP_SIGNATURE=true`.

```bash
export AWS_SKIP_SIGNATURE=true
```

```python
df = spark.read.parquet("s3://some-public-bucket/path/to/data")
```
> **INFO**
>
> `AWS_SKIP_SIGNATURE` is a Sail-specific environment variable, not part of the standard AWS SDKs.
## S3-Compatible Services

### Cloudflare R2

You can configure the endpoint and credentials for Cloudflare R2 using environment variables. Here is an example.

```bash
export AWS_ACCESS_KEY_ID="smooth"
export AWS_SECRET_ACCESS_KEY="sailing"
export AWS_ENDPOINT="https://my-account-id.r2.cloudflarestorage.com"
```
### MinIO

You can configure the endpoint and credentials for MinIO using environment variables. Here is an example.

```bash
export AWS_ACCESS_KEY_ID="smooth"
export AWS_SECRET_ACCESS_KEY="sailing"
export AWS_ENDPOINT="http://localhost:9000"
```
Other storage services that are compatible with the AWS S3 API may be configured similarly. You can refer to the service documentation for more details.
## Examples

> **INFO**
>
> In the code below, `spark` refers to a Spark session connected to the Sail server. You can refer to the Getting Started guide for how it works.
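For reference, here is a minimal sketch of creating such a session over Spark Connect. The host and port are assumptions that depend on your Sail deployment; see the Getting Started guide for the exact address.

```python
from pyspark.sql import SparkSession

# Connect to a running Sail server over the Spark Connect protocol.
# The address is an assumption; adjust it for your deployment.
spark = SparkSession.builder.remote("sc://localhost:50051").getOrCreate()
```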
### Spark DataFrame API

```python
# You can use any valid URI format for S3 or S3-compatible services
# to specify the path to read or write data.
path = "s3://my-bucket/path/to/data"

df = spark.createDataFrame([(1, "Alice"), (2, "Bob")], schema="id INT, name STRING")
df.write.parquet(path)

df = spark.read.parquet(path)
df.show()
```
### Spark SQL

```python
# You can use any valid URI format for S3 or S3-compatible services
# to specify the location of the table.
sql = """
CREATE TABLE my_table (id INT, name STRING)
USING parquet
LOCATION 's3://my-bucket/path/to/data'
"""
spark.sql(sql)

spark.sql("SELECT * FROM my_table").show()

spark.sql("INSERT INTO my_table VALUES (3, 'Charlie'), (4, 'David')")
spark.sql("SELECT * FROM my_table").show()
```