# AWS S3

Sail supports reading and writing data to AWS S3 and S3-compatible object storage services (such as MinIO or Cloudflare R2) using the `s3://`, `s3a://`, or `https://` URI schemes.

For example, `s3://bucket/path/to/data` refers to the `path/to/data` path in the bucket `bucket`. Sail determines whether the path refers to a single object or a key prefix. If the path turns out to be a key prefix, we assume the key prefix is followed by `/` and represents a directory.
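For instance, the same reader call works for both cases. A minimal sketch, assuming a bucket named `my-bucket` with Parquet files under the `path/to/data/` prefix (`spark` is a Spark session connected to Sail; see the Examples section below):

```python
# Read a single object by its full key.
df = spark.read.parquet("s3://my-bucket/path/to/data/part-0.parquet")

# Read a key prefix: Sail treats "path/to/data" as the directory
# "path/to/data/" and reads the objects under it.
df = spark.read.parquet("s3://my-bucket/path/to/data")
```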
## URI Formats

Sail supports the following URI formats for S3 and S3-compatible object storage services:

- `s3` and `s3a` protocols
  - `s3://bucket/key`
  - `s3a://bucket/key`
- HTTPS endpoints for AWS S3 virtual-hosted-style requests
  - `https://bucket.s3.region.amazonaws.com/key`
  - `https://bucket.s3.amazonaws.com/key`
- HTTPS endpoints for AWS S3 path-style requests
  - `https://s3.region.amazonaws.com/bucket/key`
  - `https://s3.amazonaws.com/bucket/key`
- HTTPS endpoint for AWS S3 Express One Zone
  - `https://bucket.s3express-zone-id.region.amazonaws.com/key`
- HTTPS endpoints for AWS S3 Transfer Acceleration
  - `https://bucket.s3-accelerate.amazonaws.com/key`
  - `https://bucket.s3-accelerate.dualstack.amazonaws.com/key`
- `bucket` is the bucket name.
- `key` can refer to a single object or a key prefix.
- `region` is the name of the AWS region where the bucket is located, such as `us-east-1`.
- `s3express-zone-id` is the ID of the availability zone where the S3 Express One Zone bucket is located, such as `use1-az4`.

Note that S3 Express One Zone bucket names follow a specific format. For example, `my-bucket--use1-az4--x-s3` is a valid bucket name in the `use1-az4` availability zone of the `us-east-1` region.
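For instance, the following URIs all refer to the same data and can be used interchangeably, assuming a bucket named `my-bucket` in the `us-east-1` region (the bucket name and region are placeholders):

```python
# Equivalent ways to address the same S3 path.
df = spark.read.parquet("s3://my-bucket/path/to/data")
df = spark.read.parquet("s3a://my-bucket/path/to/data")
df = spark.read.parquet("https://my-bucket.s3.us-east-1.amazonaws.com/path/to/data")
```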
> **INFO**
>
> - The `s3a` URI scheme behaves the same as the `s3` URI scheme in Sail. The `s3a` scheme is provided for compatibility with Spark applications that use the Hadoop S3A connector.
> - The `s3` and `s3a` URI schemes are also applicable to AWS S3 Express One Zone buckets.
> - AWS has discontinued support for path-style requests for new buckets created after September 30, 2020. For more information, see *Amazon S3 Path Deprecation Plan – The Rest of the Story* in the AWS News Blog.
## AWS Credentials

All AWS credential providers work out of the box in Sail. You can authenticate with AWS S3 using any of the supported methods, including the AWS `config` and `credentials` files, EC2 instance profiles, environment variables, and container credentials. Credential rotation happens automatically if you use temporary credentials.
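For example, you can provide static credentials through the standard AWS environment variables. A minimal sketch with placeholder values; `AWS_SESSION_TOKEN` is only needed for temporary credentials:

```bash
# Standard AWS environment variables with placeholder values.
export AWS_ACCESS_KEY_ID="AKIA..."
export AWS_SECRET_ACCESS_KEY="..."
# Only needed when using temporary credentials (e.g. from AWS STS).
export AWS_SESSION_TOKEN="..."
```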
> **INFO**
>
> You can refer to the AWS documentation for more details about the credential providers.
## AWS Region Configuration

You can configure the AWS region via the `AWS_REGION` environment variable. In this case, if the region is not explicitly set in the URI, the S3 bucket must be in the configured region.

For example, if you configure the AWS region using the following command, an error will be returned when accessing an S3 bucket outside the `us-east-1` region.

```bash
export AWS_REGION="us-east-1"
```
If you set the `AWS_REGION` environment variable to an empty string, the region of the S3 bucket will be inferred automatically. This way, you can access S3 data in any region without explicitly specifying the region in the URI.

```bash
export AWS_REGION=""
```
## Public Datasets on AWS

Some datasets on S3 allow public access without an AWS account. You can skip retrieving AWS credentials by setting the environment variable `AWS_SKIP_SIGNATURE=true`.

```bash
export AWS_SKIP_SIGNATURE=true
```

```python
df = spark.read.parquet("s3://some-public-bucket/path/to/data")
```
> **INFO**
>
> `AWS_SKIP_SIGNATURE` is a Sail-specific environment variable, not part of the standard AWS SDKs.
## S3-Compatible Services

### Cloudflare R2

You can configure the endpoint and credentials for Cloudflare R2 using environment variables. Here is an example.

```bash
export AWS_ACCESS_KEY_ID="smooth"
export AWS_SECRET_ACCESS_KEY="sailing"
export AWS_ENDPOINT="https://my-account-id.r2.cloudflarestorage.com"
```
### MinIO

You can configure the endpoint and credentials for MinIO using environment variables. Here is an example.

```bash
export AWS_ACCESS_KEY_ID="smooth"
export AWS_SECRET_ACCESS_KEY="sailing"
export AWS_ENDPOINT="http://localhost:9000"
```
Other storage services that are compatible with the AWS S3 API may be configured similarly. You can refer to the service documentation for more details.
## Examples

> **INFO**
>
> In the code below, `spark` refers to a Spark session connected to the Sail server. You can refer to the Getting Started guide for how it works.
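For reference, here is a minimal sketch of creating such a session over Spark Connect. The host and port are assumptions that depend on your Sail deployment; see the Getting Started guide for the exact address.

```python
from pyspark.sql import SparkSession

# Connect to a running Sail server over the Spark Connect protocol.
# The address is an assumption; adjust it for your deployment.
spark = SparkSession.builder.remote("sc://localhost:50051").getOrCreate()
```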
### Spark DataFrame API

```python
# You can use any valid URI format for S3 or S3-compatible services
# to specify the path to read or write data.
path = "s3://my-bucket/path/to/data"

df = spark.createDataFrame([(1, "Alice"), (2, "Bob")], schema="id INT, name STRING")
df.write.parquet(path)

df = spark.read.parquet(path)
df.show()
```
### Spark SQL

```python
# You can use any valid URI format for S3 or S3-compatible services
# to specify the location of the table.
sql = """
CREATE TABLE my_table (id INT, name STRING)
USING parquet
LOCATION 's3://my-bucket/path/to/data'
"""
spark.sql(sql)

spark.sql("SELECT * FROM my_table").show()

spark.sql("INSERT INTO my_table VALUES (3, 'Charlie'), (4, 'David')")
spark.sql("SELECT * FROM my_table").show()
```