Data Storage
Sail provides a unified interface for reading and writing data across various storage systems, from local file systems to cloud object stores and distributed file systems. This abstraction allows you to seamlessly work with data regardless of where it's stored, using the same familiar Spark APIs.
Quick Examples
INFO

In the code below, `spark` refers to a Spark session connected to the Sail server. You can refer to the Getting Started guide to see how this works.
```python
# Local file system
df = spark.read.parquet("/path/to/local/data")
df = spark.read.parquet("file:///path/to/local/data")

# Cloud storage
df = spark.read.parquet("s3://bucket/data")
df = spark.read.parquet("azure://container/data")
df = spark.read.parquet("gs://bucket/data")

# In-memory storage
df = spark.read.parquet("memory:///cached/data")

# HTTP/HTTPS endpoints
df = spark.read.json("https://api.example.com/data.json")

# Create tables from any storage
spark.sql("""
CREATE TABLE my_table (id INT, name STRING)
USING parquet
LOCATION 's3://bucket/path/to/data'
""")
```
Storage Support Matrix
Here is a summary of supported (✅) and unsupported (❌) storage options for reading and writing data, along with features planned on our roadmap (🚧).
| Storage | Read Support | Write Support |
| --- | --- | --- |
| File Systems | ✅ | ✅ |
| Memory | ✅ | ✅ |
| AWS S3 | ✅ | ✅ |
| Cloudflare R2 | ✅ | ✅ |
| Azure Data Lake Storage (ADLS) | ✅ | ✅ |
| Azure Blob Storage | ✅ | ✅ |
| Google Cloud Storage | ✅ | ✅ |
| HDFS | ✅ | ✅ |
| Hugging Face | ✅ | ❌ |
| HTTP/HTTPS | ✅ | ✅ |
| JDBC | ❌ | ❌ |
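Cloud stores generally require credentials. As a sketch (assuming the S3 backend honors the standard AWS SDK environment variables, which is common for S3-compatible clients but worth verifying against the configuration docs for your deployment), you might export credentials before starting the Sail server:

```shell
# Assumption: the S3 backend picks up the standard AWS SDK environment
# variables. The values below are placeholders, not real credentials.
export AWS_ACCESS_KEY_ID="<your-access-key-id>"
export AWS_SECRET_ACCESS_KEY="<your-secret-access-key>"
export AWS_REGION="us-east-1"
```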
Special URL Handling
Some HTTPS URLs are automatically recognized as cloud storage:
- **S3**: URLs containing `amazonaws.com` or `r2.cloudflarestorage.com`.
- **Azure**: URLs containing `dfs.core.windows.net`, `blob.core.windows.net`, `dfs.fabric.microsoft.com`, or `blob.fabric.microsoft.com`.
These URLs will use the appropriate cloud storage backend instead of the generic HTTP store.
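The dispatch rule above can be sketched as a suffix match on the URL's host. The function and table below are illustrative only (this is not Sail's actual code), assuming recognition is based on the host suffixes listed above:

```python
from urllib.parse import urlparse

# Host suffixes taken from the list above, mapped to the backend
# they would select. This mapping is an illustrative assumption.
CLOUD_HOST_SUFFIXES = {
    "amazonaws.com": "s3",
    "r2.cloudflarestorage.com": "s3",
    "dfs.core.windows.net": "azure",
    "blob.core.windows.net": "azure",
    "dfs.fabric.microsoft.com": "azure",
    "blob.fabric.microsoft.com": "azure",
}

def classify_https_url(url: str) -> str:
    """Return the backend an HTTPS URL would map to: 's3', 'azure', or 'http'."""
    host = urlparse(url).hostname or ""
    for suffix, backend in CLOUD_HOST_SUFFIXES.items():
        if host == suffix or host.endswith("." + suffix):
            return backend
    return "http"  # fall back to the generic HTTP store
```

For example, `https://my-bucket.s3.us-east-1.amazonaws.com/data` would map to the S3 backend, while `https://api.example.com/data.json` would use the generic HTTP store.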