# PostgreSQL
Sail includes a Python DataSource V2 implementation for reading PostgreSQL tables as Spark DataFrames. It supports automatic schema inference, filter pushdown, and parallel partition-based reading.
## Installation
Install PySail with the PostgreSQL extra:
```bash
pip install "pysail[postgres]==0.5.1+11229a5"
```

This pulls in `psycopg2-binary` and `pyarrow` as additional dependencies.
## Quick Start
::: info
In the code below, `spark` refers to a Spark client session connected to the Sail server. See the Getting Started guide for how to set one up.
:::
```python
from pysail.datasources.postgres import PostgresDataSource

spark.dataSource.register(PostgresDataSource)

df = spark.read.format("postgres").options(
    url="jdbc:postgresql://localhost:5432/mydb",
    user="myuser",
    password="mypassword",
    dbtable="users",
).load()
df.show()
```

## Parallel Reading
Split reads across multiple partitions using a numeric column. Each partition reads the rows where `MOD(partitionColumn, numPartitions) = partition_id`.
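As a rough illustration of this scheme, a four-way split over an `id` column corresponds to one modulo-based query per partition. The helper below is a sketch for illustration only; the exact SQL Sail emits is an assumption, not taken from its implementation.

```python
# Illustrative sketch only: the exact SQL Sail generates is an assumption.
def partition_queries(dbtable, partition_column, num_partitions):
    """Build one modulo-based query per partition."""
    return [
        f"SELECT * FROM {dbtable} "
        f"WHERE MOD({partition_column}, {num_partitions}) = {pid}"
        for pid in range(num_partitions)
    ]

for query in partition_queries("large_table", "id", 4):
    print(query)
```

Because the predicates are disjoint and cover every row, the four readers together return the full table exactly once.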
```python
df = spark.read.format("postgres").options(
    url="jdbc:postgresql://localhost:5432/mydb",
    user="myuser",
    password="mypassword",
    dbtable="large_table",
    numPartitions="4",
    partitionColumn="id",
).load()
```

## Filter Pushdown
Comparison filters (`=`, `>`, `>=`, `<`, `<=`) are automatically pushed down to PostgreSQL, so only matching rows are transferred over the network.
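Conceptually, a supported comparison is rendered into the generated SQL, while any other filter stays behind and is evaluated by Spark after the rows arrive. The helper below is a hypothetical sketch of that split, not Sail's actual translation code.

```python
# Hypothetical sketch of filter pushdown; Sail's real translation differs.
SUPPORTED_OPS = {"=", ">", ">=", "<", "<="}

def push_down(column, op, value):
    """Return a WHERE clause for a supported comparison, else None."""
    if op not in SUPPORTED_OPS:
        return None  # unsupported filter: evaluated by Spark instead
    return f"WHERE {column} {op} {value}"
```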
```python
# This WHERE clause runs in PostgreSQL, not in Spark
df.filter("age > 25").show()
```

## Options
| Option | Required | Default | Description |
|---|---|---|---|
| `url` | Yes | | PostgreSQL JDBC URL (`jdbc:postgresql://host:port/database`) |
| `user` | Yes | | Username |
| `password` | Yes | | Password |
| `dbtable` | Yes | | Table name |
| `tableSchema` | No | `public` | PostgreSQL schema containing the table |
| `numPartitions` | No | `1` | Number of parallel readers (must be ≥ 1) |
| `partitionColumn` | No | | Column for partitioning (required when `numPartitions` > 1) |
| `fetchsize` | No | `8192` | Number of rows fetched per batch (must be ≥ 1) |
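The constraints above can be summarized in a small validator. This is an illustrative sketch of the rules stated in the table, not Sail's actual option-handling code.

```python
# Sketch of the option constraints from the table above; the validation
# logic here is illustrative, not Sail's actual implementation.
def validate_options(opts):
    """Raise ValueError if the option dict violates the documented rules."""
    for key in ("url", "user", "password", "dbtable"):
        if key not in opts:
            raise ValueError(f"missing required option: {key}")
    num_partitions = int(opts.get("numPartitions", "1"))
    if num_partitions < 1:
        raise ValueError("numPartitions must be >= 1")
    if num_partitions > 1 and "partitionColumn" not in opts:
        raise ValueError("partitionColumn is required when numPartitions > 1")
    if int(opts.get("fetchsize", "8192")) < 1:
        raise ValueError("fetchsize must be >= 1")
```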
