# PostgreSQL
Sail includes a Python DataSource V2 implementation for reading PostgreSQL tables as Spark DataFrames. It supports automatic schema inference, filter pushdown, and parallel partition-based reading.
## Installation
Install PySail with the PostgreSQL extra:
```bash
pip install "pysail[postgres]==0.5.1+11229a5"
```

This pulls in `psycopg2-binary` and `pyarrow` as additional dependencies.
## Quick Start
::: info
In the code below, `spark` refers to a Spark client session connected to the Sail server. See the Getting Started guide for how to set one up.
:::
```python
from pysail.datasources.postgres import PostgresDataSource

spark.dataSource.register(PostgresDataSource)

df = spark.read.format("postgres").options(
    url="jdbc:postgresql://localhost:5432/mydb",
    user="myuser",
    password="mypassword",
    dbtable="users",
).load()
df.show()
```

## Parallel Reading
Split reads across multiple partitions using a numeric column. Each partition reads the rows where `MOD(partitionColumn, numPartitions) = partition_id`.
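As a rough illustration of this scheme, a four-way split over an `id` column corresponds to one modulo-based query per partition. The helper below is a sketch for illustration only; the exact SQL Sail emits is an assumption, not taken from its implementation.

```python
# Illustrative sketch only: the exact SQL Sail generates is an assumption.
def partition_queries(dbtable, partition_column, num_partitions):
    """Build one modulo-based query per partition."""
    return [
        f"SELECT * FROM {dbtable} "
        f"WHERE MOD({partition_column}, {num_partitions}) = {pid}"
        for pid in range(num_partitions)
    ]

for query in partition_queries("large_table", "id", 4):
    print(query)
```

Because the predicates are disjoint and cover every row, the four readers together return the full table exactly once.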
```python
df = spark.read.format("postgres").options(
    url="jdbc:postgresql://localhost:5432/mydb",
    user="myuser",
    password="mypassword",
    dbtable="large_table",
    numPartitions="4",
    partitionColumn="id",
).load()
```

## Filter Pushdown
Comparison filters (`=`, `>`, `>=`, `<`, `<=`) are automatically pushed down to PostgreSQL, so only matching rows are transferred over the network.
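Conceptually, a supported comparison is rendered into the generated SQL, while any other filter stays behind and is evaluated by Spark after the rows arrive. The helper below is a hypothetical sketch of that split, not Sail's actual translation code.

```python
# Hypothetical sketch of filter pushdown; Sail's real translation differs.
SUPPORTED_OPS = {"=", ">", ">=", "<", "<="}

def push_down(column, op, value):
    """Return a WHERE clause for a supported comparison, else None."""
    if op not in SUPPORTED_OPS:
        return None  # unsupported filter: evaluated by Spark instead
    return f"WHERE {column} {op} {value}"
```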
```python
# This WHERE clause runs in PostgreSQL, not in Spark
df.filter("age > 25").show()
```

## Options
| Option | Required | Default | Description |
|---|---|---|---|
| `url` | Yes | | PostgreSQL JDBC URL (`jdbc:postgresql://host:port/database`) |
| `user` | Yes | | Username |
| `password` | Yes | | Password |
| `dbtable` | Yes | | Table name |
| `tableSchema` | No | `public` | PostgreSQL schema containing the table |
| `numPartitions` | No | `1` | Number of parallel readers (must be ≥ 1) |
| `partitionColumn` | No | | Column for partitioning (required when `numPartitions` > 1) |
| `fetchsize` | No | `8192` | Number of rows fetched per batch (must be ≥ 1) |
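The constraints above can be summarized in a small validator. This is an illustrative sketch of the rules stated in the table, not Sail's actual option-handling code.

```python
# Sketch of the option constraints from the table above; the validation
# logic here is illustrative, not Sail's actual implementation.
def validate_options(opts):
    """Raise ValueError if the option dict violates the documented rules."""
    for key in ("url", "user", "password", "dbtable"):
        if key not in opts:
            raise ValueError(f"missing required option: {key}")
    num_partitions = int(opts.get("numPartitions", "1"))
    if num_partitions < 1:
        raise ValueError("numPartitions must be >= 1")
    if num_partitions > 1 and "partitionColumn" not in opts:
        raise ValueError("partitionColumn is required when numPartitions > 1")
    if int(opts.get("fetchsize", "8192")) < 1:
        raise ValueError("fetchsize must be >= 1")
```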
