Arrow Semantics
Sail supports all Arrow data types, including those not supported by JVM Spark. This enhances interoperability with other libraries that use Arrow. However, since the Spark Connect client is still used, you need to consider a few limitations when using these data types.
INFO
In the code below, spark refers to a Spark session connected to the Sail server. See the Getting Started guide for how to set this up.
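As a minimal sketch, assuming the Sail server exposes a Spark Connect endpoint at sc://localhost:50051 (the actual address depends on how you start the server), the session can be created like this:

from pyspark.sql import SparkSession

# Connect to the Sail server over Spark Connect.
# The URL below is an assumption; use the address your Sail server listens on.
spark = SparkSession.builder.remote("sc://localhost:50051").getOrCreate()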
Data Reading and Writing
Sail allows you to read and write all Arrow data types as long as the data source supports them. For example, the following operation works in Sail but is not supported in JVM Spark.
spark.sql("SELECT 0::UINT16 AS c").write.parquet("data.parquet")
The Arrow schema is stored in the Parquet file metadata. This allows other libraries that support Arrow to read the data back correctly. For example, DuckDB will read the data as unsigned 16-bit integers.
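For instance, a quick check with the DuckDB Python API (a sketch, reading the same file written above) shows the column type:

import duckdb

# Inspect the schema of the Parquet file written by Sail.
# The column c should appear as USMALLINT, DuckDB's unsigned 16-bit integer type.
duckdb.sql("DESCRIBE SELECT * FROM 'data.parquet'").show()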
If you read the data back in Sail, the data type is preserved. Note that this behavior is different from JVM Spark, where the data type is changed to IntegerType (32-bit signed integer) for the dataset written above.
spark.read.parquet("data.parquet").selectExpr("typeof(c)").show()
INFO
We may change this Parquet reading behavior in the future if matching the JVM Spark behavior turns out to be important for compatibility.
Query Results in Python
Sail uses Arrow internally, so all Arrow data types work naturally throughout query execution. For PySpark applications, however, the Spark Connect client library controls how the Arrow data received from the server is interpreted. So you need to be aware of complications when data crosses the boundary between Sail and the PySpark client. Such data movement happens when "previewing" query results in Python.
For example, the following operations do not fully work even though query execution succeeds within Sail. The query results are either implicitly cast to a different type, or an error is raised due to Arrow types not supported by Spark.
spark.sql("SELECT 0::UINT16").collect()
spark.sql("SELECT 0::UINT16").toPandas()
spark.sql("SELECT 0::UINT16").toArrow()
You can cast the data explicitly to a supported Spark data type before collecting the results in Python, as shown in the sketch below.
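This sketch casts the unsigned 16-bit value to a signed 32-bit integer, a type the Spark Connect client understands, before collecting:

# Cast to a signed 32-bit integer so the client can interpret the result.
spark.sql("SELECT CAST(0::UINT16 AS INT) AS c").collect()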