Changelog
0.3.1
July 7, 2025
- Added support for the following SQL functions (#570, #571, #582, #585, and #586):
  - dayname
  - nullifzero
  - zeroifnull
  - split (partial support)
  - collect_set
  - count_if
- Fixed issues with the from_utc_timestamp SQL function (#596).
- Added support for the DataFrame.sampleBy method in the Spark DataFrame API (#547).
- Added support for the following SQL statements (#588):
  - SHOW COLUMNS
  - SHOW DATABASES
  - SHOW TABLES
  - SHOW VIEWS
- Improved data source listing performance (#579).
- Improved the internal logic of data source options (#587 and #598).
- Updated gRPC server TCP and HTTP configuration (#593).
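
For readers unfamiliar with some of the SQL functions listed above, the following pure-Python sketch mirrors the semantics that Spark documents for nullifzero, zeroifnull, and count_if, with None standing in for SQL NULL. It does not call Sail or Spark; the helper names simply reuse the SQL function names for illustration.

```python
from typing import Iterable, Optional


def nullifzero(value: Optional[float]) -> Optional[float]:
    """Return None (SQL NULL) when the input is zero, otherwise the input."""
    if value is None:
        return None
    return None if value == 0 else value


def zeroifnull(value: Optional[float]) -> float:
    """Return 0 when the input is None (SQL NULL), otherwise the input."""
    return 0 if value is None else value


def count_if(conditions: Iterable[Optional[bool]]) -> int:
    """Count the rows whose condition is TRUE; FALSE and NULL are skipped."""
    return sum(1 for condition in conditions if condition is True)


if __name__ == "__main__":
    print(nullifzero(0), nullifzero(5))
    print(zeroifnull(None), zeroifnull(5))
    print(count_if([True, False, None, True]))
```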
Contributors
Huge thanks to @SparkApplicationMaster for the first contributions related to SQL functions!
Huge thanks to @davidlghellin for the continued contributions related to the Spark DataFrame API!
0.3.0
June 28, 2025
The 0.3.0 release introduces support for Spark 4.0 in Sail, alongside the existing support for Spark 3.5. One of the most notable changes in Spark 4.0 is the new pyspark-client package, a lightweight PySpark client. When using Sail in your PySpark applications, you can now choose to install this client package instead of the full pyspark package that includes all the JAR files.
Here is a summary of the new features and improvements in this release.
- Improved remote data access performance by caching object stores (#515).
- Added support for data reader and writer configuration (#466 and #535).
- Added support for the following SQL functions (#527):
  - crc32
  - sha
  - sha1
- Fixed issues with casting integers to timestamps (#533).
- Fixed issues with the random and randn SQL functions (#530).
- Added support for the DataFrame.sample method in the Spark DataFrame API (#496).
- Added support for Spark 4.0 (#467, #498, and #559).
- Updated the default value of a few configuration options (#565).
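
As a rough reference for what the hash functions added in this release compute, the sketch below reproduces crc32 and sha1 with the Python standard library. It is an illustration of the underlying algorithms only and does not call Sail; the Python helper names are chosen to match the SQL function names, and the alias for sha follows the Spark documentation, which describes sha as returning a SHA-1 hex digest.

```python
import hashlib
import zlib


def crc32(data: bytes) -> int:
    """Unsigned 32-bit cyclic redundancy check of the input bytes."""
    return zlib.crc32(data) & 0xFFFFFFFF


def sha1(data: bytes) -> str:
    """SHA-1 digest of the input bytes as a lowercase hex string."""
    return hashlib.sha1(data).hexdigest()


# Spark SQL documents sha as returning the same SHA-1 hex digest as sha1.
sha = sha1


if __name__ == "__main__":
    print(crc32(b"Spark"))
    print(sha1(b"Spark"))
```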
Breaking Changes
The spark "extra" has been removed from the pysail package. As a result, you can no longer use commands like pip install pysail[spark] to install Sail along with Spark. Instead, you must install the PySpark package separately in your Python environment.
This change allows you to freely choose the PySpark version when using Sail. Depending on your requirements, you can opt for either the pyspark package (Spark 3.5 or later) or the pyspark-client package (introduced in Spark 4.0).
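
Because PySpark is no longer pulled in automatically, it can be handy to check which of the two distributions is present in the current environment. The following is a minimal sketch using only the standard library; the function name find_pyspark_distribution is hypothetical and not part of Sail or PySpark.

```python
from importlib import metadata
from typing import Optional, Tuple

# Distribution names mentioned in the release notes above.
CANDIDATES = ("pyspark", "pyspark-client")


def find_pyspark_distribution() -> Optional[Tuple[str, str]]:
    """Return (name, version) of the first installed candidate, or None."""
    for name in CANDIDATES:
        try:
            return name, metadata.version(name)
        except metadata.PackageNotFoundError:
            continue
    return None


if __name__ == "__main__":
    found = find_pyspark_distribution()
    if found is None:
        print("No PySpark distribution found; install pyspark or pyspark-client.")
    else:
        print(f"Using {found[0]} {found[1]}")
```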
Contributors
We are thrilled by the growing interest from the community. Huge thanks to @rafafrdz, @davidlghellin, @lonless9, and @pimlie for making their first contributions to Sail!
0.2.6
May 14, 2025
- Improved temporal data type casting and display (#448).
- Corrected the time unit for reading INT96 timestamp data from Parquet files (#444).
- Fixed issues with column metadata in the Spark DataFrame API (#447).
- Supported referring to aliased aggregation expressions in Spark SQL GROUP BY and HAVING clauses (#456).
- Supported more data formats and added directory listing endpoints in the MCP server (#455 and #458).
0.2.5
April 22, 2025
- Corrected Spark session default time zone configuration and fixed various issues for timestamp data types (#438).
- Improved object store setup and cluster mode task execution (#432).
0.2.4
April 10, 2025
- Improved MCP server logging (#421).
- Improved AWS S3 data access (#426).
- Supported AWS credential caching (#430).
- Fixed issues with cluster mode task execution (#429).
- Supported exceptAll() and tail() in the Spark DataFrame API (#417).
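
To make the behavior of the two DataFrame methods above concrete, here is a pure-Python sketch of their documented Spark semantics on plain lists of rows: exceptAll() is a multiset difference that preserves duplicates, and tail() returns the last n rows. The helper names except_all and tail are illustrative only, and the sketch does not call Sail; in a real engine the row order of exceptAll() is not guaranteed.

```python
from collections import Counter
from typing import Hashable, List, Sequence


def except_all(left: Sequence[Hashable], right: Sequence[Hashable]) -> List[Hashable]:
    """Multiset difference: each row in `right` cancels one equal row in `left`."""
    remaining = Counter(right)
    result = []
    for row in left:
        if remaining[row] > 0:
            remaining[row] -= 1  # cancelled by a matching row in `right`
        else:
            result.append(row)
    return result


def tail(rows: Sequence[Hashable], n: int) -> List[Hashable]:
    """Return the last `n` rows (an empty list when n is not positive)."""
    return list(rows[-n:]) if n > 0 else []


if __name__ == "__main__":
    left = [("a", 1), ("a", 1), ("b", 2)]
    right = [("a", 1)]
    print(except_all(left, right))
    print(tail(left, 2))
```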
0.2.3
March 21, 2025
- Implemented MCP (Model Context Protocol) server (#410).
- Supported the hf:// protocol for reading Hugging Face datasets (#412).
- Supported glob patterns in data source URLs (#415).
- Supported a few data reader and writer options for CSV files (#414).
- Fixed a few issues with SQL temporary views (#413).
- Improved task error reporting in cluster mode (#409).
0.2.2
March 6, 2025
- Switched to the built-in SQL parser (#338, #358, #359, and #376).
- Supported the majority of Spark SQL syntax (#378, #380, #382, #385, #387, #389, and #390).
- Expanded support for Spark SQL functions (#364, #384, and #391).
- Fixed issues with join() in the Spark DataFrame API (#392).
- Supported NATURAL JOIN in Spark SQL (#396).
- Fixed an issue with SQL window expressions (#386).
- Fixed result parity issues with derived TPC-DS queries (#393).
0.2.1
January 15, 2025
- Supported SQL table functions and lateral views (#326 and #327).
- Supported PySpark UDTFs (#329).
- Improved literal and data type support (#317, #328, #330, and #339).
- Supported ANTI JOIN and SEMI JOIN (#337).
- Fixed a few PySpark UDF issues (#343).
- Supported nested fields in SQL (#340).
- Supported more queries in the derived TPC-DS benchmark (#346).
- Supported more datetime functions (#349).
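
The semi and anti joins mentioned in this release differ from ordinary joins in that they only filter the left side and never add columns from the right. The sketch below shows that idea in pure Python on lists of dictionaries, assuming a single equality key with hashable values; the function names are hypothetical and the code does not call Sail.

```python
from typing import Dict, List

Row = Dict[str, object]


def _right_keys(right: List[Row], key: str) -> set:
    """Collect non-null key values from the right side (NULL never matches)."""
    return {row[key] for row in right if row[key] is not None}


def semi_join(left: List[Row], right: List[Row], key: str) -> List[Row]:
    """Keep each left row once if at least one right row has the same key."""
    keys = _right_keys(right, key)
    return [row for row in left if row[key] in keys]


def anti_join(left: List[Row], right: List[Row], key: str) -> List[Row]:
    """Keep the left rows for which no right row has the same key."""
    keys = _right_keys(right, key)
    return [row for row in left if row[key] not in keys]


if __name__ == "__main__":
    orders = [{"id": 1}, {"id": 2}, {"id": None}]
    shipped = [{"id": 1}, {"id": 1}]
    print(semi_join(orders, shipped, "id"))
    print(anti_join(orders, shipped, "id"))
```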
0.2.0
December 3, 2024
We are excited to announce the first Sail release with the distributed processing capability. Spark SQL and DataFrame queries can now run on Kubernetes, powered by the Sail distributed compute engine. We also introduced a new Sail CLI and a configuration mechanism that will serve as the entrypoint for all Sail features moving forward.
We continued extending coverage for Spark SQL functions and the Spark DataFrame API. The changes are listed below.
- Supported the following DataFrame and SQL functions (#278 and #305):
  - DataFrame.crosstab
  - DataFrame.replace
  - DataFrame.to
  - reverse
  - aes_decrypt
  - aes_encrypt
  - try_aes_decrypt
  - base64
  - unbase64
  - weekofyear
- Supported mapInPandas() and mapInArrow() for Spark DataFrame (#310).
- Supported applyInPandas() for grouped and co-grouped Spark DataFrame (#313).
Breaking Changes
This release comes with the new Sail CLI, and the way to launch the Spark Connect server and PySpark shell is different from the 0.1.x versions. Please refer to the Getting Started page for the updated instructions.
0.1.7
November 1, 2024
- Expanded support for Spark DataFrame functions (#268 and #261). Added full parity and coverage for the following DataFrame and SQL functions:
  - DataFrame.summary
  - DataFrame.describe
  - DataFrame.corr
  - DataFrame.cov
  - DataFrame.stat
  - DataFrame.drop
  - corr
  - regr_avgx
- Fixed most issues with ORDER BY in the derived TPC-DS benchmark, bringing total coverage to 74 out of the 99 queries (#261).
We also made significant changes to the Sail internals to support distributed processing. We are targeting the 0.2.0 release in the next few weeks for an MVP (minimum viable product) of this exciting feature. Please stay tuned! If you are interested in the ongoing work, you can follow #246 in our GitHub repository to get the latest updates!
0.1.6
October 23, 2024
0.1.5
October 17, 2024
- Expanded support for Spark SQL syntax and functions (#239 and #247). Added full parity and coverage for the following SQL functions:
  - current_catalog
  - current_database
  - current_schema
  - hash
  - hex
  - unhex
  - xxhash64
  - unix_timestamp
- Fixed a few issues with JOIN (#250).
0.1.4
October 03, 2024
- Enabled Avro in DataFusion (#234).
- Expanded support for Spark SQL syntax and functions (#213 and #207). Added full parity and coverage for the following SQL functions:
  - array
  - date_format
  - get_json_object
  - json_array_length
  - overlay
  - replace
  - split_part
  - to_date
  - any_value
  - approx_count_distinct
  - current_timezone
  - first_value
  - greatest
  - last
  - last_value
  - least
  - map_contains_key
  - map_keys
  - map_values
  - min_by
  - substr
  - sum_distinct
- Supported HDFS (#196).
- Supported parsing value prefixes followed by whitespace (#218 and lakehq/sqlparser-rs#6).
- Added basic support for Python UDAF (#214).
Contributors
Huge thanks to our first community contributor, @skewballfox, for adding support for HDFS!
0.1.3
September 18, 2024
- Supported column positions in GROUP BY and ORDER BY (#205).
- Expanded support for INSERT statements (#195).
- Fixed issues with Spark configuration (#192).
- Expanded support for CREATE and REPLACE statements (#183).
- Supported GROUPING SETS aggregation (#184).
- Integrated fastrace for more performant logging and tracing (#166).
- Enabled gzip and zstd compression in Tonic (#166).
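
"Column positions" in GROUP BY and ORDER BY means referring to a SELECT column by its 1-based index rather than by name. The sketch below demonstrates the syntax with Python's built-in sqlite3 module, which accepts the same form; SQLite is used here purely as a stand-in so the example runs without Sail or Spark, and the sales table is made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 10), ("west", 5), ("east", 7), ("west", 1)],
)

# 1 refers to the first SELECT column (region), 2 to the second (the SUM).
by_position = conn.execute(
    "SELECT region, SUM(amount) AS total FROM sales GROUP BY 1 ORDER BY 2"
).fetchall()

# The same query with the columns spelled out by name.
by_name = conn.execute(
    "SELECT region, SUM(amount) AS total FROM sales GROUP BY region ORDER BY total"
).fetchall()
conn.close()

print(by_position)
print(by_name)
```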
0.1.2
September 10, 2024
- Fixed issues with aggregation queries.
- Extended support for SQL functions.
- Added support for temporary views and global temporary views.
0.1.1
September 03, 2024
- Extended support for SQL statements and SQL functions.
- Fixed a performance issue for the PySpark DataFrame show() method.
0.1.0
August 29, 2024
This is the first Sail release.