Changelog
0.5.1
February 15, 2026
- Added support for `DataFrame.mergeInto()` in the Spark DataFrame API (#1273).
- Added support for the `TABLESAMPLE` clause for SQL queries (#1332).
- Added support for the PySpark Python Data Source API and batch data reader (#1291, #1336, #1353, and #1374).
- Added support for partition transforms in the Spark `DataFrameWriterV2` API (#1307).
- Added support for subquery expressions in the Spark DataFrame API (#1289 and #1356).
- Added support for the `DESCRIBE TABLE` SQL statement (#1364).
- Added support for geospatial types for Spark 4.1 (#1325).
- Added support for duplicated CTE names with shadowing behavior (#1331).
- Added support for the `TRY_CAST` expression in SQL queries (#1349).
- Added support for the following SQL functions (#1071, #1264, #1322, #1323, #1324, and #1347): `percentile`, `monotonically_increasing_id`, `histogram_numeric`, `regexp_substr`, `format_number`, and `format_string`.
- Improved the following SQL functions (#1204, #1286, #1327, #1329, and #1341): `sequence`, `ntile`, `concat`, `array_concat`, `concat_ws`, `date_diff`, `datediff`, `date_format`, `to_date`, `unix_timestamp`, `to_timestamp`, `try_to_timestamp`, and `from_unixtime`.
- Improved `ON` condition column resolution in semi joins and anti joins (#1357).
- Improved `DESCRIBE` SQL statement parsing (#1366).
- Improved the task stream logic in distributed query execution (#1367).
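As in Spark SQL, the TRY_CAST expression returns NULL when a cast fails instead of raising an error. A minimal Python sketch of that behavior for integer casts (illustrative only, not Sail code):

```python
# Illustrative sketch of TRY_CAST semantics: a cast that would fail
# yields None (SQL NULL) instead of raising an error.
def try_cast_int(value):
    try:
        return int(value)
    except (TypeError, ValueError):
        return None

print(try_cast_int("42"))   # 42
print(try_cast_int("abc"))  # None
```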
Contributors
Huge thanks to @davidlghellin, @pomykalakyle, @santosh-d3vpl3x (first-time contributor), @james-willis (first-time contributor), and @wudihero2 for your contributions!
0.5.0
February 6, 2026
- Redesigned the control plane for distributed query execution (#1164, #1242, #1247, #1265, and #1280).
- Added support for system catalog (#1216).
- Added support for AWS Glue catalog (#1254 and #1279).
- Added support for OneLake catalog (#1217 and #1228).
- Added support for partition transforms for Iceberg REST catalog (#1269).
- Added support for the `SHOW CATALOGS` and `USE CATALOG` SQL statements (#1288).
- Improved Delta Lake integration (#1222).
- Improved Iceberg integration (#1169).
- Added support for the `CREATE TABLE ... AS SELECT ...` (CTAS) statement (#1236).
- Added support for the `inferSchema` option for the CSV data source (#1223).
- Added support for `DataFrame.colRegex()` in the Spark DataFrame API (#1243).
- Added support for `StructType.toDDL()` in the Spark DataFrame API (#1285).
- Added support for the following SQL functions (#1200, #1206, #1218, #1253, #1258, #1263, #1268, and #1276): `percentile_disc`, `to_json`, `regex_extract`, `soundex`, `randstr`, `array_sort` (without the lambda argument), `array_join`, `array_concat`, and `try_url_decode`.
- Improved the following SQL functions (#1186, #1187, #1252, #1256, #1257, #1260, #1262, #1277, and #1282): `json_array_length`, `get_json_object`, `json_object_keys`, `first_value`, `last_value`, `skewness`, `kurtosis`, `collect_set`, `collect_list`, `max_by`, `min_by`, `count_if`, `arrays_zip`, and `flatten`.
- Added support for negation and the `signum` SQL function for interval data types (#1275).
- Fixed issues with the `Column.try_cast()` method in the Spark DataFrame API to handle invalid date and timestamp values correctly (#1221).
- Improved the `DataFrame.randomSplit()` method in the Spark DataFrame API to ensure a deterministic order (#1235).
- Improved the join reorder optimizer (#1234).
- Added memory and disk configuration options (#1311).
- Added support for inferring the default catalog when only one catalog is configured (#1311).
Breaking Changes
- Python 3.9 is no longer supported, as it has reached its end-of-life (EOL) (#1302).
- The `SparkConnectServer.init_telemetry()` method was removed from the Python API (#1302 and #1319). OpenTelemetry is now initialized automatically when the first `SparkConnectServer` instance is created, and OpenTelemetry shutdown is registered as a Python `atexit` handler.
- The `cluster.worker_stream_buffer` configuration option was renamed to `cluster.task_stream_buffer` (#1309).
- The `cluster.job_output_buffer` configuration option was removed since it is no longer needed (#1316).
Contributors
Huge thanks to @davidlghellin, @pomykalakyle, @djouallah (first-time contributor), @fafacao86 (first-time contributor), and @wudihero2 (first-time contributor) for your contributions!
0.4.6
January 13, 2026
- Improved Delta Lake integration (#1146, #1159, #1158, and #1161).
- Improved the internals for session management (#1138).
- Added support for reading CSV files with truncated rows (#1185).
- Added the configuration option for default parallelism (#1198).
- Added the `percentile_cont` SQL aggregate function (#1188).
- Added support for non-literal expressions for map extraction (#1193).
- Added support for wildcards in the `struct` SQL function (#1197).
- Fixed an issue with null value handling in the `map_concat` SQL function (#1194).
- Updated the PySpark compatibility checker example and function support status (#1127).
- Updated the TPC-H benchmark example (#1179).
- Added support for Spark 4.1.1 (#1199).
Contributors
Huge thanks to @davidlghellin, @keen85, and @pomykalakyle (first-time contributor) for your contributions!
0.4.5
December 22, 2025
- Added basic support for the Delta Lake merge operation (#1093, #1133, #1139, and #1144).
- Improved distributed query execution (#1128, #1134, #1135, and #1137).
- Improved Spark Connect server logic (#1126 and #1140).
- Added support for removing sessions (#1125).
- Added support for metrics and checkpoints for Delta Lake (#1136).
- Improved OpenTelemetry metric reporting (#1119).
- Improved the following SQL functions (#1105): `make_dt_interval`, `make_interval`, `hex`, and `elt`.
- Updated Parquet configuration options (#1141).
- Updated the Spark Connect protocol for Spark 4.1 (#1145 and #1148).
- Fixed an issue with the `EXPLAIN` statement output (#1147).
Contributors
Huge thanks to @davidlghellin for your contributions!
0.4.4
December 12, 2025
- Improved Delta Lake and Iceberg integration (#1098, #1095, #1108, #1115, #1109, and #1117).
- Added support for exporting logs, metrics, and traces to OpenTelemetry collectors (#1097, #1104, and #1116).
- Added a Python example for reporting Sail compatibility for PySpark code (#1075).
- Added support for customizing pod labels for Sail workers in Kubernetes deployments (#1103).
- Added support for the following SQL functions (#1106): `shuffle`, `bitwise_not`, and `format_string`.
- Improved the output of the `EXPLAIN` statement (#1110).
- Fixed a few shuffle planning issues in distributed query execution (#1111).
- Fixed an issue with the `LIMIT` clause in distributed query execution (#1121).
- Improved the data source implementation (#1099).
Contributors
Huge thanks to @davidlghellin, @zemin-piao, @keen85 (first-time contributor), @YichiZhang0613 (first-time contributor), and @gstvg (first-time contributor) for your contributions!
0.4.3
November 26, 2025
- Added schema evolution support for Iceberg (#1048).
- Improved the following SQL functions (#1049, #1056, #1057, and #1077): `max_by`, `min_by`, `signum`, `greatest`, `least`, and `div`.
- Added support for `EXPLAIN` in SQL statements (#1078).
Contributors
Huge thanks to @davidlghellin for your contributions!
0.4.2
November 13, 2025
- Added support for column mapping for Delta Lake (#985).
- Added support for time travel for Iceberg (#1039).
- Added support for Unity Catalog (#1005).
- Improved Iceberg integration (#1006, #1009, and #1042).
- Added the `luhn_check` SQL function (#909).
- Improved the following SQL functions (#909 and #1024): `bit_count`, `bit_get`, `getbit`, `crc32`, `sha`, `sha1`, `expm1`, `pmod`, `width_bucket`, `bitmap_count`, and `to_date`.
- Added the `try_avg` SQL aggregate function (#1012).
- Added support for the `try_sum` and `try_avg` SQL aggregate functions in window expressions (#1040).
Contributors
Huge thanks to @davidlghellin for the contribution!
0.4.1
November 2, 2025
- Added support for writing partitioned Iceberg tables (#1003).
- Added the `try_sum` SQL aggregate function (#960).
- Fixed a filter pushdown performance issue (#1008).
Contributors
Huge thanks to @davidlghellin for the contribution!
0.4.0
October 29, 2025
- Added basic support for reading and writing Iceberg tables (#944, #987, #976, #994, and #997).
- Added support for Iceberg REST catalog (#961, #974, #993, and #995).
- Improved Delta Lake integration (#921).
- Added support for multiple arguments for the `count_distinct` SQL function (#957).
- Added a guide for HDFS Kerberos authentication (#992).
- Updated a few execution configuration options (#975).
- Fixed a cost estimation issue with the join reorder optimizer (#969).
Contributors
Huge thanks to @SparkApplicationMaster, @davidlghellin, and @zemin-piao (first-time contributor) for the contributions!
0.3.7
October 3, 2025
- Improved error reporting for the SQL parser (#938).
- Added support for the `DataFrame.unpivot()` method in the Spark DataFrame API (#948).
Contributors
Huge thanks to @SparkApplicationMaster for the continued contributions!
0.3.6
September 30, 2025
- Added support for the binary file format (#853).
- Implemented an experimental join reorder physical optimizer using the DPhyp algorithm (#810 and #917). This optimizer is not enabled by default but can be enabled via configuration options.
- Added support for file metadata caching to improve read performance for the Parquet data source (#928).
- Added support for the PySpark UDF `applyInArrow()` method in the Spark DataFrame API for grouped and cogrouped data (#886 and #887).
- Added support for time travel for Delta Lake (#854).
- Added support for the delete operation for Delta Lake (#856).
- Improved Delta Lake integration (#848 and #916).
- Added support for the following SQL functions (#820, #841, #824, #843, #835, #855, #859, and #860): `elt`, `inline`, `inline_outer`, `try_parse_url`, `stack`, `make_dt_interval`, `version`, `months_between`, `user`, and `session_user`.
- Improved the following SQL functions (#841, #847, #878, #920, #926, and #914): `array`, `try_multiply`, `map_from_arrays`, `map_from_entries`, and `approx_count_distinct`.
- Added support for using all aggregate functions in window expressions (#861).
- Fixed issues with sorting by aggregate expressions (#915).
- Fixed issues with session key generation when the user ID is missing on the Windows platform (#849).
- Continued the work for data streaming support (#832).
- Added batch view creation endpoints in the MCP server (#875).
- Added an example of using Kustomize with pod templates for Sail workers (#833).
- Fixed input repartitioning issues for PySpark UDTFs (#662).
- Fixed issues with the `DataFrame.replace()` method in the Spark DataFrame API (#891).
- Added support for the `REAL` data type in the SQL parser (#892).
- Fixed various literal parsing issues in the SQL parser (#868, #872, and #873).
- Fixed issues with PySpark UDFs with no arguments (#895).
Contributors
Huge thanks to @SparkApplicationMaster, @davidlghellin, and @rafafrdz for the continued contributions!
0.3.5
September 5, 2025
- Fixed issues with writing partitioned data to Delta Lake tables (#837).
- Improved type inference for `NULL` map values in the `VALUES` SQL clause (#829).
Contributors
Huge thanks to @SparkApplicationMaster for the continued contributions!
0.3.4
September 3, 2025
- Added support for the text file format (#737 and #813).
- Added support for the Spark DataFrame streaming API and added a few data sources/sinks for testing purposes (#751). This provides a foundation for streaming support in Sail but is not ready for general use yet.
- Improved the internals of the Delta Lake integration (#768 and #794).
- Improved idle session handling (#761 and #818).
- Fixed performance issues with the `DataFrame.show()` method in the Spark DataFrame API (#790).
- Fixed issues with reading and writing compressed files (#760).
- Fixed SQL parsing issues with negated predicates (#776).
- Fixed issues with the `DataFrame.withColumnsRenamed()` method in the Spark DataFrame API (#764).
- Fixed issues with the `DataFrame.withColumns()` method in the Spark DataFrame API (#814).
- Added support for the following SQL functions (#727, #682, #774, #777, #779, #787, #762, #795, and #798): `try_mod`, `make_interval`, `map_entries`, `map_from_entries`, `map_concat`, `str_to_map`, `width_bucket`, and `regexp_instr`.
- Improved the following SQL functions (#682, #767, #769, #777, #722, #785, #789, #795, #801, and #806): `try_add`, `try_divide`, `try_multiply`, `try_subtract`, `nth_value`, `median`, `map`, `map_from_arrays`, `split`, `element_at`, `try_element_at`, `position`, `locate`, `get_json_object`, `json_object_keys`, and `collect_list`.
- Improved a few window functions to return the correct types of integers (#765).
- Improved the implementation of array functions (#786).
- Improved the implementation of string functions (#798).
Contributors
Huge thanks to @SparkApplicationMaster, @davidlghellin, and @rafafrdz for the continued contributions!
0.3.3
August 14, 2025
- Fixed issues with physical planning to avoid performance degradation when querying Delta Lake tables (#750).
- Fixed issues with the `Catalog.getTable()` method in the Spark DataFrame API (#752).
- Added support for `NaN` values in `VALUES` (#739).
- Fixed issues with the `parquet.bloom_filter_on_write` configuration option not being respected (#735).
- Added support for the following SQL functions (#670 and #725): `try_to_number`, `convert_timezone`, and `make_timestamp_ltz`.
- Improved the following SQL functions (#725, #730, #734, #743, #754, and #756): `make_timestamp`, `from_utc_timestamp`, `to_utc_timestamp`, `skewness`, `kurtosis`, `ln`, `log`, `log10`, `log1p`, `log2`, `acos`, `acosh`, `asin`, `asinh`, `atan`, `atan2`, `atanh`, `cbrt`, `cos`, `cosh`, `cot`, `csc`, `degrees`, `exp`, `radians`, `sec`, `sin`, `sinh`, `sqrt`, `tan`, `tanh`, `json_array_length`, and `map_contains_key`.
Contributors
Huge thanks to @SparkApplicationMaster and @rafafrdz for the continued contributions!
0.3.2
August 8, 2025
- Added support for reading and writing Delta Lake tables (#578, #634, #677, #680, #716, #717, and #723).
- Added support for Azure storage services and Google Cloud Storage (GCS), and improved support for S3 (#616 and #706).
- Added support for file listing cache and file statistics cache (#709 and #712).
- Added support for the following SQL functions and operators (#529, #580, #633, #645, #638, #654, #539, #661, #629, #676, #672, #635, #683, #702, #698, #708, #713, and #719): `from_csv`, `bround`, `conv`, `csc`, `sec`, `bit_count`, `bit_get`, `getbit`, `shiftrightunsigned`, `>>>`, `~`, `array_insert`, `listagg`, `string_agg`, `parse_url`, `url_decode`, `url_encode`, `bitmap_bit_position`, `bitmap_bucket_number`, `bitmap_count`, `to_number`, `to_utc_timestamp`, `try_add`, `try_divide`, `try_multiply`, `try_subtract`, `monthname`, `arrays_zip`, `is_valid_utf8`, `try_validate_utf8`, `validate_utf8`, and `make_valid_utf8`.
- Added support for the `Column.try_cast()` method in the Spark DataFrame API (#694).
- Improved the following SQL functions (#609, #613, #619, #621, #623, #617, #640, #644, #642, #643, #647, #660, #666, #674, and #701): `date_part`, `datepart`, `extract`, `nullifzero`, `zeroifnull`, `array_contains`, `array_position`, `array_append`, `array_prepend`, `array_size`, `cardinality`, `size`, `array_agg`, `collect_set`, `flatten`, `arrays_overlap`, `concat`, `map`, `ltrim`, `rtrim`, `trim`, `avg`, and `to_unix_timestamp`.
- Fixed issues with the `DataFrame.na.drop()` and `DataFrame.dropna()` methods in the Spark DataFrame API (#693).
- Fixed issues with casting timestamp and interval values from and to numeric values (#691).
- Fixed incorrect eager execution behavior of the `CASE` expression (#649).
- Fixed issues with PySpark UDF and UDTF execution (#652 and #658).
- Fixed issues with expression naming (#668 and #685).
- Improved the implementation of SQL math functions (#699).
- Improved the internals of catalog management, data reader, and data writer (#592, #615, #628, #632, #681, #688, #705, and #707).
Contributors
Shoutout to @SparkApplicationMaster for contributions across bug fixes, features, and enhancements! Huge thanks to @rafafrdz, @davidlghellin, @anhvdq (first-time contributor), and @jamesfricker (first-time contributor) for helping to further extend our parity with Spark SQL functions!
0.3.1
July 7, 2025
- Added support for the following SQL functions (#570, #571, #582, #585, and #586): `dayname`, `nullifzero`, `zeroifnull`, `split` (partial support), `collect_set`, and `count_if`.
- Fixed issues with the `from_utc_timestamp` SQL function (#596).
- Added support for the `DataFrame.sampleBy()` method in the Spark DataFrame API (#547).
- Added support for the following SQL statements (#588): `SHOW COLUMNS`, `SHOW DATABASES`, `SHOW TABLES`, and `SHOW VIEWS`.
- Improved data source listing performance (#579).
- Improved the internal logic of data source options (#587 and #598).
- Updated gRPC server TCP and HTTP configuration (#593).
Contributors
Huge thanks to @SparkApplicationMaster for the first contributions related to SQL functions! Huge thanks to @davidlghellin for the continued contributions related to the Spark DataFrame API!
0.3.0
June 28, 2025
The 0.3.0 release introduces support for Spark 4.0 in Sail, alongside the existing support for Spark 3.5. One of the most notable changes in Spark 4.0 is the new `pyspark-client` package, a lightweight PySpark client. When using Sail in your PySpark applications, you can now choose to install this client package instead of the full `pyspark` package that includes all the JAR files.
Here is a summary of the new features and improvements in this release.
- Improved remote data access performance by caching object stores (#515).
- Added support for data reader and writer configuration (#466 and #535).
- Added support for the following SQL functions (#527): `crc32`, `sha`, and `sha1`.
- Fixed issues with casting integers to timestamps (#533).
- Fixed issues with the `random` and `randn` SQL functions (#530).
- Added support for the `DataFrame.sample()` method in the Spark DataFrame API (#496).
- Added support for Spark 4.0 (#467, #498, and #559).
- Updated the default value of a few configuration options (#565).
Breaking Changes
The `spark` extra has been removed from the `pysail` package. As a result, you can no longer use commands like `pip install "pysail[spark]"` to install Sail along with Spark. Instead, you must install the PySpark package separately in your Python environment.
This change allows you to freely choose the PySpark version when using Sail. Depending on your requirements, you can opt for either the `pyspark` package (Spark 3.5 or later) or the `pyspark-client` package (introduced in Spark 4.0).
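Under the new scheme, installation becomes two explicit steps. A sketch of the commands (the choice of client and any version pins are up to you):

```shell
# Install Sail without a bundled PySpark.
pip install pysail

# Then install one PySpark distribution of your choice:
pip install pyspark          # full package (Spark 3.5 or later)
# or
pip install pyspark-client   # lightweight client (Spark 4.0+)
```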
Contributors
We are thrilled by the growing interest from the community. Huge thanks to @rafafrdz, @davidlghellin, @lonless9, and @pimlie for making their first contributions to Sail!
0.2.6
May 14, 2025
- Improved temporal data type casting and display (#448).
- Corrected the time unit for reading `INT96` timestamp data from Parquet files (#444).
- Fixed issues with column metadata in the Spark DataFrame API (#447).
- Added support for referring to aliased aggregation expressions in Spark SQL `GROUP BY` and `HAVING` clauses (#456).
- Added support for more data formats and added directory listing endpoints in the MCP server (#455 and #458).
0.2.5
April 22, 2025
- Corrected Spark session default time zone configuration and fixed various issues for timestamp data types (#438).
- Improved object store setup and cluster mode task execution (#432).
0.2.4
April 10, 2025
- Improved MCP server logging (#421).
- Improved AWS S3 data access (#426).
- Added support for AWS credential caching (#430).
- Fixed issues with cluster mode task execution (#429).
- Added support for `exceptAll()` and `tail()` in the Spark DataFrame API (#417).
0.2.3
March 21, 2025
- Implemented MCP (Model Context Protocol) server (#410).
- Added support for the `hf://` protocol for reading Hugging Face datasets (#412).
- Added support for glob patterns in data source URLs (#415).
- Added support for a few data reader and writer options for CSV files (#414).
- Fixed a few issues with SQL temporary views (#413).
- Improved task error reporting in cluster mode (#409).
0.2.2
March 6, 2025
- Switched to the built-in SQL parser (#338, #358, #359, and #376).
- Added support for the majority of Spark SQL syntax (#378, #380, #382, #385, #387, #389, and #390).
- Expanded support for Spark SQL functions (#364, #384, and #391).
- Fixed issues with `join()` in the Spark DataFrame API (#392).
- Added support for `NATURAL JOIN` in Spark SQL (#396).
- Fixed an issue with SQL window expressions (#386).
- Fixed result parity issues with derived TPC-DS queries (#393).
0.2.1
January 15, 2025
- Added support for SQL table functions and lateral views (#326 and #327).
- Added support for PySpark UDTFs (#329).
- Improved literal and data type support (#317, #328, #330, and #339).
- Added support for `ANTI JOIN` and `SEMI JOIN` (#337).
- Fixed a few PySpark UDF issues (#343).
- Added support for nested fields in SQL (#340).
- Added support for more queries in the derived TPC-DS benchmark (#346).
- Added support for more datetime functions (#349).
0.2.0
December 3, 2024
We are excited to announce the first Sail release with the distributed processing capability. Spark SQL and DataFrame queries can now run on Kubernetes, powered by the Sail distributed compute engine. We also introduced a new Sail CLI and a configuration mechanism that will serve as the entrypoint for all Sail features moving forward.
We continued extending coverage for Spark SQL functions and the Spark DataFrame API. The changes are listed below.
- Added support for the following DataFrame and SQL functions (#278 and #305): `DataFrame.crosstab`, `DataFrame.replace`, `DataFrame.to`, `reverse`, `aes_decrypt`, `aes_encrypt`, `try_aes_decrypt`, `base64`, `unbase64`, and `weekofyear`.
- Added support for `mapInPandas()` and `mapInArrow()` for Spark DataFrames (#310).
- Added support for `applyInPandas()` for grouped and co-grouped Spark DataFrames (#313).
Breaking Changes
This release comes with the new Sail CLI, and the way to launch the Spark Connect server and PySpark shell is different from the 0.1.x versions. Please refer to the Getting Started page for the updated instructions.
0.1.7
November 1, 2024
- Expanded support for Spark DataFrame functions (#268 and #261). Added full parity and coverage for the following DataFrame and SQL functions: `DataFrame.summary`, `DataFrame.describe`, `DataFrame.corr`, `DataFrame.cov`, `DataFrame.stat`, `DataFrame.drop`, `corr`, and `regr_avgx`.
- Fixed most issues with `ORDER BY` in the derived TPC-DS benchmark, bringing total coverage to 74 out of the 99 queries (#261).
We also made significant changes to the Sail internals to support distributed processing. We are targeting the 0.2.0 release in the next few weeks for an MVP (minimum viable product) of this exciting feature. Please stay tuned! If you are interested in the ongoing work, you can follow #246 in our GitHub repository to get the latest updates!
0.1.6
October 23, 2024
0.1.5
October 17, 2024
- Expanded support for Spark SQL syntax and functions (#239 and #247). Added full parity and coverage for the following SQL functions: `current_catalog`, `current_database`, `current_schema`, `hash`, `hex`, `unhex`, `xxhash64`, and `unix_timestamp`.
- Fixed a few issues with `JOIN` (#250).
0.1.4
October 3, 2024
- Enabled Avro in DataFusion (#234).
- Expanded support for Spark SQL syntax and functions (#213 and #207). Added full parity and coverage for the following SQL functions: `array`, `date_format`, `get_json_object`, `json_array_length`, `overlay`, `replace`, `split_part`, `to_date`, `any_value`, `approx_count_distinct`, `current_timezone`, `first_value`, `greatest`, `last`, `last_value`, `least`, `map_contains_key`, `map_keys`, `map_values`, `min_by`, `substr`, and `sum_distinct`.
- Added support for HDFS (#196).
- Added support for parsing value prefixes followed by whitespace (#218 and lakehq/sqlparser-rs#6).
- Added basic support for Python UDAF (#214).
Contributors
Huge thanks to our first community contributor, @skewballfox, for adding support for HDFS!
0.1.3
September 18, 2024
- Added support for column positions in `GROUP BY` and `ORDER BY` (#205).
- Expanded support for `INSERT` statements (#195).
- Fixed issues with Spark configuration (#192).
- Expanded support for `CREATE` and `REPLACE` statements (#183).
- Added support for `GROUPING SETS` aggregation (#184).
- Integrated fastrace for more performant logging and tracing (#166).
- Enabled gzip and zstd compression in Tonic (#166).
0.1.2
September 10, 2024
- Fixed issues with aggregation queries.
- Extended support for SQL functions.
- Added support for temporary views and global temporary views.
0.1.1
September 03, 2024
- Extended support for SQL statements and SQL functions.
- Fixed a performance issue with the PySpark DataFrame `show()` method.
0.1.0
August 29, 2024
This is the first Sail release.
