Configuration
This page lists all the available configuration options for Sail.
WARNING
The Sail configuration system is not stable yet.
For options with the ⚠ note, breaking changes can happen across versions without notice.
For all other options, breaking changes can happen across minor versions for the 0.x releases. Such changes will be documented in the changelog.
INFO
The default value of each option is shown as its string representation that can be used directly in environment variables.
Core Options
mode
    Environment variable: SAIL_MODE
    Default: local
    The running mode for Sail. Valid values are local, local-cluster, and kubernetes-cluster.
    - In local mode, Sail runs in a single process, while query execution is still parallelized via threads.
    - In local-cluster mode, Sail starts a cluster within a single process. The driver and workers run on different threads in the same process and communicate with each other via RPC.
    - In kubernetes-cluster mode, Sail manages a cluster in Kubernetes. The driver and workers run in separate pods and communicate with each other via RPC.
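The option names and environment variable names on this page appear to follow one pattern: the SAIL_ prefix, upper-case letters, and a double underscore in place of each dot. The Python sketch below encodes that observed pattern. The helper names env_var_name, env_value, and to_env are hypothetical and not part of Sail, and the lowercase true/false rendering is inferred from the defaults listed on this page.

```python
def env_var_name(option: str) -> str:
    """Map a dotted option name to the environment variable pattern seen on this page."""
    return "SAIL_" + option.upper().replace(".", "__")


def env_value(value) -> str:
    """Render a value as a string; booleans use lowercase true/false as in the listed defaults."""
    if isinstance(value, bool):
        return "true" if value else "false"
    return str(value)


def to_env(options: dict) -> dict:
    """Convert a mapping of option names to values into environment variable assignments."""
    return {env_var_name(key): env_value(value) for key, value in options.items()}


env = to_env({
    "mode": "local-cluster",
    "cluster.worker_initial_count": 4,
    "cluster.enable_tls": False,
})
for name, value in env.items():
    print(f"{name}={value}")
```

The resulting dictionary could be merged into the environment of whatever process starts Sail; how that process is launched is outside the scope of this sketch.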
Cluster Options
cluster.driver_external_host
    Environment variable: SAIL_CLUSTER__DRIVER_EXTERNAL_HOST
    Default: 127.0.0.1
    The external host for the worker to connect to the driver.
cluster.driver_external_port
    Environment variable: SAIL_CLUSTER__DRIVER_EXTERNAL_PORT
    Default: 0
    The external port for the worker to connect to the driver. If the value is 0, the port is assumed to be the same as the port on which the driver listens.
cluster.driver_listen_host
    Environment variable: SAIL_CLUSTER__DRIVER_LISTEN_HOST
    Default: 127.0.0.1
    The host on which the driver listens.
cluster.driver_listen_port
    Environment variable: SAIL_CLUSTER__DRIVER_LISTEN_PORT
    Default: 0
    The port on which the driver listens. If the value is 0, a random port is assigned by the operating system.
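The two meanings of 0 for the driver ports can be illustrated with a short Python sketch. Binding a socket to port 0 is the standard way to ask the operating system for a free port. The function effective_external_port is a hypothetical helper that mirrors the rule described for cluster.driver_external_port; it is not Sail's actual code.

```python
import socket


def effective_external_port(external_port: int, listen_port: int) -> int:
    """Return the port workers would connect to (assumed rule: 0 means 'same as listen port')."""
    return listen_port if external_port == 0 else external_port


# Ask the operating system for a free port by binding to port 0.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
    sock.bind(("127.0.0.1", 0))
    listen_port = sock.getsockname()[1]

print(effective_external_port(0, listen_port) == listen_port)  # prints True
print(effective_external_port(7077, listen_port))  # prints 7077
```

A non-zero external port is typically useful when the driver sits behind a proxy or port mapping, so that the advertised port differs from the bound one.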
cluster.enable_tls
    Environment variable: SAIL_CLUSTER__ENABLE_TLS
    Default: false
    Whether to enable TLS for cluster communication.
cluster.job_output_buffer
    Environment variable: SAIL_CLUSTER__JOB_OUTPUT_BUFFER
    Default: 16
    The number of batches to buffer in the job output stream.
cluster.rpc_retry_strategy.exponential_backoff.factor
    Environment variable: SAIL_CLUSTER__RPC_RETRY_STRATEGY__EXPONENTIAL_BACKOFF__FACTOR
    Default: 2
    The factor by which the delay increases after each retry when using the exponential_backoff retry strategy.
cluster.rpc_retry_strategy.exponential_backoff.initial_delay_secs
    Environment variable: SAIL_CLUSTER__RPC_RETRY_STRATEGY__EXPONENTIAL_BACKOFF__INITIAL_DELAY_SECS
    Default: 1
    The initial delay in seconds between retries for RPC requests when using the exponential_backoff retry strategy.
cluster.rpc_retry_strategy.exponential_backoff.max_count
    Environment variable: SAIL_CLUSTER__RPC_RETRY_STRATEGY__EXPONENTIAL_BACKOFF__MAX_COUNT
    Default: 3
    The maximum number of retries for RPC requests when using the exponential_backoff retry strategy.
cluster.rpc_retry_strategy.exponential_backoff.max_delay_secs
    Environment variable: SAIL_CLUSTER__RPC_RETRY_STRATEGY__EXPONENTIAL_BACKOFF__MAX_DELAY_SECS
    Default: 5
    The maximum delay in seconds between retries for RPC requests when using the exponential_backoff retry strategy.
cluster.rpc_retry_strategy.fixed.delay_secs
    Environment variable: SAIL_CLUSTER__RPC_RETRY_STRATEGY__FIXED__DELAY_SECS
    Default: 5
    The delay in seconds between retries for RPC requests when using the fixed retry strategy.
cluster.rpc_retry_strategy.fixed.max_count
    Environment variable: SAIL_CLUSTER__RPC_RETRY_STRATEGY__FIXED__MAX_COUNT
    Default: 3
    The maximum number of retries for RPC requests when using the fixed retry strategy.
cluster.rpc_retry_strategy.type
    Environment variable: SAIL_CLUSTER__RPC_RETRY_STRATEGY__TYPE
    Default: fixed
    The retry strategy for driver and worker RPC requests. Valid values are fixed and exponential_backoff.
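To see how the retry options interact, the Python sketch below computes the sequence of delays each strategy would produce. It assumes the first retry waits for the initial delay, each later delay is multiplied by the factor, and every delay is capped at the maximum; Sail's actual scheduling (for example, any jitter) may differ. The function retry_delays is a hypothetical helper whose defaults are the values listed above.

```python
def retry_delays(strategy: str, *, max_count: int = 3, delay_secs: float = 5,
                 initial_delay_secs: float = 1, factor: float = 2,
                 max_delay_secs: float = 5) -> list:
    """Return the assumed delay (in seconds) before each retry attempt."""
    if strategy == "fixed":
        # The same delay before every retry.
        return [delay_secs] * max_count
    if strategy == "exponential_backoff":
        delays = []
        delay = initial_delay_secs
        for _ in range(max_count):
            # Cap each delay at the configured maximum.
            delays.append(min(delay, max_delay_secs))
            delay *= factor
        return delays
    raise ValueError(f"unknown retry strategy: {strategy}")


print(retry_delays("fixed"))                             # prints [5, 5, 5]
print(retry_delays("exponential_backoff"))               # prints [1, 2, 4]
print(retry_delays("exponential_backoff", max_count=5))  # prints [1, 2, 4, 5, 5]
```

With the default values, the cap of 5 seconds only takes effect from the fourth retry onward, so it matters only if max_count is raised above 3.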
cluster.task_launch_timeout_secs
    Environment variable: SAIL_CLUSTER__TASK_LAUNCH_TIMEOUT_SECS
    Default: 120
    The timeout in seconds for launching a task.
cluster.worker_external_host
    Environment variable: SAIL_CLUSTER__WORKER_EXTERNAL_HOST
    Default: 127.0.0.1
    The external host for other workers to connect to the worker.
cluster.worker_external_port
    Environment variable: SAIL_CLUSTER__WORKER_EXTERNAL_PORT
    Default: 0
    The external port for other workers to connect to the worker. If the value is 0, the port is assumed to be the same as the port on which the worker listens.
cluster.worker_heartbeat_interval_secs
    Environment variable: SAIL_CLUSTER__WORKER_HEARTBEAT_INTERVAL_SECS
    Default: 10
    The interval in seconds for worker heartbeats.
cluster.worker_heartbeat_timeout_secs
    Environment variable: SAIL_CLUSTER__WORKER_HEARTBEAT_TIMEOUT_SECS
    Default: 120
    The timeout in seconds for worker heartbeats.
cluster.worker_initial_count
    Environment variable: SAIL_CLUSTER__WORKER_INITIAL_COUNT
    Default: 4
    The initial number of workers to launch.
cluster.worker_launch_timeout_secs
    Environment variable: SAIL_CLUSTER__WORKER_LAUNCH_TIMEOUT_SECS
    Default: 120
    The timeout in seconds for launching a worker.
cluster.worker_listen_host
    Environment variable: SAIL_CLUSTER__WORKER_LISTEN_HOST
    Default: 127.0.0.1
    The host on which the worker listens.
cluster.worker_listen_port
    Environment variable: SAIL_CLUSTER__WORKER_LISTEN_PORT
    Default: 0
    The port on which the worker listens. If the value is 0, a random port is assigned by the operating system.
cluster.worker_max_count
    Environment variable: SAIL_CLUSTER__WORKER_MAX_COUNT
    Default: 0
    The maximum number of workers that can be launched.
cluster.worker_max_idle_time_secs
    Environment variable: SAIL_CLUSTER__WORKER_MAX_IDLE_TIME_SECS
    Default: 60
    The maximum idle time in seconds before a worker is removed.
cluster.worker_stream_buffer
    Environment variable: SAIL_CLUSTER__WORKER_STREAM_BUFFER
    Default: 16
    The number of batches to buffer in the worker shuffle stream.
cluster.worker_task_slots
    Environment variable: SAIL_CLUSTER__WORKER_TASK_SLOTS
    Default: 8
    The maximum number of tasks that can be launched on a worker.
Kubernetes Options
kubernetes.driver_pod_name
    Environment variable: SAIL_KUBERNETES__DRIVER_POD_NAME
    Default: (empty)
    The name of the pod that runs the driver, or empty if the driver pod name is not known. This is used to set owner references for worker pods.
kubernetes.image
    Environment variable: SAIL_KUBERNETES__IMAGE
    Default: sail:latest
    The container image to use for the driver and worker pods.
kubernetes.image_pull_policy
    Environment variable: SAIL_KUBERNETES__IMAGE_PULL_POLICY
    Default: IfNotPresent
    The image pull policy for the driver and worker pods.
kubernetes.namespace
    Environment variable: SAIL_KUBERNETES__NAMESPACE
    Default: default
    The Kubernetes namespace in which the driver and worker pods will be created.
kubernetes.worker_pod_name_prefix
    Environment variable: SAIL_KUBERNETES__WORKER_POD_NAME_PREFIX
    Default: sail-worker-
    The prefix of the name of worker pods. This should usually end with a hyphen (-).
kubernetes.worker_pod_template
    Environment variable: SAIL_KUBERNETES__WORKER_POD_TEMPLATE
    Default: (empty)
    If non-empty, a JSON string representing a Kubernetes Pod template with the schema of PodTemplateSpec in the Kubernetes API.
kubernetes.worker_service_account_name
    Environment variable: SAIL_KUBERNETES__WORKER_SERVICE_ACCOUNT_NAME
    Default: default
    The name of the service account to use for the worker pods.
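Because kubernetes.worker_pod_template takes a JSON string, it can be convenient to build the value programmatically. The Python sketch below serializes a small PodTemplateSpec-shaped dictionary with metadata and spec fields. The label and node selector values are illustrative assumptions, and how Sail combines the template with its own worker pod settings is not described here.

```python
import json

# A PodTemplateSpec-shaped document: "metadata" (ObjectMeta) and "spec" (PodSpec).
# The label and node selector below are example values, not Sail requirements.
pod_template = {
    "metadata": {"labels": {"team": "data-platform"}},
    "spec": {"nodeSelector": {"kubernetes.io/arch": "amd64"}},
}

# Compact JSON keeps the environment variable value on one line.
value = json.dumps(pod_template, separators=(",", ":"))
env = {"SAIL_KUBERNETES__WORKER_POD_TEMPLATE": value}
print(env["SAIL_KUBERNETES__WORKER_POD_TEMPLATE"])
```

Generating the string from a dictionary avoids quoting mistakes that are easy to make when writing JSON by hand inside a shell or a manifest.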
Runtime Options
runtime.enable_secondary
    Environment variable: SAIL_RUNTIME__ENABLE_SECONDARY
    Default: true
    Use a secondary Tokio runtime to separate object storage I/O tasks from other operations.
runtime.stack_size
    Environment variable: SAIL_RUNTIME__STACK_SIZE
    Default: 8388608
    The stack size in bytes for each thread.
Spark Options
spark.execution_heartbeat_interval_secs
    Environment variable: SAIL_SPARK__EXECUTION_HEARTBEAT_INTERVAL_SECS
    Default: 15
    The interval in seconds at which the server sends an empty response to the client during long-running operations. The empty response serves as a heartbeat to keep the session active.
spark.session_timeout_secs
    Environment variable: SAIL_SPARK__SESSION_TIMEOUT_SECS
    Default: 900
    The duration in seconds for which the session is allowed to be idle. If the server does not receive any requests from the client within this duration, the session is removed from the server.
Execution Options
execution.batch_size
    Environment variable: SAIL_EXECUTION__BATCH_SIZE
    Default: 8192
    The batch size for physical plan execution.
execution.collect_statistics
    Environment variable: SAIL_EXECUTION__COLLECT_STATISTICS
    Default: true
    Whether to collect statistics when a table is first created. This can slow down the initial DataFrame creation while greatly accelerating queries with certain filters. This option has no effect after the table is created.
execution.file_listing_cache.max_entries
    Environment variable: SAIL_EXECUTION__FILE_LISTING_CACHE__MAX_ENTRIES
    Default: 10000
    The maximum number of directory listings to cache. This setting is only effective when the cache is enabled. Setting the value to 0 disables the limit. This setting can only be configured at startup and cannot be changed at runtime.
execution.file_listing_cache.ttl
    Environment variable: SAIL_EXECUTION__FILE_LISTING_CACHE__TTL
    Default: 1800
    The time-to-live (TTL) in seconds for cached directory listings. Entries expire after this duration from when they were cached, ensuring eventual consistency with the storage system. This setting is only effective when the cache is enabled. Setting the value to 0 disables the TTL. This setting can only be configured at startup and cannot be changed at runtime.
execution.file_listing_cache.type
    Environment variable: SAIL_EXECUTION__FILE_LISTING_CACHE__TYPE
    Default: none
    The type of cache for file metadata when listing files. The cache avoids repeatedly listing file metadata, which may be expensive in certain situations (e.g., when using remote object storage). When the cache is used, updates to the underlying location may not be visible until the cache entry expires (controlled by execution.file_listing_cache.ttl). Valid values are none, global (for a global cache), and session (for a per-session cache).
execution.use_row_number_estimates_to_optimize_partitioning
    Environment variable: SAIL_EXECUTION__USE_ROW_NUMBER_ESTIMATES_TO_OPTIMIZE_PARTITIONING
    Default: false
    Whether Sail uses row number estimates at the input to decide whether increasing parallelism is beneficial. By default, only exact row numbers (not estimates) are used for this decision.
Parquet Options
parquet.allow_single_file_parallelism
    Environment variable: SAIL_PARQUET__ALLOW_SINGLE_FILE_PARALLELISM
    Default: true
    (Writing) Whether to parallelize writing for each single Parquet file. If the value is true, the columns of each row group in each file are serialized in parallel.
parquet.binary_as_string
    Environment variable: SAIL_PARQUET__BINARY_AS_STRING
    Default: false
    (Reading) Whether to read binary columns as string columns when reading Parquet files. If the value is true, the Parquet reader will read columns of the Binary or LargeBinary type as the Utf8 type, and columns of the BinaryView type as the Utf8View type. This is helpful when reading Parquet files generated by some legacy writers, which do not correctly set the UTF-8 flag for strings, causing string columns to be loaded as binary columns by default.
parquet.bloom_filter_fpp
    Environment variable: SAIL_PARQUET__BLOOM_FILTER_FPP
    Default: 0.05
    (Writing) The false positive probability for bloom filters when writing Parquet files.
parquet.bloom_filter_ndv
    Environment variable: SAIL_PARQUET__BLOOM_FILTER_NDV
    Default: 1000000
    (Writing) The number of distinct values for bloom filters when writing Parquet files.
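The false positive probability and the number of distinct values together determine how large a bloom filter needs to be. As a rough intuition only, the Python sketch below applies the textbook formula for a classic bloom filter. Parquet uses split block bloom filters, and the Parquet writer's own sizing may differ, so these numbers are an approximation and not what Sail computes. The function classic_bloom_filter_size is a hypothetical helper.

```python
import math


def classic_bloom_filter_size(ndv: int, fpp: float) -> tuple:
    """Return (bits, hash_count) for a classic bloom filter (textbook formula)."""
    if ndv <= 0:
        raise ValueError("ndv must be positive")
    if not 0 < fpp < 1:
        raise ValueError("fpp must be between 0 and 1")
    # m = -n * ln(p) / (ln 2)^2
    bits = math.ceil(-ndv * math.log(fpp) / (math.log(2) ** 2))
    # k = (m / n) * ln 2, rounded to a whole number of hash functions
    hashes = max(1, round(bits / ndv * math.log(2)))
    return bits, hashes


bits, hashes = classic_bloom_filter_size(ndv=1_000_000, fpp=0.05)
print(f"about {bits / 8 / 1000:.0f} kB with {hashes} hash functions")
```

Lowering the false positive probability or raising the distinct value count increases the filter size, which is the trade-off these two options control.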
parquet.bloom_filter_on_read
    Environment variable: SAIL_PARQUET__BLOOM_FILTER_ON_READ
    Default: true
    (Reading) Whether to use available bloom filters when reading Parquet files.
parquet.bloom_filter_on_write
    Environment variable: SAIL_PARQUET__BLOOM_FILTER_ON_WRITE
    Default: false
    (Writing) Whether to write bloom filters for all columns when writing Parquet files.
parquet.column_index_truncate_length
    Environment variable: SAIL_PARQUET__COLUMN_INDEX_TRUNCATE_LENGTH
    Default: 64
    (Writing) The column index truncate length for the Parquet writer.
parquet.compression
    Environment variable: SAIL_PARQUET__COMPRESSION
    Default: zstd(3)
    (Writing) The default Parquet compression codec. Valid values are uncompressed, snappy, gzip(level), lzo, brotli(level), lz4, zstd(level), and lz4_raw, where level is an integer defining the compression level. These values are not case-sensitive.
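The accepted shapes of the compression value can be checked with a small validator before starting Sail. The Python sketch below is a hypothetical helper based only on the list of valid values above. It assumes that gzip, brotli, and zstd always require a level and that the other codecs never take one; Sail's own parser may be more lenient or may restrict the range of levels.

```python
import re

# Codec names taken from the list of valid values above.
CODECS_WITH_LEVEL = {"gzip", "brotli", "zstd"}
CODECS_WITHOUT_LEVEL = {"uncompressed", "snappy", "lzo", "lz4", "lz4_raw"}
_PATTERN = re.compile(r"^([a-z0-9_]+)(?:\((\d+)\))?$")


def parse_compression(value: str) -> tuple:
    """Split a value such as 'zstd(3)' into (codec, level); level is None if absent."""
    match = _PATTERN.match(value.strip().lower())  # values are not case-sensitive
    if match is None:
        raise ValueError(f"malformed compression value: {value!r}")
    codec, level = match.group(1), match.group(2)
    if codec in CODECS_WITH_LEVEL:
        if level is None:
            raise ValueError(f"{codec} requires a level, e.g. {codec}(3)")
        return codec, int(level)
    if codec in CODECS_WITHOUT_LEVEL:
        if level is not None:
            raise ValueError(f"{codec} does not take a level")
        return codec, None
    raise ValueError(f"unknown compression codec: {codec}")


print(parse_compression("ZSTD(3)"))  # prints ('zstd', 3)
print(parse_compression("Snappy"))   # prints ('snappy', None)
```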
parquet.data_page_row_count_limit
    Environment variable: SAIL_PARQUET__DATA_PAGE_ROW_COUNT_LIMIT
    Default: 20000
    (Writing) The best-effort maximum number of rows in a data page for the Parquet writer.
parquet.data_page_size_limit
    Environment variable: SAIL_PARQUET__DATA_PAGE_SIZE_LIMIT
    Default: 1048576
    (Writing) The best-effort maximum size of a data page in bytes.
parquet.dictionary_enabled
    Environment variable: SAIL_PARQUET__DICTIONARY_ENABLED
    Default: true
    (Writing) Whether to enable dictionary encoding for the Parquet writer.
parquet.dictionary_page_size_limit
    Environment variable: SAIL_PARQUET__DICTIONARY_PAGE_SIZE_LIMIT
    Default: 1048576
    (Writing) The best-effort maximum dictionary page size in bytes for the Parquet writer.
parquet.enable_page_index
    Environment variable: SAIL_PARQUET__ENABLE_PAGE_INDEX
    Default: true
    (Reading) Whether to enable page index when reading Parquet files. If the value is true, the Parquet reader reads the page index if present. This can reduce I/O and the number of rows decoded.
parquet.encoding
    Environment variable: SAIL_PARQUET__ENCODING
    Default: (empty)
    (Writing) The default encoding for any column. Valid values are plain, plain_dictionary, rle, bit_packed (deprecated), delta_binary_packed, delta_length_byte_array, delta_byte_array, rle_dictionary, and byte_stream_split. These values are not case-sensitive. An empty value can also be used, which allows the Parquet writer to choose the encoding for each column to achieve good performance.
parquet.file_metadata_cache.size_limit
    Environment variable: SAIL_PARQUET__FILE_METADATA_CACHE__SIZE_LIMIT
    Default: 0
    (Reading) The maximum size in bytes for the Parquet metadata cache. Setting the value to 0 disables the limit. This setting can only be configured at startup and cannot be changed at runtime.
parquet.file_metadata_cache.ttl
    Environment variable: SAIL_PARQUET__FILE_METADATA_CACHE__TTL
    Default: 1800
    (Reading) The time-to-live (TTL) in seconds for cached Parquet file metadata. Entries expire after this duration from when they were cached, ensuring eventual consistency with the storage system. This setting is only effective when the cache is enabled. Setting the value to 0 disables the TTL. This setting can only be configured at startup and cannot be changed at runtime.
parquet.file_metadata_cache.type
    Environment variable: SAIL_PARQUET__FILE_METADATA_CACHE__TYPE
    Default: global
    (Reading) The type of cache for embedded metadata of Parquet files (footer and page metadata). This setting avoids repeatedly reading metadata, which can offer substantial performance improvements for repeated queries over a large number of files. The cache is automatically invalidated when the underlying file is modified. Valid values are none, global (for a global cache), and session (for a per-session cache).
parquet.file_statistics_cache.max_entries
    Environment variable: SAIL_PARQUET__FILE_STATISTICS_CACHE__MAX_ENTRIES
    Default: 10000
    (Reading) The maximum number of Parquet file statistics entries to cache. This setting is only effective when the cache is enabled. When the limit is reached, the least recently used entries are evicted. Setting the value to 0 disables the limit. This setting can only be configured at startup and cannot be changed at runtime.
parquet.file_statistics_cache.ttl
    Environment variable: SAIL_PARQUET__FILE_STATISTICS_CACHE__TTL
    Default: 1800
    (Reading) The time-to-live (TTL) in seconds for cached Parquet file statistics. Entries expire after this duration from when they were cached, ensuring eventual consistency with the storage system. This setting is only effective when the cache is enabled. Setting the value to 0 disables the TTL. This setting can only be configured at startup and cannot be changed at runtime.
parquet.file_statistics_cache.type
    Environment variable: SAIL_PARQUET__FILE_STATISTICS_CACHE__TYPE
    Default: global
    (Reading) The type of cache for file statistics when reading Parquet files. This setting avoids repeatedly computing statistics, which may be expensive in certain situations (e.g., when using remote object storage). The cache is automatically invalidated when the underlying file is modified. Valid values are none, global (for a global cache), and session (for a per-session cache).
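The interplay between a maximum entry count with least-recently-used eviction and a TTL, as described for the statistics cache, can be sketched in Python as follows. This is a conceptual illustration rather than Sail's implementation: the class TtlLruCache and its parameter names are hypothetical, and invalidation on file modification is not modeled.

```python
import time
from collections import OrderedDict


class TtlLruCache:
    """A small cache with LRU eviction and per-entry TTL; 0 disables either limit."""

    def __init__(self, max_entries=10000, ttl_secs=1800, clock=time.monotonic):
        self.max_entries = max_entries
        self.ttl_secs = ttl_secs
        self._clock = clock
        self._entries = OrderedDict()  # key -> (inserted_at, value)

    def get(self, key):
        item = self._entries.get(key)
        if item is None:
            return None
        inserted_at, value = item
        # Entries expire a fixed duration after they were cached.
        if self.ttl_secs > 0 and self._clock() - inserted_at >= self.ttl_secs:
            del self._entries[key]
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key, value):
        self._entries[key] = (self._clock(), value)
        self._entries.move_to_end(key)
        if self.max_entries > 0:
            while len(self._entries) > self.max_entries:
                self._entries.popitem(last=False)  # evict least recently used


now = [0.0]  # a fake clock so the example does not need to sleep
cache = TtlLruCache(max_entries=2, ttl_secs=10, clock=lambda: now[0])
cache.put("a.parquet", {"rows": 10})
cache.put("b.parquet", {"rows": 20})
cache.get("a.parquet")                # touch a, so b becomes least recently used
cache.put("c.parquet", {"rows": 30})  # exceeds max_entries and evicts b
print(cache.get("b.parquet"))         # prints None
now[0] = 11.0                         # advance past the TTL
print(cache.get("a.parquet"))         # prints None
```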
parquet.max_row_group_size
    Environment variable: SAIL_PARQUET__MAX_ROW_GROUP_SIZE
    Default: 1048576
    (Writing) The target maximum number of rows in each row group for the Parquet writer. Larger row groups require more memory to write, but can get better compression and be faster to read.
parquet.maximum_buffered_record_batches_per_stream
    Environment variable: SAIL_PARQUET__MAXIMUM_BUFFERED_RECORD_BATCHES_PER_STREAM
    Default: 16
    (Writing) The maximum number of buffered record batches per stream for the Parquet writer. This may improve performance when writing large Parquet files, at the expense of higher memory usage.
parquet.maximum_parallel_row_group_writers
    Environment variable: SAIL_PARQUET__MAXIMUM_PARALLEL_ROW_GROUP_WRITERS
    Default: 2
    (Writing) The maximum number of row group writers to use for the Parquet writer. This may improve performance when writing large Parquet files, at the expense of higher memory usage.
parquet.metadata_size_hint
    Environment variable: SAIL_PARQUET__METADATA_SIZE_HINT
    Default: 0
    (Reading) The metadata size hint in bytes when reading Parquet files. If the value n is greater than 8, the Parquet reader will try to fetch the last n bytes of the Parquet file optimistically. Otherwise, two reads are performed to fetch the metadata. The first read fetches the 8-byte Parquet footer and the second read fetches the metadata, whose length is encoded in the footer.
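The effect of the size hint on the number of reads can be illustrated with a Python sketch over an in-memory stand-in for a Parquet file. A Parquet file ends with a 4-byte little-endian metadata length followed by the 4-byte magic PAR1. The class CountingReader and the function read_metadata are hypothetical, the metadata bytes are placeholders instead of real Thrift-encoded metadata, and the fallback when the hint is too small is an assumption about reader behavior.

```python
import struct

MAGIC = b"PAR1"
FOOTER_LEN = 8  # 4-byte little-endian metadata length + 4-byte magic


class CountingReader:
    """Wraps bytes and counts range reads, standing in for remote object storage."""

    def __init__(self, data: bytes):
        self.data = data
        self.reads = 0

    def read_range(self, start: int, end: int) -> bytes:
        self.reads += 1
        return self.data[start:end]


def read_metadata(reader: CountingReader, size_hint: int = 0) -> bytes:
    size = len(reader.data)
    if size < FOOTER_LEN:
        raise ValueError("file too small to be a Parquet file")
    # With a hint greater than 8, fetch that many trailing bytes in one read.
    tail_len = size_hint if size_hint > FOOTER_LEN else FOOTER_LEN
    tail_start = max(0, size - tail_len)
    tail = reader.read_range(tail_start, size)
    if tail[-4:] != MAGIC:
        raise ValueError("missing Parquet magic bytes")
    (metadata_len,) = struct.unpack("<I", tail[-8:-4])
    metadata_start = size - FOOTER_LEN - metadata_len
    if metadata_start < 0:
        raise ValueError("invalid metadata length")
    if metadata_start >= tail_start:
        # The optimistic read already covers the metadata.
        offset = metadata_start - tail_start
        return tail[offset:offset + metadata_len]
    # Otherwise a second read fetches the metadata itself.
    return reader.read_range(metadata_start, size - FOOTER_LEN)


metadata = b"m" * 40  # placeholder for the real encoded metadata
file_bytes = MAGIC + b"x" * 100 + metadata + struct.pack("<I", len(metadata)) + MAGIC

for hint in (0, 16, 64):
    reader = CountingReader(file_bytes)
    assert read_metadata(reader, hint) == metadata
    print(f"hint={hint}: {reader.reads} read(s)")
```

In this sketch a hint of 64 bytes covers the 40-byte metadata plus the footer, so one read suffices, while hints of 0 and 16 both need two reads.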
parquet.pruning
    Environment variable: SAIL_PARQUET__PRUNING
    Default: true
    (Reading) Whether to prune row groups when reading Parquet files. If the value is true, the Parquet reader attempts to skip entire row groups based on the predicate in the query and the metadata (minimum and maximum values) stored in the Parquet file.
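Row group pruning can be pictured with a minimal Python sketch: given the minimum and maximum values recorded for a column in each row group, a predicate such as value > 100 rules out every row group whose maximum is at most 100. The RowGroupStats type and the function prune_greater_than are hypothetical and only handle this single predicate shape; a real reader supports far more predicate types.

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    index: int
    min_value: int
    max_value: int


def prune_greater_than(row_groups, threshold):
    """Keep only row groups that may contain rows with value > threshold."""
    return [rg for rg in row_groups if rg.max_value > threshold]


row_groups = [
    RowGroupStats(index=0, min_value=1, max_value=50),
    RowGroupStats(index=1, min_value=40, max_value=120),
    RowGroupStats(index=2, min_value=200, max_value=300),
]
kept = prune_greater_than(row_groups, 100)
print([rg.index for rg in kept])  # prints [1, 2]
```

Row group 1 is kept even though some of its rows do not match, because the statistics only prove that a row group cannot match, never that every row does.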
parquet.pushdown_filters
    Environment variable: SAIL_PARQUET__PUSHDOWN_FILTERS
    Default: false
    (Reading) Whether to push down filter expressions when reading Parquet files. If the value is true, the Parquet reader applies filter expressions in decoding operations to reduce the number of rows decoded. This optimization is sometimes called "late materialization".
parquet.reorder_filters
    Environment variable: SAIL_PARQUET__REORDER_FILTERS
    Default: false
    (Reading) Whether to reorder filter expressions when reading Parquet files. If the value is true, the Parquet reader reorders filter expressions heuristically in decoding operations to minimize the cost of evaluation. If the value is false, the filters are applied in the same order as written in the query.
parquet.schema_force_view_types
    Environment variable: SAIL_PARQUET__SCHEMA_FORCE_VIEW_TYPES
    Default: true
    (Reading) Whether to force view types for binary and string columns when reading Parquet files. If the value is true, the Parquet reader will read columns of the Utf8 or LargeUtf8 type as the Utf8View type, and columns of the Binary or LargeBinary type as the BinaryView type.
parquet.skip_arrow_metadata
    Environment variable: SAIL_PARQUET__SKIP_ARROW_METADATA
    Default: false
    (Writing) Whether to skip encoding the embedded Arrow metadata when writing Parquet files.
parquet.skip_metadata
    Environment variable: SAIL_PARQUET__SKIP_METADATA
    Default: true
    (Reading) Whether to skip the metadata when reading Parquet files. If the value is true, the Parquet reader skips the optional embedded metadata that may be in the file schema. This can help avoid schema conflicts when querying multiple Parquet files with schemas containing compatible types but different metadata.
parquet.statistics_enabled
    Environment variable: SAIL_PARQUET__STATISTICS_ENABLED
    Default: page
    (Writing) Whether statistics are enabled for any column for the Parquet writer. Valid values are none, chunk, and page. These values are not case-sensitive.
parquet.statistics_truncate_length
    Environment variable: SAIL_PARQUET__STATISTICS_TRUNCATE_LENGTH
    Default: 64
    (Writing) The statistics truncate length for the Parquet writer. If the value is 0, no truncation is applied.
parquet.write_batch_size
    Environment variable: SAIL_PARQUET__WRITE_BATCH_SIZE
    Default: 1024
    (Writing) The Parquet writer batch size in bytes.
parquet.writer_version
    Environment variable: SAIL_PARQUET__WRITER_VERSION
    Default: "1.0"
    (Writing) The Parquet writer version. Valid values are "1.0" and "2.0".
Catalog Options
catalog.default_catalog
    Environment variable: SAIL_CATALOG__DEFAULT_CATALOG
    Default: sail
    The name of the default catalog to use.
catalog.default_database
    Environment variable: SAIL_CATALOG__DEFAULT_DATABASE
    Default: ["default"]
    The name of the default database (namespace) to use. This is a list of strings, where each string is a part of the namespace.
catalog.global_temporary_database
    Environment variable: SAIL_CATALOG__GLOBAL_TEMPORARY_DATABASE
    Default: ["global_temp"]
    The name of the global temporary database (namespace) to use. This is a list of strings, where each string is a part of the namespace. The global temporary database cannot be changed at runtime, so setting the Spark configuration spark.sql.globalTempDatabase has no effect.
catalog.list
    Environment variable: SAIL_CATALOG__LIST
    Default: [{name="sail", type="memory", initial_database=["default"], initial_database_comment="default database"}]
    The list of catalogs to use. Each catalog is defined by a name and a type, along with optional parameters. name is used to refer to the catalog name, and type defines the catalog implementation.
Other Options
optimizer.enable_join_reorder
    Environment variable: SAIL_OPTIMIZER__ENABLE_JOIN_REORDER
    Default: false
    Whether to enable cost-based join reorder in the query optimizer.
