Configuration
This page lists all the available configuration options for Sail.
WARNING
The Sail configuration system is not stable yet.
For options with the ⚠ note, breaking changes can happen across versions without notice.
For all other options, breaking changes can happen across minor versions for the 0.x releases. Such changes will be documented in the changelog.
INFO
The default value of each option is shown as its string representation that can be used directly in environment variables.
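The option keys and environment variable names listed on this page appear to follow one pattern: the key is upper-cased, each dot becomes a double underscore, and the result is prefixed with SAIL_. The sketch below derives the variable name under that assumption. The helper name env_var_name is hypothetical and is not part of Sail.

```python
def env_var_name(option_key: str) -> str:
    """Derive the environment variable name for a Sail option key.

    Assumption: the naming pattern is inferred from the entries on this page,
    for example cluster.driver_listen_port -> SAIL_CLUSTER__DRIVER_LISTEN_PORT.
    """
    return "SAIL_" + option_key.upper().replace(".", "__")


if __name__ == "__main__":
    for key in ("mode", "cluster.enable_tls", "parquet.compression"):
        print(key, "->", env_var_name(key))
```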
Core Options
mode
SAIL_MODE
local
The running mode for Sail. Valid values are local, local-cluster, and kubernetes-cluster.
- In local mode, Sail runs in a single process, while query execution is still parallelized via threads.
- In local-cluster mode, Sail starts a cluster within a single process. The driver and workers run on different threads in the same process and communicate with each other via RPC.
- In kubernetes-cluster mode, Sail manages a cluster in Kubernetes. The driver and workers run in separate pods and communicate with each other via RPC.
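As a minimal sketch of selecting a mode, the helper below validates the value against the three modes listed above and returns an environment mapping with SAIL_MODE set. The helper name sail_env is hypothetical. How the Sail server process is launched depends on your installation and is not shown.

```python
import os

# The three running modes documented above.
VALID_MODES = ("local", "local-cluster", "kubernetes-cluster")


def sail_env(mode, base=None):
    """Return a copy of the environment with SAIL_MODE set to a valid mode."""
    if mode not in VALID_MODES:
        raise ValueError(f"invalid mode {mode!r}; expected one of {VALID_MODES}")
    env = dict(os.environ if base is None else base)
    env["SAIL_MODE"] = mode
    return env


if __name__ == "__main__":
    env = sail_env("local-cluster")
    print(env["SAIL_MODE"])
```

The returned mapping can be passed as the env argument of subprocess.Popen or subprocess.run when starting the server.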
Cluster Options
cluster.driver_external_host
SAIL_CLUSTER__DRIVER_EXTERNAL_HOST
127.0.0.1
The external host for the worker to connect to the driver.
cluster.driver_external_port
SAIL_CLUSTER__DRIVER_EXTERNAL_PORT
0
The external port for the worker to connect to the driver.
If the value is 0, the port is assumed to be the same as the port on which the driver listens.
cluster.driver_listen_host
SAIL_CLUSTER__DRIVER_LISTEN_HOST
127.0.0.1
The host on which the driver listens.
cluster.driver_listen_port
SAIL_CLUSTER__DRIVER_LISTEN_PORT
0
The port on which the driver listens.
If the value is 0, a random port is assigned by the operating system.
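The two rules for a port value of 0 can be illustrated with a short sketch. The function effective_external_port mirrors the fallback described for the external port, and bind_ephemeral shows the standard operating system behavior of assigning a free port when a socket binds to port 0. Both names are hypothetical and this is not Sail's implementation. The worker port options further down describe the same rules.

```python
import socket


def effective_external_port(external_port: int, listen_port: int) -> int:
    """Fall back to the listen port when the external port is 0."""
    return listen_port if external_port == 0 else external_port


def bind_ephemeral(host: str = "127.0.0.1"):
    """Bind a TCP socket to port 0 and return it with the assigned port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, 0))  # port 0 asks the operating system to pick a free port
    return sock, sock.getsockname()[1]


if __name__ == "__main__":
    sock, listen_port = bind_ephemeral()
    try:
        print("listening on port", listen_port)
        print("external port", effective_external_port(0, listen_port))
    finally:
        sock.close()
```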
cluster.enable_tls
SAIL_CLUSTER__ENABLE_TLS
false
Whether to enable TLS for cluster communication.
cluster.job_output_buffer
SAIL_CLUSTER__JOB_OUTPUT_BUFFER
16
The number of batches to buffer in the job output stream.
cluster.task_launch_timeout_secs
SAIL_CLUSTER__TASK_LAUNCH_TIMEOUT_SECS
300
The timeout in seconds for launching a task.
cluster.worker_external_host
SAIL_CLUSTER__WORKER_EXTERNAL_HOST
127.0.0.1
The external host for other workers to connect to the worker.
cluster.worker_external_port
SAIL_CLUSTER__WORKER_EXTERNAL_PORT
0
The external port for other workers to connect to the worker.
If the value is 0, the port is assumed to be the same as the port on which the worker listens.
cluster.worker_heartbeat_interval_secs
SAIL_CLUSTER__WORKER_HEARTBEAT_INTERVAL_SECS
30
The interval in seconds for worker heartbeats.
cluster.worker_heartbeat_timeout_secs
SAIL_CLUSTER__WORKER_HEARTBEAT_TIMEOUT_SECS
120
The timeout in seconds for worker heartbeats.
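This page does not spell out how the heartbeat interval and the heartbeat timeout interact. One plausible reading, stated here as an assumption, is that a worker sends a heartbeat every interval and is treated as lost once no heartbeat has arrived within the timeout. The sketch below encodes that reading with the default values. The function name is hypothetical.

```python
# Defaults taken from the two options above.
HEARTBEAT_INTERVAL_SECS = 30
HEARTBEAT_TIMEOUT_SECS = 120


def is_worker_timed_out(last_heartbeat: float, now: float,
                        timeout_secs: float = HEARTBEAT_TIMEOUT_SECS) -> bool:
    """Assumed rule: a worker is lost if its last heartbeat is older than the timeout."""
    return now - last_heartbeat > timeout_secs


if __name__ == "__main__":
    # Under this reading, the defaults leave room for several missed heartbeats.
    print(HEARTBEAT_TIMEOUT_SECS // HEARTBEAT_INTERVAL_SECS, "intervals fit in one timeout")
    print(is_worker_timed_out(last_heartbeat=0.0, now=90.0))
    print(is_worker_timed_out(last_heartbeat=0.0, now=121.0))
```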
cluster.worker_initial_count
SAIL_CLUSTER__WORKER_INITIAL_COUNT
4
The initial number of workers to launch.
cluster.worker_launch_timeout_secs
SAIL_CLUSTER__WORKER_LAUNCH_TIMEOUT_SECS
300
The timeout in seconds for launching a worker.
cluster.worker_listen_host
SAIL_CLUSTER__WORKER_LISTEN_HOST
127.0.0.1
The host on which the worker listens.
cluster.worker_listen_port
SAIL_CLUSTER__WORKER_LISTEN_PORT
0
The port on which the worker listens.
If the value is 0, a random port is assigned by the operating system.
cluster.worker_max_count
SAIL_CLUSTER__WORKER_MAX_COUNT
0
The maximum number of workers that can be launched.
cluster.worker_max_idle_time_secs
SAIL_CLUSTER__WORKER_MAX_IDLE_TIME_SECS
120
The maximum idle time in seconds before a worker is removed.
cluster.worker_stream_buffer
SAIL_CLUSTER__WORKER_STREAM_BUFFER
16
The number of batches to buffer in the worker shuffle stream.
cluster.worker_task_slots
SAIL_CLUSTER__WORKER_TASK_SLOTS
8
The maximum number of tasks that can be launched on a worker.
Kubernetes Options
kubernetes.driver_pod_name
SAIL_KUBERNETES__DRIVER_POD_NAME
The name of the pod that runs the driver, or empty if the driver pod name is not known. This is used to set owner references for worker pods.
kubernetes.image
SAIL_KUBERNETES__IMAGE
sail:latest
The container image to use for the driver and worker pods.
kubernetes.image_pull_policy
SAIL_KUBERNETES__IMAGE_PULL_POLICY
IfNotPresent
The image pull policy for the driver and worker pods.
kubernetes.namespace
SAIL_KUBERNETES__NAMESPACE
default
The Kubernetes namespace in which the driver and worker pods will be created.
kubernetes.worker_pod_name_prefix
SAIL_KUBERNETES__WORKER_POD_NAME_PREFIX
sail-worker-
The prefix of the name of worker pods.
This should usually end with a hyphen (-).
kubernetes.worker_service_account_name
SAIL_KUBERNETES__WORKER_SERVICE_ACCOUNT_NAME
default
The name of the service account to use for the worker pods.
Runtime Options
runtime.enable_secondary
SAIL_RUNTIME__ENABLE_SECONDARY
false
Whether to enable a secondary Tokio runtime for separating I/O and compute tasks.
runtime.stack_size
SAIL_RUNTIME__STACK_SIZE
8388608
The stack size in bytes for each thread.
Spark Options
spark.session_timeout_secs
SAIL_SPARK__SESSION_TIMEOUT_SECS
300
The Spark session timeout in seconds.
Execution Options
execution.batch_size
SAIL_EXECUTION__BATCH_SIZE
16384
The batch size for physical plan execution.
Parquet Options
parquet.allow_single_file_parallelism
SAIL_PARQUET__ALLOW_SINGLE_FILE_PARALLELISM
true
(Writing) Whether to parallelize writing for each single Parquet file.
If the value is true, each column in each row group in each file is serialized in parallel.
parquet.binary_as_string
SAIL_PARQUET__BINARY_AS_STRING
false
(Reading) Whether to read binary columns as string columns when reading Parquet files.
If the value is true, the Parquet reader will read columns of the Binary or LargeBinary type as the Utf8 type, and the BinaryView type as the Utf8View type.
This is helpful when reading Parquet files generated by some legacy writers, which do not correctly set
the UTF-8 flag for strings, causing string columns to be loaded as binary columns by default.
parquet.bloom_filter_fpp
SAIL_PARQUET__BLOOM_FILTER_FPP
0.05
(Writing) The false positive probability for bloom filters when writing Parquet files.
parquet.bloom_filter_ndv
SAIL_PARQUET__BLOOM_FILTER_NDV
1000000
(Writing) The number of distinct values for bloom filters when writing Parquet files.
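To give a feel for how the false positive probability and the number of distinct values trade off against filter size, the sketch below uses the classic Bloom filter sizing formula, m = -n ln(p) / (ln 2)^2 bits. This is an assumption for illustration only: Parquet uses split-block Bloom filters, and the size the writer actually chooses may differ. The function name is hypothetical.

```python
import math


def classic_bloom_filter_bits(ndv: int, fpp: float) -> int:
    """Estimate the bit count of a classic Bloom filter.

    ndv is the expected number of distinct values and fpp is the target
    false positive probability. This is the textbook estimate, not the
    sizing rule of the Parquet writer.
    """
    if ndv <= 0 or not 0.0 < fpp < 1.0:
        raise ValueError("ndv must be positive and fpp must be in (0, 1)")
    return math.ceil(-ndv * math.log(fpp) / (math.log(2) ** 2))


if __name__ == "__main__":
    bits = classic_bloom_filter_bits(1_000_000, 0.05)
    print(f"about {bits / 8 / 1024:.0f} KiB for the default settings")
```

Lowering the false positive probability or raising the number of distinct values increases the estimate.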
parquet.bloom_filter_on_read
SAIL_PARQUET__BLOOM_FILTER_ON_READ
true
(Reading) Whether to use available bloom filters when reading Parquet files.
parquet.bloom_filter_on_write
SAIL_PARQUET__BLOOM_FILTER_ON_WRITE
false
(Writing) Whether to write bloom filters for all columns when writing Parquet files.
parquet.compression
SAIL_PARQUET__COMPRESSION
zstd(3)
(Writing) The default Parquet compression codec.
Valid values are uncompressed, snappy, gzip(level), lzo, brotli(level), lz4, zstd(level), and lz4_raw, where level is an integer defining the compression level.
These values are not case-sensitive.
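The value format can be checked before it is placed in SAIL_PARQUET__COMPRESSION. The sketch below parses a value into a codec name and an optional level. It assumes that gzip, brotli, and zstd always take a level and that the other codecs never do, which is how the list above reads. It does not validate the range of the level. The function name is hypothetical and this is not Sail's parser.

```python
import re

# Codecs grouped by whether the documented form carries a level.
CODECS_WITH_LEVEL = {"gzip", "brotli", "zstd"}
CODECS_WITHOUT_LEVEL = {"uncompressed", "snappy", "lzo", "lz4", "lz4_raw"}

_PATTERN = re.compile(r"^([a-z0-9_]+)(?:\((-?\d+)\))?$")


def parse_compression(value: str):
    """Parse a value such as 'zstd(3)' into (codec, level or None)."""
    match = _PATTERN.match(value.strip().lower())  # values are not case-sensitive
    if match is None:
        raise ValueError(f"malformed compression value: {value!r}")
    codec, level = match.group(1), match.group(2)
    if codec in CODECS_WITH_LEVEL and level is not None:
        return codec, int(level)
    if codec in CODECS_WITHOUT_LEVEL and level is None:
        return codec, None
    raise ValueError(f"unsupported compression value: {value!r}")


if __name__ == "__main__":
    print(parse_compression("ZSTD(3)"))
    print(parse_compression("snappy"))
```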
parquet.data_page_row_count_limit
SAIL_PARQUET__DATA_PAGE_ROW_COUNT_LIMIT
20000
(Writing) The best-effort maximum number of rows in a data page for the Parquet writer.
parquet.dictionary_enabled
SAIL_PARQUET__DICTIONARY_ENABLED
true
(Writing) Whether to enable dictionary encoding for the Parquet writer.
parquet.dictionary_page_size_limit
SAIL_PARQUET__DICTIONARY_PAGE_SIZE_LIMIT
1048576
(Writing) The best-effort maximum dictionary page size in bytes for the Parquet writer.
parquet.enable_page_index
SAIL_PARQUET__ENABLE_PAGE_INDEX
true
(Reading) Whether to enable page index when reading Parquet files.
If the value is true, the Parquet reader reads the page index if present.
This can reduce I/O and the number of rows decoded.
parquet.max_row_group_size
SAIL_PARQUET__MAX_ROW_GROUP_SIZE
1048576
(Writing) The target maximum number of rows in each row group for the Parquet writer.
Larger row groups require more memory to write, but can get better compression and be faster to read.
parquet.maximum_buffered_record_batches_per_stream
SAIL_PARQUET__MAXIMUM_BUFFERED_RECORD_BATCHES_PER_STREAM
32
(Writing) The maximum number of buffered record batches per stream for the Parquet writer.
This may improve performance when writing large Parquet files, at the expense of higher memory usage.
parquet.maximum_parallel_row_group_writers
SAIL_PARQUET__MAXIMUM_PARALLEL_ROW_GROUP_WRITERS
2
(Writing) The maximum number of row group writers to use for the Parquet writer.
This may improve performance when writing large Parquet files, at the expense of higher memory usage.
parquet.metadata_size_hint
SAIL_PARQUET__METADATA_SIZE_HINT
0
(Reading) The metadata size hint in bytes when reading Parquet files.
If the value n is greater than 8, the Parquet reader will try to fetch the last n bytes of the Parquet file optimistically. Otherwise, two reads are performed to fetch the metadata. The first read fetches the 8-byte Parquet footer and the second read fetches the metadata, whose length is encoded in the footer.
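The effect of the size hint can be sketched against the Parquet file layout, in which a file ends with the metadata, a 4-byte little-endian metadata length, and the magic bytes PAR1. The function below operates on an in-memory byte string and counts how many reads would be needed. Falling back to a second read when the hint is too small is an assumption, and the function name is hypothetical. This is not Sail's reader.

```python
import struct

FOOTER_SIZE = 8  # 4-byte metadata length followed by the 4-byte magic
MAGIC = b"PAR1"


def read_metadata(data: bytes, size_hint: int = 0):
    """Return (metadata_bytes, number_of_reads) for a Parquet-like byte string."""
    if len(data) < FOOTER_SIZE:
        raise ValueError("file is too small to contain a footer")
    # First read: the last size_hint bytes if the hint is usable, else the footer only.
    tail_len = size_hint if size_hint > FOOTER_SIZE else FOOTER_SIZE
    tail_len = min(tail_len, len(data))
    tail = data[len(data) - tail_len:]
    reads = 1
    if tail[-4:] != MAGIC:
        raise ValueError("missing PAR1 magic at end of file")
    (metadata_len,) = struct.unpack("<I", tail[-8:-4])
    if metadata_len + FOOTER_SIZE <= len(tail):
        # The optimistic read already covers the metadata.
        start = len(tail) - FOOTER_SIZE - metadata_len
        return tail[start:len(tail) - FOOTER_SIZE], reads
    # Second read: fetch exactly the metadata that precedes the footer.
    reads += 1
    end = len(data) - FOOTER_SIZE
    return data[end - metadata_len:end], reads


if __name__ == "__main__":
    metadata = b"m" * 20
    fake_file = MAGIC + b"x" * 100 + metadata + struct.pack("<I", len(metadata)) + MAGIC
    print(read_metadata(fake_file, size_hint=0)[1], "reads without a hint")
    print(read_metadata(fake_file, size_hint=64)[1], "read with a 64-byte hint")
```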
parquet.pruning
SAIL_PARQUET__PRUNING
true
(Reading) Whether to prune row groups when reading Parquet files.
If the value is true, the Parquet reader attempts to skip entire row groups based on the predicate in the query and the metadata (minimum and maximum values) stored in the Parquet file.
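Row group pruning can be illustrated with a small sketch. Given the minimum and maximum value of a column in each row group, a range predicate can rule out any row group whose value range does not overlap the requested range. The class and function names are hypothetical, and real statistics cover more types and cases (such as nulls) than shown here.

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    """Minimum and maximum of one column within one row group."""
    min_value: int
    max_value: int


def prune_row_groups(row_groups, lower, upper):
    """Return indices of row groups that may contain values in [lower, upper]."""
    return [
        index
        for index, stats in enumerate(row_groups)
        # Keep the row group only if its value range overlaps the predicate range.
        if stats.max_value >= lower and stats.min_value <= upper
    ]


if __name__ == "__main__":
    groups = [RowGroupStats(0, 9), RowGroupStats(10, 19), RowGroupStats(20, 29)]
    # A predicate such as "value BETWEEN 12 AND 15" only needs the second row group.
    print(prune_row_groups(groups, 12, 15))
```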
parquet.pushdown_filters
SAIL_PARQUET__PUSHDOWN_FILTERS
false
(Reading) Whether to push down filter expressions when reading Parquet files.
If the value is true, the Parquet reader applies filter expressions in decoding operations to reduce the number of rows decoded. This optimization is sometimes called "late materialization".
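The idea behind late materialization can be sketched with plain Python lists standing in for encoded columns. The filter column is evaluated first, and the remaining columns are materialized only for the rows that passed. All names here are hypothetical, and this is a conceptual illustration rather than the reader's actual decoding path.

```python
def scan_with_pushdown(columns, filter_column, predicate, projected):
    """Evaluate the predicate on one column, then fetch other columns for matches only.

    columns maps a column name to its list of values.
    """
    # Step 1: decode only the filter column and record the matching row positions.
    selected = [i for i, value in enumerate(columns[filter_column]) if predicate(value)]
    # Step 2: materialize the projected columns for the selected rows only.
    return {name: [columns[name][i] for i in selected] for name in projected}


if __name__ == "__main__":
    table = {"id": [1, 2, 3, 4], "name": ["a", "b", "c", "d"]}
    print(scan_with_pushdown(table, "id", lambda v: v % 2 == 0, ["name"]))
```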
parquet.reorder_filters
SAIL_PARQUET__REORDER_FILTERS
false
(Reading) Whether to reorder filter expressions when reading Parquet files.
If the value is true, the Parquet reader reorders filter expressions heuristically in decoding operations to minimize the cost of evaluation. If the value is false, the filters are applied in the same order as written in the query.
parquet.schema_force_view_types
SAIL_PARQUET__SCHEMA_FORCE_VIEW_TYPES
true
(Reading) Whether to force view types for binary and string columns when reading Parquet files.
If the value is true, the Parquet reader will read columns of the Utf8 or LargeUtf8 types as the Utf8View type, and the Binary or LargeBinary types as the BinaryView type.
parquet.skip_arrow_metadata
SAIL_PARQUET__SKIP_ARROW_METADATA
false
(Writing) Whether to skip encoding the embedded arrow metadata when writing Parquet files.
parquet.skip_metadata
SAIL_PARQUET__SKIP_METADATA
true
(Reading) Whether to skip the metadata when reading Parquet files.
If the value is true, the Parquet reader skips the optional embedded metadata that may be in the file schema. This can help avoid schema conflicts when querying multiple Parquet files with schemas containing compatible types but different metadata.
parquet.statistics_enabled
SAIL_PARQUET__STATISTICS_ENABLED
page
(Writing) Whether statistics are enabled for any column for the Parquet writer.
Valid values are none, chunk, and page.
These values are not case-sensitive.
parquet.write_batch_size
SAIL_PARQUET__WRITE_BATCH_SIZE
1024
(Writing) The Parquet writer batch size in bytes.