Why Sail?
Today's cloud environments and data workloads pose challenges not anticipated by solutions developed a decade ago. Organizations choose Sail because it accelerates execution, reduces resource consumption, simplifies data infrastructure, and supports straightforward migration.
Performance
Sail provides predictable performance characteristics across workloads.
Sail uses the Apache Arrow columnar in-memory format, which improves CPU cache utilization and enables vectorized operations via SIMD instructions. This columnar layout offers better performance than the row-oriented data models in Apache Spark or Apache Flink.
Without the JVM, Sail has no GC (garbage collection) overhead during query execution. When Sail adds support for data streaming in the near future, streaming workloads will likewise be free of the latency spikes caused by GC pauses.
Python UDFs (user-defined functions) perform well in Sail. The PyO3 library embeds a Python interpreter in the Sail process. The Arrow format allows zero-copy data sharing between Sail and Python, making your Python code a native part of Sail.
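As a rough illustration of the UDF path described above, the sketch below defines a vectorized pandas UDF in PySpark. It assumes pyspark and pandas are installed and that the session is connected to a Sail server whose version supports pandas UDFs. The names fahrenheit_to_celsius and make_celsius_udf are hypothetical, chosen only for this example.

```python
def fahrenheit_to_celsius(values):
    """Convert Fahrenheit to Celsius.

    Written with plain arithmetic so it accepts a float or a pandas Series.
    """
    return (values - 32) * 5 / 9


def make_celsius_udf():
    """Wrap the conversion as a PySpark pandas UDF (hypothetical helper)."""
    # Imported lazily so the conversion itself has no hard dependencies.
    import pandas as pd
    from pyspark.sql.functions import pandas_udf

    @pandas_udf("double")
    def to_celsius(temps: pd.Series) -> pd.Series:
        # Each call receives a whole batch of values as a pandas Series.
        return fahrenheit_to_celsius(temps)

    return to_celsius
```

Given a DataFrame df with a numeric column temp_f (an assumed column name), the UDF could be applied as df.select(make_celsius_udf()(df["temp_f"])). Because the function operates on whole batches rather than single rows, it fits the Arrow-based data exchange described above.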
Memory Efficiency
Rust's zero-cost abstractions allow for modular Sail internals with a low memory footprint. The Sail process starts within seconds and uses only a few dozen megabytes of memory when idle. This means you can scale Sail workers quickly as the load increases.
JVM tuning is no longer needed. You do not have to account for the memory overhead of JVM objects or squeeze performance out of Spark memory configuration.
In our Benchmark Results, Sail delivers a 4x speed-up over Apache Spark and reduces hardware costs by up to 94% due to the combined effect of shorter query execution times and lower memory usage.
Robustness
Sail benefits from Rust's approach to memory management. The ownership rules and reference lifetimes enforced at compile time eliminate whole categories of memory bugs. Combined with libraries such as Tokio, this gives Sail fearless concurrency: safe async code is a natural part of Sail internals. The result is a correct, performant compute engine that you can trust.
Compatibility
Sail provides a drop-in replacement for Spark SQL and the Spark DataFrame API. Your Spark client session communicates with the Sail server over gRPC via the Spark Connect protocol.
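A minimal sketch of the client side might look like the following. It assumes PySpark 3.4 or later with Spark Connect support and a Sail server already listening on localhost port 50051; the host, the port, and the helper names spark_connect_url and run_example_query are assumptions made for this example.

```python
def spark_connect_url(host: str = "localhost", port: int = 50051) -> str:
    """Build a Spark Connect URL of the form sc://host:port."""
    return f"sc://{host}:{port}"


def run_example_query(url: str):
    """Connect to a Spark Connect endpoint and run a trivial query."""
    # Imported lazily so the URL helper works without PySpark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.remote(url).getOrCreate()
    try:
        # Ordinary Spark SQL; the server behind the URL executes it.
        return spark.sql("SELECT 1 + 1 AS two").collect()
    finally:
        spark.stop()
```

Calling run_example_query(spark_connect_url()) sends the query over gRPC to whichever server the URL points at. The application code is the same as it would be against a Spark Connect server; only the URL decides which engine runs it.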
Sail takes compatibility with Spark seriously. If there is a behavior mismatch between Sail and Spark, we consider it a bug. As you explore the documentation, you will find that Sail already supports the most common Spark use cases. Our supported features keep expanding toward full parity with Spark.
Simplicity
The sail command-line interface (CLI) is the single entry point for all Sail commands. The CLI is available either by installing the pysail Python library or by building the standalone binary from source. You can also use the Python API to start the Sail server within your PySpark code.
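The two ways of starting the server mentioned above can be sketched as follows. The names used here (the sail spark server subcommand with its --port flag, the pysail.spark.SparkConnectServer class, and its listening_address attribute) are assumptions about the pysail interface and should be checked against the installed version. The helper functions themselves are hypothetical.

```python
def sail_server_command(port: int = 50051) -> list:
    """Build the argument list for launching the server from the CLI.

    The subcommand and flag names are assumptions about the sail CLI.
    """
    return ["sail", "spark", "server", "--port", str(port)]


def start_embedded_server():
    """Start a Sail server inside the current Python process."""
    # Imported lazily; requires the pysail package (assumed API).
    from pysail.spark import SparkConnectServer

    server = SparkConnectServer()
    server.start()
    # The server is assumed to report the (host, port) it listens on.
    _, port = server.listening_address
    return server, f"sc://localhost:{port}"
```

The first helper only assembles the command line, which could be passed to subprocess.run or typed into a shell. The second returns both the server object and a Spark Connect URL, so a PySpark session can be pointed at the embedded server in the same script.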
As a unified engine, Sail lets you run ad-hoc SQL queries, execute distributed batch jobs, or preprocess data for AI models within a single environment, removing the need to switch runtimes or move data between systems. We aim for a smooth developer experience as you scale your workloads from your laptop to a production cluster.
