Running Benchmarks
INFO
This is an experimental process. The script used in this page may be removed in the future, when there is a better way to run benchmarks.
You can use the following command to run individual queries from the derived TPC-H benchmark.
python -m pysail.examples.spark.tpch \
--data-path "$BENCHMARK_PATH"/tpch/data \
--query-path "$BENCHMARK_PATH"/tpch/queries \
--query 1You need to run this in a Python virtual environment with PySail and PySpark installed. If you are in the project source directory, you can use Hatch to set up the environment. Run hatch run maturin develop to build and install PySail for the default Hatch environment, and then run hatch shell to enter the environment.
BENCHMARK_PATH is the absolute path of the DataFusion benchmarks repository. Please follow the README in that repository to generate the TPC-H data.
- You can use
--consoleinstead of--query [N]to start a PySpark console and explore the TPC-H data interactively. - You can use
--urlto specify the Spark URL. Uselocalfor Spark local mode (JVM) and usesc://localhost:50051for Spark Connect (the default). - You can use
--helpto see all the supported arguments.
Please use the following command to run the Spark Connect server. It runs the server in release mode without debug logs.
env RUST_LOG=info BENCHMARK=1 scripts/spark-tests/run-server.sh