Running Spark Tests
The Spark tests are unit tests collected from the Spark project and are used to measure Spark feature parity for Sail.
The Spark project itself uses the Python unittest module to run the tests. We have developed custom scripts to run the tests using pytest instead.
Preparing the Test Environment
Before running the Spark tests, you need to install the patched PySpark package into the test-spark matrix environments, which support testing against multiple Spark versions.
hatch run test-spark.spark-4.0.0:install-pyspark
hatch run test-spark.spark-3.5.5:install-pyspark
You can choose to run any or all of the commands above, depending on whether you have built the patched PySpark package for the corresponding Spark version.
When you run the command, the environment will be created automatically if it does not exist.
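If you want to confirm that the patched PySpark package is actually installed in an environment, a quick sanity check (not part of the project's scripts, shown here only as an illustration) is to print the PySpark version from that environment:
hatch run test-spark.spark-4.0.0:python -c "import pyspark; print(pyspark.__version__)"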
Running the Spark Connect Server
Use the following command to build and run the Sail Spark Connect server.
hatch run scripts/spark-tests/run-server.sh
Running the Tests
Once the Spark Connect server is running, open another terminal and use the following command to run the Spark tests.
hatch run test-spark.spark-4.0.0:scripts/spark-tests/run-tests.sh
The command runs a default set of test suites for Spark Connect. Each test suite will write its <suite>.jsonl and <suite>.log files to the log directory, where <suite> is the test suite name.
The test logs will be written to tmp/spark-tests/<name>, where <name> is defined by the TEST_RUN_NAME environment variable, whose default value is latest.
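For example, with the default TEST_RUN_NAME, you can inspect the output of a run like this (the <suite> placeholder stands for whichever suite names were actually written to the directory):
ls tmp/spark-tests/latest/
tail -n 20 tmp/spark-tests/latest/<suite>.log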
Test Selection
You can pass arguments to the script, which will be forwarded to pytest. You can also use PYTEST_ environment variables to customize the test execution. For example, PYTEST_ADDOPTS="-k <expression>" can be used to run specific tests matching <expression>.
hatch run test-spark.spark-4.0.0:env \
TEST_RUN_NAME=selected \
scripts/spark-tests/run-tests.sh \
--pyargs pyspark.sql.tests.connect -v -k "test_sql"
When you customize test execution using the above command, a single test suite will be run, and the test log files are always test.jsonl and test.log in the log directory.
Note that for the above command, the test logs are written to the directory tmp/spark-tests/selected, due to the TEST_RUN_NAME environment variable.
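As mentioned above, the same kind of filtering can also be expressed through the PYTEST_ADDOPTS environment variable instead of script arguments. A sketch, assuming that in this case the default set of test suites is still run with the filter applied to each:
hatch run test-spark.spark-4.0.0:env \
PYTEST_ADDOPTS="-k test_sql" \
scripts/spark-tests/run-tests.sh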
Running the Tests against JVM Spark
For comparison, you can run the tests against the original JVM-based Spark library by setting the SPARK_TESTING_REMOTE_PORT environment variable to an empty string.
hatch run test-spark.spark-4.0.0:env \
SPARK_TESTING_REMOTE_PORT= \
scripts/spark-tests/run-tests.sh
This can be useful for discovering issues in the test setup.
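If you want to keep these results separate from a previous run against the Sail server, you can also set TEST_RUN_NAME so that the logs go to a different directory (the value jvm below is just an example name, producing logs under tmp/spark-tests/jvm):
hatch run test-spark.spark-4.0.0:env \
SPARK_TESTING_REMOTE_PORT= \
TEST_RUN_NAME=jvm \
scripts/spark-tests/run-tests.sh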
Caveat: Environment Mismatch
The steps above start the server in the default Hatch environment. There are a few PySpark UDF tests that would fail in this setup, since they import testing UDFs available only in the patched PySpark library (installed in the test-spark Hatch environment). There are also a few data-dependent tests that would fail, since the data files in the python/test_support directory are only available in the patched PySpark library.
Moreover, when the server is started in the default environment, which has the PySpark 4.0.0 library installed, the tests for PySpark 3.5.5 do not work.
To use the same PySpark library for both the server and the tests, run the server and the tests in the same test-spark environment.
hatch run test-spark.spark-3.5.5:scripts/spark-tests/run-server.sh
hatch run test-spark.spark-3.5.5:scripts/spark-tests/run-tests.sh
However, running the server outside the default environment pollutes the build cache, so you may notice that the server takes longer to build and start.