# Python Application Packaging
This is a follow-up to the Kafka Data Processing in Python tutorial. It demonstrates how to package a Python application with dependencies and submit it to LakeSail.
We use Poetry for Python dependency management and packaging.
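If Poetry is not installed yet, it can be installed with `pipx` (one of the installation methods described in the Poetry documentation):

```bash
# Install Poetry into an isolated environment (assumes pipx is available).
pipx install poetry

# Confirm the installation.
poetry --version
```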
## Preparing the Project
Please clone the project repository and start the Docker Compose services by following the Kafka Data Processing in Python tutorial.
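If you have already cloned the repository, restarting the services typically looks like the sketch below. The assumption that the Compose file lives in the project directory follows the setup from that tutorial; adjust the path if your layout differs.

```bash
# Assumption: the Docker Compose file is located in the project directory.
cd flink-samples/kafka-python

# Start Kafka, MinIO, and the other services in the background.
docker compose up -d

# Verify that the containers are running.
docker compose ps
```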
Let's first take a look at the `pyproject.toml` file in the `flink-samples/kafka-python` project directory. The following sections are particularly important.
```toml
[tool.poetry.dependencies]
python = "~3.10"
jinja2 = "^3.1.2"

[tool.poetry.group.dev.dependencies]
isort = "^5.12.0"
black = "^23.9.1"
flake8 = "^6.1.0"
apache-flink = "1.17.1"
```
You can see that `jinja2` is defined as a regular dependency, while `apache-flink` is defined as a dev dependency. This distinction matters when packaging dependencies as a ZIP file, since we want to exclude the PyFlink library, which is already provided by the Flink Docker image.
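For reference, dependencies are placed in these groups with the `poetry add` command. A minimal sketch (the exact version constraints are illustrative):

```bash
# Add jinja2 as a regular runtime dependency.
poetry add "jinja2@^3.1.2"

# Add apache-flink to the dev group so that it is excluded from the packaged dependencies.
poetry add --group dev "apache-flink@1.17.1"
```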
## Building and Deploying Artifacts
- Run `poetry build` in the project directory. This will generate build artifacts in the `dist` directory.
- Run the following commands to create a ZIP file containing the dependencies of the application.

  ```bash
  poetry export -o dist/requirements.txt
  pip install -r dist/requirements.txt -t dist/deps
  cd dist/deps
  zip -r ../deps.zip *
  cd -
  ```
  > **INFO**
  > This ZIP file will not contain `apache-flink`, which is defined as a dev dependency in `pyproject.toml`. The PyFlink library will be provided by the Flink Docker image at runtime.

- In the MinIO console, upload the following files to the `kafka-python` bucket. Do not include the directory name (`dist/`) in the object key. (Alternatively, see the `mc` sketch after this list for a command-line upload.)
  - `dist/kafka_python-0.1.0-py3-none-any.whl`
  - `dist/deps.zip`
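Before uploading, you can optionally check that the dependency archive really excludes PyFlink, and you can upload the artifacts from the command line with the MinIO client (`mc`) instead of the console. In the sketch below, the alias name `local` and the host-side endpoint `http://localhost:19000` are assumptions; adjust them to match your Docker Compose setup.

```bash
# Optional sanity check: the archive should not contain PyFlink.
unzip -l dist/deps.zip | grep -i pyflink || echo "OK: no PyFlink in deps.zip"

# Upload the artifacts with the MinIO client (alias and endpoint are assumptions).
mc alias set local http://localhost:19000 minioadmin miniopwd
mc cp dist/kafka_python-0.1.0-py3-none-any.whl local/kafka-python/
mc cp dist/deps.zip local/kafka-python/
```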
## Submitting the Application
Run the following command to submit the Flink application to the `DEFAULT` workspace in your local LakeSail server. The command is similar to the one in Kafka Data Processing in Python; the main difference is the `python.files` configuration, which now references the application wheel and the dependency ZIP file uploaded above.
```bash
curl -i -X POST \
  "http://localhost:18080/api/flink/v1/workspaces/DEFAULT/applications" \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -H "Authorization: Bearer ${LAKESAIL_API_KEY}" \
  -d @- <<EOF
{
  "name": "kafka-python",
  "runtime": "FLINK_1_17",
  "jars": ["s3://kafka-python/flink-sql-connector-kafka-1.17.1.jar"],
  "plugins": [
    {
      "name": "flink-s3-hadoop",
      "jars": ["/opt/flink/opt/flink-s3-fs-hadoop-1.17.1.jar"]
    }
  ],
  "flinkConfiguration": {
    "fs.s3a.endpoint": "http://minio:19000",
    "fs.s3a.path.style.access": "true",
    "fs.s3a.access.key": "minioadmin",
    "fs.s3a.secret.key": "miniopwd",
    "python.files": "s3://kafka-python/kafka_python-0.1.0-py3-none-any.whl,s3://kafka-python/deps.zip",
    "taskmanager.numberOfTaskSlots": "2"
  },
  "logConfiguration": {},
  "job": {
    "entryClass": "org.apache.flink.client.python.PythonDriver",
    "args": ["-pym", "demo.basic"]
  },
  "jobManagerResource": {
    "cpu": 0.5,
    "memory": "1024m"
  },
  "taskManagerResource": {
    "cpu": 0.5,
    "memory": "1024m"
  }
}
EOF
```
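Note that `org.apache.flink.client.python.PythonDriver` is Flink's entry class for running Python jobs: the `-pym demo.basic` arguments tell it to execute the `demo.basic` module, which PyFlink resolves from the files listed under `python.files` (the application wheel and the dependency ZIP file).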
You can now follow the same steps as in the Kafka Data Processing in Python tutorial to test the application.
## Further Reading
For more details about Python dependency management, please refer to the Submitting PyFlink Jobs guide.