
The managed Apache Flink service on the LakeSail platform is deprecated.

LakeSail is building Sail, an open-source computation framework in Rust to seamlessly integrate stream-processing, batch-processing, and compute-intensive (AI) workloads. The LakeSail platform will offer the managed solution for Sail. Existing PySpark and Flink SQL workloads can be migrated with ease. Please stay tuned and contact us if you are interested!

Python Application Packaging

This is a follow-up to the Kafka Data Processing in Python tutorial. It demonstrates how to package a Python application with dependencies and submit it to LakeSail.

We use Poetry for Python dependency management and packaging.
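
If Poetry is not installed yet, the official installer from Poetry's documentation is one common way to set it up; the commands below are a minimal sketch, and the exact installation method may differ on your system.

bash
# Install Poetry using the official installer (see https://python-poetry.org/docs/).
curl -sSL https://install.python-poetry.org | python3 -

# Verify that Poetry is available.
poetry --version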

Preparing the Project

Please clone the project repository and start the Docker Compose services by following the Kafka Data Processing in Python tutorial.

Let's first take a look at the pyproject.toml file in the flink-samples/kafka-python project directory. The following sections are particularly important.

toml
[tool.poetry.dependencies]
python = "~3.10"
jinja2 = "^3.1.2"

[tool.poetry.group.dev.dependencies]
isort = "^5.12.0"
black = "^23.9.1"
flake8 = "^6.1.0"
apache-flink = "1.17.1"

You can see that jinja2 is defined as a regular dependency, while apache-flink is defined as a dev dependency. This matters when packaging dependencies as a ZIP file, since we want to exclude the PyFlink library, which is already provided by the Flink Docker image.
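
To double-check which packages belong to each group before packaging, you can ask Poetry directly. The sketch below assumes Poetry 1.2 or later, where dependency groups and the --only option are available.

bash
# List only the runtime (main) dependencies that will end up in the ZIP file.
poetry show --only main

# List only the dev group; apache-flink should appear here and nowhere else.
poetry show --only dev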

Building and Deploying Artifacts

  1. Run poetry build in the project directory. This will generate build artifacts in the dist directory.
  2. Run the following commands to create a ZIP file containing the dependencies of the application.
    bash
    poetry export -o dist/requirements.txt
    pip install -r dist/requirements.txt -t dist/deps
    cd dist/deps
    zip -r ../deps.zip *
    cd -

    INFO

    This ZIP file will not contain apache-flink, which is defined as a dev dependency in pyproject.toml. The PyFlink library will be provided by the Flink Docker image at runtime.

  3. In the MinIO console, upload the following files to the kafka-python bucket. Do not include the directory name (dist/) in the object key. (An optional command-line check and upload are sketched after this list.)
    • dist/kafka_python-0.1.0-py3-none-any.whl
    • dist/deps.zip
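
Before uploading, it is worth verifying that apache-flink did not end up in the exported requirements or the ZIP file. If you prefer the command line over the MinIO console, the MinIO client (mc) can upload the artifacts as well; the alias, endpoint, and credentials below are assumptions based on this tutorial's Docker Compose setup and may need adjusting.

bash
# Confirm that apache-flink is absent from the exported requirements.
grep -i apache-flink dist/requirements.txt || echo "apache-flink is not in requirements.txt"

# Inspect the contents of the dependency ZIP file.
unzip -l dist/deps.zip | head

# Optional: upload the artifacts with the MinIO client instead of the console.
# The endpoint and credentials are assumptions; adjust them to your setup.
mc alias set local http://localhost:19000 minioadmin miniopwd
mc cp dist/kafka_python-0.1.0-py3-none-any.whl dist/deps.zip local/kafka-python/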

Submitting the Application

Run the following command to submit the Flink application to the DEFAULT workspace in your local LakeSail server. The command is similar to the one in Kafka Data Processing in Python; the main difference is that python.files now references the packaged wheel and the dependency ZIP file.

bash
curl -i -X POST \
  "http://localhost:18080/api/flink/v1/workspaces/DEFAULT/applications" \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -H "Authorization: Bearer ${LAKESAIL_API_KEY}" \
  -d @- <<EOF
{
  "name": "kafka-python",
  "runtime": "FLINK_1_17",
  "jars": ["s3://kafka-python/flink-sql-connector-kafka-1.17.1.jar"],
  "plugins": [
    {
      "name": "flink-s3-hadoop",
      "jars": ["/opt/flink/opt/flink-s3-fs-hadoop-1.17.1.jar"]
    }
  ],
  "flinkConfiguration": {
    "fs.s3a.endpoint": "http://minio:19000",
    "fs.s3a.path.style.access": "true",
    "fs.s3a.access.key": "minioadmin",
    "fs.s3a.secret.key": "miniopwd",
    "python.files": "s3://kafka-python/kafka_python-0.1.0-py3-none-any.whl,s3://kafka-python/deps.zip",
    "taskmanager.numberOfTaskSlots": "2"
  },
  "logConfiguration": {},
  "job": {
    "entryClass": "org.apache.flink.client.python.PythonDriver",
    "args": ["-pym", "demo.basic"]
  },
  "jobManagerResource": {
    "cpu": 0.5,
    "memory": "1024m"
  },
  "taskManagerResource": {
    "cpu": 0.5,
    "memory": "1024m"
  }
}
EOF
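
For context, the python.files configuration and the job arguments together correspond to what a plain PyFlink submission would express with the Flink CLI. The sketch below only illustrates this mapping; it assumes a locally configured Flink 1.17 client with the same S3 access settings and is not needed for the LakeSail submission above.

bash
# Rough Flink CLI equivalent of the python.files and job settings above.
# Shown only to illustrate the mapping; assumes a configured Flink 1.17 client.
./bin/flink run \
  -pyfs s3://kafka-python/kafka_python-0.1.0-py3-none-any.whl,s3://kafka-python/deps.zip \
  -pym demo.basic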

You can now follow the same steps as in the Kafka Data Processing in Python tutorial to test the application.

Further Reading

For more details about Python dependency management, please refer to the Submitting PyFlink Jobs guide.