The Lindorm compute engine exposes a RESTful API for submitting Apache Spark Python jobs. Use it to run streaming, batch, machine learning, and graph computing tasks. This guide walks through the full workflow: defining the job, packaging dependencies, uploading files to Object Storage Service (OSS), and submitting the job.
Prerequisites
Before you begin, ensure that you have:
- An activated Lindorm compute engine. See Activate the service.
- An OSS bucket to store your Python project files, runtime environment, and launcher script.
- An AccessKey ID and AccessKey secret with read/write access to the OSS bucket. See Create an AccessKey pair.
- A Linux environment (required for packaging Python binary files compatible with the Lindorm compute engine).
How it works
The Spark Python job submission workflow has four stages:
1. Define the job — structure your Python project with the required entry point files.
2. Package the job — bundle your project code and Python runtime environment separately.
3. Upload the files — store all artifacts in OSS.
4. Submit the job — run the job from the Lindorm console or Data Management Service (DMS).
Step 1: Define the job
Download the sample Spark job package and extract it. The extracted folder is named `lindorm-spark-examples`. Review the `lindorm-spark-examples/python` directory for the reference project layout.
Your project root (your_project in the example) requires three structural changes before it can be submitted.
1. Add `__init__.py`
Create an empty `__init__.py` file in the `your_project` directory. This makes the directory a Python package that `launcher.py` can import.
2. Prepare `main.py`
Open your_project/main.py and make two edits:
Add the project directory to sys.path so imports resolve correctly at runtime:

```python
import os
import sys

current_dir = os.path.abspath(os.path.dirname(__file__))
sys.path.append(current_dir)
```
Wrap your entry logic in a main(argv) function:
```python
def main(argv):
    # Write your job logic here
    pass

if __name__ == "__main__":
    main(sys.argv)
```
For a working example, the following initializes a SparkSession:
```python
import sys

from pyspark.sql import SparkSession

def main(argv):
    spark = SparkSession \
        .builder \
        .appName("PythonImportTest") \
        .getOrCreate()
    print(spark.conf)
    spark.stop()

if __name__ == "__main__":
    main(sys.argv)
```
3. Create `launcher.py`
In the `your_project` root directory, create a file named `launcher.py`. Copy the contents from `lindorm-spark-examples/python/launcher.py`. This file is the entry point that the Lindorm compute engine calls — it adds the project directory to sys.path and calls your main(argv) function.
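Use the file shipped with the sample package rather than writing your own. As a rough sketch of what it does (the package name `your_project` and the `main(argv)` entry point are the ones assumed in the steps above):

```python
# Hypothetical sketch of launcher.py's role; the real file comes from
# lindorm-spark-examples/python and may differ in detail.
import os
import sys


def prepend_to_path(directory):
    """Put a directory at the front of sys.path so its modules are found first."""
    if directory not in sys.path:
        sys.path.insert(0, directory)
    return sys.path[0]


def launch(argv):
    """What launcher.py roughly does: fix sys.path, then delegate to main(argv)."""
    project_dir = os.path.abspath(os.path.dirname(__file__))
    prepend_to_path(project_dir)
    from your_project import main  # assumes the package layout from step 1
    return main.main(argv)
```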
Step 2: Package the job
Packaging produces two separate artifacts: a .zip (or .egg) file containing your project code, and a tar file containing the Python runtime environment.
Package the project code
Compress your_project into a .zip file:
```shell
zip -r your_project.zip your_project
```
Alternatively, create a .egg file. See Building Eggs.
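Because the cluster imports the package by name, it can be worth checking that the archive keeps the `your_project/` prefix. A small sanity-check sketch, assuming the file names used in this guide:

```python
# Verify the archive contains your_project/__init__.py at the top level;
# without it, `from your_project import main` fails on the cluster.
import zipfile


def has_package_init(zip_path, package="your_project"):
    """Return True if the archive contains <package>/__init__.py."""
    with zipfile.ZipFile(zip_path) as zf:
        return f"{package}/__init__.py" in zf.namelist()
```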
Package the Python runtime environment
Use Conda or Virtualenv to package the Python runtime and third-party libraries into a tar file, then pass it via the spark.archives parameter.
| Tool | When to use |
|---|---|
| Conda | Use when your job requires a specific Python version or needs to run on nodes where Python is not pre-installed. Conda bundles the Python interpreter inside the tar file. |
| Virtualenv | Use when the Python version on the cluster nodes already matches your project. Virtualenv does not bundle the interpreter — it relies on the node's existing Python installation. |
Run the packaging step on Linux. The Lindorm compute engine requires Python binary files compiled for Linux.
Example: using Conda
```shell
conda create -y -n pyspark_conda_env -c conda-forge numpy conda-pack
conda activate pyspark_conda_env
conda pack -f -o pyspark_conda_env.tar.gz
```
For other packaging options, see Python Package Management.
Step 3: Upload the files to OSS
Upload all three artifacts to your OSS bucket. See Simple upload.
- `launcher.py` — the entry point created in step 1
- `your_project.zip` (or `.egg`) — the project code package from step 2
- `pyspark_conda_env.tar.gz` — the Python runtime environment from step 2
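If you prefer scripted uploads over the console, a sketch using the oss2 Python SDK (installed with `pip install oss2`) might look like this. The function names and the bucket-root key layout are illustrative, not part of the Lindorm API:

```python
import os


def object_keys(files):
    """Map each local artifact to its OSS object key (base name, bucket root)."""
    return {path: os.path.basename(path) for path in files}


def upload_artifacts(access_key_id, access_key_secret, endpoint, bucket_name, files):
    """Upload the three job artifacts to the bucket root."""
    import oss2  # Alibaba Cloud OSS Python SDK

    auth = oss2.Auth(access_key_id, access_key_secret)
    bucket = oss2.Bucket(auth, endpoint, bucket_name)
    for local_path, key in object_keys(files).items():
        bucket.put_object_from_file(key, local_path)
```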
Step 4: Submit the job
The Lindorm compute engine supports two submission methods:
- Lindorm console — See Manage jobs in the console.
- DMS — See Manage jobs using DMS.
Regardless of the method, configure the following parameters in the configs field of your job submission request.
Runtime environment parameters
Set these three parameters to point the job at your OSS artifacts:
| Parameter | Description | Example |
|---|---|---|
| `spark.submit.pyFiles` | Path to the project .zip, .egg, or .py file in OSS | `oss://testBucketName/your_project.zip` |
| `spark.archives` | Path to the Python runtime tar file. Use `#` to specify the target directory name. | `oss://testBucketName/pyspark_conda_env.tar.gz#environment` |
| `spark.kubernetes.driverEnv.PYSPARK_PYTHON` | Path to the Python executable inside the extracted tar file | `./environment/bin/python` |
OSS access parameters
Set these parameters so the compute engine can read your files from OSS:
| Parameter | Description | Example |
|---|---|---|
| `spark.hadoop.fs.oss.endpoint` | Endpoint of the OSS bucket storing your Python files | `oss-cn-beijing-internal.aliyuncs.com` |
| `spark.hadoop.fs.oss.accessKeyId` | AccessKey ID | `testAccessKeyId` |
| `spark.hadoop.fs.oss.accessKeySecret` | AccessKey secret | `testAccessKeySecret` |
| `spark.hadoop.fs.oss.impl` | Class used to access OSS | `org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem` |
For additional OSS-on-Hadoop parameters, see Hadoop-Aliyun documentation.
Example: assembled `configs` value
The following shows all runtime environment and OSS access parameters combined into a single configs object:
```json
{
    "spark.archives": "oss://testBucketName/pyspark_conda_env.tar.gz#environment",
    "spark.kubernetes.driverEnv.PYSPARK_PYTHON": "./environment/bin/python",
    "spark.submit.pyFiles": "oss://testBucketName/your_project.zip",
    "spark.hadoop.fs.oss.endpoint": "oss-cn-beijing-internal.aliyuncs.com",
    "spark.hadoop.fs.oss.accessKeyId": "<your-access-key-id>",
    "spark.hadoop.fs.oss.accessKeySecret": "<your-access-key-secret>",
    "spark.hadoop.fs.oss.impl": "org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem"
}
```
Replace `<your-access-key-id>` and `<your-access-key-secret>` with your actual AccessKey credentials.
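When submitting programmatically, assembling this object from a few variables avoids copy-paste errors. A small helper sketch — the function name and parameters are illustrative, and the artifact file names match the ones used in this guide:

```python
def build_configs(bucket, endpoint, access_key_id, access_key_secret):
    """Assemble the configs mapping for a Spark Python job submission."""
    base = f"oss://{bucket}"
    return {
        "spark.archives": f"{base}/pyspark_conda_env.tar.gz#environment",
        "spark.kubernetes.driverEnv.PYSPARK_PYTHON": "./environment/bin/python",
        "spark.submit.pyFiles": f"{base}/your_project.zip",
        "spark.hadoop.fs.oss.endpoint": endpoint,
        "spark.hadoop.fs.oss.accessKeyId": access_key_id,
        "spark.hadoop.fs.oss.accessKeySecret": access_key_secret,
        "spark.hadoop.fs.oss.impl": "org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem",
    }
```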
Job diagnostics
After submitting the job, view its status and Spark UI address on the Jobs page. See View a job.
If submission fails, submit a ticket and provide the job ID and Spark UI address to the support team.