Lindorm Distributed Processing System (LDPS) provides a RESTful API that you can use to submit Python-based Spark jobs for stream processing, batch processing, machine learning, and graph computing tasks. This topic describes how to create a job in Python and submit the job to LDPS for execution.
Prerequisites
LDPS is activated for your Lindorm instance. For more information, see Activate LDPS and modify the configurations.
Procedure
Step 1: Define a Python-based Spark job
- Click Sample Spark job to download the demos provided by LDPS.
- Extract the files from the downloaded package. The folder that contains the sample jobs is named lindorm-spark-examples. Go to the lindorm-spark-examples/python directory and view the files in the python folder.
- The root directory of the project is your_project. Create your Python-based Spark job in this directory. The Examples section of this topic walks through the steps in detail.
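As a minimal sketch of the expected layout (the file contents below are illustrative placeholders, not the full demo code), the your_project skeleton can be created like this:

```python
from pathlib import Path

# Minimal sketch of the project layout:
#
# your_project/
# ├── __init__.py   # marks the directory as a Python package
# └── main.py       # entry point of the Spark job
root = Path("your_project")
root.mkdir(exist_ok=True)
(root / "__init__.py").write_text("")
(root / "main.py").write_text(
    'if __name__ == "__main__":\n'
    '    print("job entry point")\n'
)

print(sorted(p.name for p in root.iterdir()))  # → ['__init__.py', 'main.py']
```

The __init__.py file lets the directory be imported as a package after the ZIP file is added to sys.path.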
Step 2: Package the Python-based Spark job
- Package the Python runtime files and the third-party libraries on which the project depends. We recommend that you use Conda or Virtualenv to package the dependencies into a tar package. For more information, see Python Package Management. Important: You must perform this operation in Linux. This ensures that LDPS can recognize the packaged Python binary files.
- Package the files of the project. Package the files of the your_project project to a ZIP or EGG file.
- You can run the following command to package the files of the your_project project to a ZIP file:
```shell
zip -r project.zip your_project
```
- For information about how to package the files of the your_project project into an EGG file, see Building Eggs.
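If the zip utility is not available, the same archive can be produced with Python's standard zipfile module. The following is a self-contained sketch: it creates a dummy your_project directory, then packages it so that the archive unpacks as your_project/..., matching the layout produced by zip -r.

```python
import os
import zipfile

def zip_project(src_dir: str, zip_path: str) -> None:
    """Recursively package src_dir into zip_path, like `zip -r zip_path src_dir`."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for dirpath, _dirnames, filenames in os.walk(src_dir):
            for name in filenames:
                full = os.path.join(dirpath, name)
                # Store paths relative to the parent of src_dir so the archive
                # unpacks as your_project/..., the layout LDPS expects.
                zf.write(full, os.path.relpath(full, os.path.dirname(src_dir) or "."))

# Create a dummy project directory for demonstration purposes.
os.makedirs("your_project", exist_ok=True)
with open(os.path.join("your_project", "main.py"), "w") as f:
    f.write("print('entry point')\n")

zip_project("your_project", "project.zip")
with zipfile.ZipFile("project.zip") as zf:
    print(zf.namelist())  # archive entries, e.g. ['your_project/main.py']
```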
Step 3: Upload the files of the Python-based Spark job
Upload the packaged files, including the ZIP or EGG file of the project and the tar package of the Python runtime environment, to your OSS bucket.
Step 4: Submit the Python-based Spark job
- Submit jobs by using the Lindorm console. For more information, see Manage jobs in the Lindorm console.
- Submit jobs by using Data Management (DMS). For more information, see Use DMS to manage jobs.
- Parameters that are used to specify the runtime environment of the Python-based job. Example:
```json
{
  "spark.archives": "oss://testBucketName/pyspark_conda_env.tar.gz#environment",
  "spark.kubernetes.driverEnv.PYSPARK_PYTHON": "./environment/bin/python",
  "spark.submit.pyFiles": "oss://testBucketName/your_project.zip"
}
```
- To submit the files of the project, configure the spark.submit.pyFiles parameter in the value of the configs parameter. You can set the parameter to the OSS path of the ZIP or EGG file that contains the project files, or to the OSS path of the launcher.py file of the project.
- To submit the tar package that contains the Python runtime files and third-party libraries, configure the spark.archives and spark.kubernetes.driverEnv.PYSPARK_PYTHON parameters in the value of the configs parameter.
- When you configure the spark.archives parameter, append a number sign (#) followed by a directory name to the package path to specify the target directory (targetDir) to which the package is extracted. In the preceding example, the package is extracted to the environment directory.
- The spark.kubernetes.driverEnv.PYSPARK_PYTHON parameter specifies the path of the Python interpreter in the extracted runtime environment, such as ./environment/bin/python.
- Parameters that are required to upload a file to OSS. You must include these parameters in the value of the configs parameter.
Table 1. Parameters

| Parameter | Example | Description |
| --- | --- | --- |
| spark.hadoop.fs.oss.endpoint | oss-cn-beijing-internal.aliyuncs.com | The endpoint of the OSS bucket in which you want to store the file. |
| spark.hadoop.fs.oss.accessKeyId | testAccessKey ID | The AccessKey ID and AccessKey secret that are used for identity authentication. You can create or obtain AccessKey IDs and AccessKey secrets in the Alibaba Cloud Management Console. For more information, see Obtain an AccessKey pair. |
| spark.hadoop.fs.oss.accessKeySecret | testAccessKey Secret | |
| spark.hadoop.fs.oss.impl | Set the value to org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem. | The class that is used to access OSS. |

Note: For more information about the parameters, see Parameters.
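Putting the parameters together, a complete configs value might look like the following sketch. The bucket name, endpoint, and credentials are placeholders that you must replace with your own values:

```json
{
  "spark.archives": "oss://testBucketName/pyspark_conda_env.tar.gz#environment",
  "spark.kubernetes.driverEnv.PYSPARK_PYTHON": "./environment/bin/python",
  "spark.submit.pyFiles": "oss://testBucketName/your_project.zip",
  "spark.hadoop.fs.oss.endpoint": "oss-cn-beijing-internal.aliyuncs.com",
  "spark.hadoop.fs.oss.accessKeyId": "testAccessKey ID",
  "spark.hadoop.fs.oss.accessKeySecret": "testAccessKey Secret",
  "spark.hadoop.fs.oss.impl": "org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem"
}
```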
Examples
- Click Sample Spark job to download the package that contains the demos, and then extract the files from the downloaded package.
- Open the your_project/main.py file and modify the entry point of the Python project.
- Add the path of the your_project directory to sys.path.
```python
import os
import sys

# Make the your_project directory importable.
current_dir = os.path.abspath(os.path.dirname(__file__))
sys.path.append(current_dir)
print("current dir in your_project: %s" % current_dir)
print("sys.path: %s \n" % str(sys.path))
```
- Specify the entry logic in the main.py file. You can use the following sample code to initialize a SparkSession.
```python
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("PythonImportTest") \
    .getOrCreate()
print(spark.conf)
spark.stop()
```
- Package the your_project directory in the python directory into a ZIP file.
```shell
zip -r your_project.zip your_project
```
- In Linux, use Conda to package the files of the Python runtime environment.
```shell
conda create -y -n pyspark_conda_env -c conda-forge numpy conda-pack
conda activate pyspark_conda_env
conda pack -f -o pyspark_conda_env.tar.gz
```
- Upload the your_project.zip file, pyspark_conda_env.tar.gz file, and launcher.py file to OSS.
- Submit the job by using one of the following methods:
- Submit the job by using the Lindorm console. For more information, see Manage jobs in the Lindorm console.
- Submit the job by using Data Management (DMS). For more information, see Use DMS to manage jobs.
Job diagnostics
After the job is submitted, you can view the status of the job and the URL of the Spark web user interface (UI) on the Jobs page. For more information, see View the details of a Spark job. If you encounter problems when you submit the job, submit a ticket and provide the job ID and the URL of the Spark web UI.