PySpark exposes the Spark API in Python and requires a Python runtime to execute jobs. E-MapReduce (EMR) clusters provide a default Python environment. If the default Python version in EMR cannot be used to run your PySpark jobs, you can refer to this topic to package a Python environment that meets your requirements and use it to run PySpark jobs in DataWorks.
Prerequisites
In this topic, the DataWorks workspace and the EMR cluster reside in the same region. The following prerequisites must be met on the DataWorks side and on the EMR side:
- DataWorks side
An EMR Spark node is created in DataWorks to run PySpark jobs, and the spark-submit command is used to submit the jobs.
- EMR side
An EMR environment that includes the following configurations is prepared:
- An EMR cluster. In this example, an EMR on ECS cluster is used.
- Optional: A Python package used for sample verification. You can package a Python environment on your on-premises machine or on an Elastic Compute Service (ECS) instance, or directly download the sample Python package Python 3.7 that is used in this topic. To package a Python environment yourself, make sure that the Docker runtime environment and the Python runtime environment are installed on your on-premises machine or ECS instance. Note: Python 3.7 is used in this topic only for reference, and you can select a Python version based on your business requirements. Because the Python version supported by EMR may differ from the version that you use, Python 3.7 is recommended in this example.
Procedure
- Optional: Prepare the virtual environment that is required to run a Python program. You can directly download the Python 3.7 package, or package a Python environment yourself, as outlined in the sketch below.
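The exact packaging steps depend on your machine and cluster. The following is a minimal sketch, assuming that Docker is installed and that a CentOS 7 base image matches the operating system of your EMR nodes; the Python version (3.7.12), directory names, and the optional numpy installation are illustrative assumptions, not fixed requirements.
# Start a CentOS 7 container with the current directory mounted, so that the
# environment built inside the container ends up on the host.
docker run -it --name python-builder -v "$(pwd)":/build centos:7 bash

# Inside the container: install build tools, then compile Python 3.7 into /build/python3.
yum install -y gcc make wget openssl-devel bzip2-devel libffi-devel zlib-devel
wget https://www.python.org/ftp/python/3.7.12/Python-3.7.12.tgz
tar xzf Python-3.7.12.tgz && cd Python-3.7.12
./configure --prefix=/build/python3
make && make install

# Optionally install the libraries that your jobs need into the packaged environment.
/build/python3/bin/pip3 install numpy
exit

# Back on the host: compress the copied environment. The top-level python3 directory
# must match the PYSPARK_PYTHON path that is used in the spark-submit command later.
zip -r python3.7.zip python3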
- Upload the copied environment. You can upload the copied environment to Hadoop Distributed File System (HDFS) or Object Storage Service (OSS) based on your business requirements.
Note: In this example, the copied environment is uploaded to HDFS. For information about how to upload the copied environment to OSS, see Simple upload.
Run the following command to upload the copied environment to HDFS:
# Upload the copied environment to HDFS.
hdfs dfs -copyFromLocal python3.7.zip /tmp/pyspark
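If you upload to OSS instead, one option is the ossutil command-line tool. A minimal sketch; the bucket name and path below are placeholders, not values from this topic:
# Upload the copied environment to OSS (replace the bucket and path with your own).
ossutil cp python3.7.zip oss://your-bucket/tmp/pyspark/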
- Test and upload the Python code. A minimal test script is sketched below.
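The content of pyspark_test.py depends on your job; the following sketch only verifies the environment. It prints the interpreter path so that you can confirm the packaged Python 3.7 environment is used, reads the JOBOWNER variable that the spark-submit command below sets, and runs a trivial query. The script body is an illustrative assumption, not the fixed sample code for this topic.
# Write a minimal test script to the current directory.
cat > pyspark_test.py <<'EOF'
import os
import sys
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("TestPySpark").getOrCreate()

# Print the interpreter path and version to verify that the packaged
# Python 3.7 environment is used instead of the cluster default.
print("Python executable:", sys.executable)
print("Python version:", sys.version)

# JOBOWNER is set by spark.yarn.appMasterEnv.JOBOWNER in the spark-submit command.
print("Job owner:", os.environ.get("JOBOWNER", "unknown"))

# Run a trivial computation to confirm that Spark works end to end.
spark.range(10).selectExpr("sum(id) AS total").show()
spark.stop()
EOF

# Upload the tested script to HDFS so that spark-submit can reference it.
hdfs dfs -copyFromLocal pyspark_test.py /tmp/pyspark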
- Run the spark-submit command to submit the job on the EMR Spark node in DataWorks.
Note: If you uploaded the Python code to OSS, replace the HDFS paths in the command with the corresponding OSS paths.
On the created EMR Spark node, run the following command to submit the job:
spark-submit --master yarn \
--deploy-mode cluster \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./PYTHONENV/python3/bin/python3.7 \
--conf spark.executorEnv.PYTHONPATH=. \
--conf spark.yarn.appMasterEnv.PYTHONPATH=. \
--conf spark.yarn.appMasterEnv.JOBOWNER=LiuYuQuan \
--archives hdfs://hdfs-cluster/tmp/pyspark/python3.7.zip#PYTHONENV \
## --py-files hdfs://hdfs-cluster/tmp/pyspark/mc_pyspark-0.1.0-py3-none-any.zip \
--driver-memory 4g \
--driver-cores 1 \
--executor-memory 4g \
--executor-cores 1 \
--num-executors 3 \
--name TestPySpark \
hdfs://hdfs-cluster/tmp/pyspark/pyspark_test.py
In this command, the --archives option distributes python3.7.zip to the cluster and unpacks it under the alias PYTHONENV, and spark.yarn.appMasterEnv.PYSPARK_PYTHON points the driver to the interpreter inside that archive. The commented-out --py-files line shows how you can additionally distribute your own Python dependency packages.