PySpark jobs running on EMR Serverless Spark execute across distributed nodes, so every Python library your code imports must be available on each node — not just on your local machine. Without explicit dependency management, tasks fail with errors like ModuleNotFoundError: No module named 'pandas'.
EMR Serverless Spark supports three methods to distribute Python dependencies. Choose based on your environment and workflow:
| Method | Best for |
|---|---|
| Runtime environments | Reusable, console-managed environments shared across multiple jobs. The system builds and maintains the environment automatically. |
| Conda | Jobs that need a specific Python version or libraries with complex native dependencies. Requires building the environment on a compatible ECS instance. |
| PEX | Lightweight, self-contained packaging of pure-Python dependencies into a single executable file. |
Prerequisites
Before you begin, ensure that you have:
- A workspace. See Create a workspace.
- Python 3.8 or later installed. The examples in this topic use Python 3.8.
Method 1: Use runtime environments
Runtime environments let you define a set of PyPI libraries once in the console. EMR Serverless Spark builds and manages the environment automatically, and you can attach it to any job without re-packaging dependencies.
Step 1: Create a runtime environment
1. Log on to the E-MapReduce console.
2. In the navigation pane, choose EMR Serverless > Spark.
3. Click the name of your workspace.
4. In the navigation pane, click Environment.
5. Click Create Environment.
6. On the Create Environment page, click Add Library. For parameter details, see Manage runtime environments.
7. In the New Library dialog box, set the source type to PyPI, enter the library name and version in the PyPI Package field, and click OK. If you omit the version, the latest version is installed. This example adds two libraries: faker and geopy.
8. Click Create. The system initializes the environment after creation.
Step 2: Upload the script to OSS
1. Click pyspark_third_party_libs_demo.py to download the sample script. The script uses Faker to generate 100 synthetic user records with random coordinates near Paris, uses geopy to calculate the geodesic distance from each user to the Eiffel Tower (48.8584, 2.2945), and filters for users within 10 kilometers. Alternatively, create pyspark_third_party_libs_demo.py with the same logic yourself.
2. Upload pyspark_third_party_libs_demo.py to OSS. See Simple upload.
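If you want to see the shape of the computation before running it on Spark, the core logic of the sample job can be sketched in plain Python. This is a hedged, stdlib-only illustration: the real script uses Faker for user records and geopy's geodesic distance inside a Spark DataFrame, while here a haversine formula and seeded random points stand in, so counts and distances are illustrative only.

```python
import math
import random

EIFFEL = (48.8584, 2.2945)  # latitude, longitude of the Eiffel Tower

def haversine_km(a, b):
    """Great-circle distance in km (stand-in for geopy's geodesic)."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

# Generate 100 random coordinates near Paris, as the sample script
# does with Faker (seeded here so the run is deterministic).
random.seed(0)
users = [(48.8584 + random.uniform(-0.15, 0.15),
          2.2945 + random.uniform(-0.2, 0.2)) for _ in range(100)]

# Keep users within 10 km of the Eiffel Tower.
nearby = [u for u in users if haversine_km(u, EIFFEL) <= 10.0]
print(f"Found {len(nearby)} users within a 10-kilometer range")
```

In the actual PySpark job, the distance calculation runs inside tasks on the executors, which is exactly why faker and geopy must be present in the runtime environment rather than only on your local machine.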
Step 3: Run the job
1. In the navigation pane of the EMR Serverless Spark page, click Development.
2. On the Development tab, click the create icon.
3. In the New dialog box, enter a name, select Application(Batch) > PySpark from the Type drop-down list, and click OK.
4. In the upper-right corner, select a queue.
5. On the job configuration tab, set the following parameters and click Run.
| Parameter | Value |
|---|---|
| Main Python Resources | Select OSS and enter the OSS path of pyspark_third_party_libs_demo.py. Example: `oss://<yourBucketName>/pyspark_third_party_libs_demo.py` |
| Environment | Select the runtime environment you created. |
6. After the job finishes, click Logs in the Actions column under Execution Records.
7. On the Log Exploration tab, open the Stdout tab under Driver Log to view the output. Expected output:
```
Generated sample data:
+--------------------+-------------------+------------------+------------------+
|             user_id|               name|          latitude|         longitude|
+--------------------+-------------------+------------------+------------------+
|73d4565c-8cdf-4bc...|  Garrett Robertson| 48.81845614776422|2.4087517234236064|
|0fc364b1-6759-416...|      Dawn Gonzalez| 48.68654896170054|2.4708555780468013|
|2ab1f0aa-5552-4e1...|Alexander Gallagher| 48.87603770688707|2.1209399987431246|
|1cabbdde-e703-4a8...|       David Morris|48.656356532418116|2.2503952330408175|
|8b7938a0-b283-401...|    Shannon Perkins| 48.82915001905855| 2.410743969589327|
+--------------------+-------------------+------------------+------------------+
only showing top 5 rows

Found 24 users within a 10-kilometer range:
+-----------------+------------------+------------------+-----------+
|             name|          latitude|         longitude|distance_km|
+-----------------+------------------+------------------+-----------+
|Garrett Robertson| 48.81845614776422|2.4087517234236064|   9.490705|
|  Shannon Perkins| 48.82915001905855| 2.410743969589327|   9.131355|
|      Alex Harris| 48.82547383207313|2.3579336032430027|   5.923493|
|      Tammy Ramos| 48.84668267431606|2.3606455536493574|   5.026109|
|   Ivan Christian| 48.89224239228342|2.2811025348668195|  3.8897192|
|  Vernon Humphrey| 48.93142188723839| 2.306957802222233|   8.171813|
|  Shawn Rodriguez|48.919907710882654|2.2270993307836044|   8.439087|
|    Robert Fisher|48.794216103154646|2.3699024070507906|   9.033209|
|  Heather Collier|48.822957591865205|2.2993033803043454|   3.957171|
|       Dawn White|48.877816307255586|2.3743880390928878|   6.246059|
+-----------------+------------------+------------------+-----------+
only showing top 10 rows
```
Method 2: Manage dependencies with Conda
Conda lets you create a reproducible Python environment with a specific Python version and library set, package it as a tarball, and deploy it to each Spark node via the Archive Resources parameter.
Conda and PEX environments must be built on an x86 ECS instance running Alibaba Cloud Linux 3.
Step 1: Build and package the Conda environment
1. Create an Elastic Compute Service (ECS) instance with the following configuration. See Create an instance on the Custom Launch tab.
   - OS: Alibaba Cloud Linux 3
   - Architecture: x86
   - Internet access: enabled

   You can also use an idle node from an existing EMR cluster created on the EMR on ECS page, as long as it has an x86 architecture.
2. Install Miniconda on the instance:

   ```shell
   wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
   chmod +x Miniconda3-latest-Linux-x86_64.sh
   ./Miniconda3-latest-Linux-x86_64.sh -b
   source miniconda3/bin/activate
   ```
3. Create and package a Conda environment with Python 3.8 and NumPy:

   ```shell
   conda create -y -n pyspark_conda_env -c conda-forge conda-pack numpy python=3.8
   conda activate pyspark_conda_env
   conda pack -f -o pyspark_conda_env.tar.gz
   ```

   This produces pyspark_conda_env.tar.gz, which contains the full Python environment.
Step 2: Upload resource files to OSS
1. Download the sample files kmeans.py and kmeans_data.txt.
2. Upload pyspark_conda_env.tar.gz, kmeans.py, and kmeans_data.txt to OSS. See Simple upload.
Step 3: Run the job
1. In the navigation pane of the EMR Serverless Spark page, click Development.
2. On the Development tab, click the create icon.
3. In the New dialog box, enter a name, select Application(Batch) > PySpark from the Type drop-down list, and click OK.
4. In the upper-right corner, select a queue.
5. On the job configuration tab, set the following parameters and click Run.
| Parameter | Value |
|---|---|
| Main Python Resources | Select OSS and enter the OSS path of kmeans.py. Example: `oss://<yourBucketName>/kmeans.py` |
| Execution Parameters | Enter the OSS path of kmeans_data.txt followed by the number of clusters. Format: `oss://<yourBucketName>/kmeans_data.txt 2` |
| Archive Resources | Select OSS and enter the OSS path of pyspark_conda_env.tar.gz. Format: `oss://<yourBucketName>/pyspark_conda_env.tar.gz#condaenv` |
| Spark Configuration | `spark.pyspark.driver.python ./condaenv/bin/python`<br>`spark.pyspark.python ./condaenv/bin/python` |
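The `#condaenv` suffix on the archive path is what ties these settings together: Spark extracts the archive on every node into a directory named by the fragment after `#`, so the relative interpreter paths resolve inside the unpacked environment. Schematically:

```
Archive Resources:            oss://<yourBucketName>/pyspark_conda_env.tar.gz#condaenv
                              # "#condaenv" = extraction directory name on each node
spark.pyspark.driver.python   ./condaenv/bin/python
spark.pyspark.python          ./condaenv/bin/python
```

If you change the fragment name, update both interpreter paths to match.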
6. After the job finishes, click Logs in the Actions column under Execution Records.
7. On the Log Exploration tab, open the Stdout tab under Driver Log to view the output. Expected output:
```
Final centers: [array([0.1, 0.1, 0.1]), array([9.1, 9.1, 9.1])]
Total Cost: 0.11999999999999958
```
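kmeans.py is the classic Spark k-means example, which runs Lloyd's iterations over the input points. As a hedged, stdlib-only sketch of the same computation, the following uses the canonical six-point sample data (an assumption about kmeans_data.txt, inferred from the output above) and reproduces the centers and cost without Spark:

```python
# Lloyd's algorithm on the classic k-means sample data, stdlib only.
data = [
    [0.0, 0.0, 0.0], [0.1, 0.1, 0.1], [0.2, 0.2, 0.2],
    [9.0, 9.0, 9.0], [9.1, 9.1, 9.1], [9.2, 9.2, 9.2],
]

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def closest(p, centers):
    """Index of the center nearest to point p."""
    return min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))

centers = [data[0], data[3]]  # two initial centers, one per obvious cluster
for _ in range(10):           # a few Lloyd's iterations suffice here
    clusters = [[] for _ in centers]
    for p in data:
        clusters[closest(p, centers)].append(p)
    # Recompute each center as the mean of its assigned points.
    centers = [[sum(col) / len(c) for col in zip(*c)] for c in clusters]

cost = sum(sq_dist(p, centers[closest(p, centers)]) for p in data)
print("Final centers:", centers)  # converge to ~[0.1]*3 and ~[9.1]*3
print("Total Cost:", cost)        # ~0.12, matching Total Cost in the job output
```

The Spark version distributes the assignment step across executors, which is why NumPy has to be present in the Conda archive on every node.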
Method 3: Package dependencies with PEX
PEX (Python EXecutable) packages Python libraries into a single self-contained file, which Spark deploys to each node as a file resource.
Match the PEX package versions to your Spark engine version — see Engine versions for the Spark version your workspace uses.
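One hedged way to check the match is to record the interpreter version and CPU architecture on the build host and compare them with the engine version your workspace uses; wheels bundled into a PEX file are specific to both. A minimal stdlib sketch:

```python
import platform
import sys

# The interpreter and architecture used to build the PEX file must match
# the Spark nodes, because bundled wheels are ABI-specific.
build_python = "%d.%d" % sys.version_info[:2]  # e.g. "3.8"
build_arch = platform.machine()                # e.g. "x86_64"
print("Build host Python:", build_python, "on", build_arch)
```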
Step 1: Build the PEX file
1. Create an ECS instance with the following configuration. See Create an instance on the Custom Launch tab.
   - OS: Alibaba Cloud Linux 3
   - Architecture: x86
   - Internet access: enabled

   You can also use an idle node from an existing EMR cluster created on the EMR on ECS page, as long as it has an x86 architecture.
2. Install the PEX and wheel tools:

   ```shell
   pip3.8 install --user pex wheel \
     --trusted-host mirrors.cloud.aliyuncs.com \
     -i http://mirrors.cloud.aliyuncs.com/pypi/simple/
   ```
3. Download wheel files for the target libraries into a local directory:

   ```shell
   pip3.8 wheel -w /tmp/wheel \
     pyspark==3.3.1 pandas==1.5.3 pyarrow==15.0.1 numpy==1.24.4 \
     --trusted-host mirrors.cloud.aliyuncs.com \
     -i http://mirrors.cloud.aliyuncs.com/pypi/simple/
   ```
4. Bundle the wheel files into a PEX file:

   ```shell
   pex -f /tmp/wheel --no-index \
     pyspark==3.3.1 pandas==1.5.3 pyarrow==15.0.1 numpy==1.24.4 \
     -o spark331_pandas153.pex
   ```

   This example targets Spark 3.3.1 and bundles pandas, PyArrow, and NumPy. Adjust the versions to match your Spark engine.
Step 2: Upload resource files to OSS
1. Download the sample files kmeans.py and kmeans_data.txt.
2. Upload spark331_pandas153.pex, kmeans.py, and kmeans_data.txt to OSS. See Simple upload.
Step 3: Run the job
1. In the navigation pane of the EMR Serverless Spark page, click Development.
2. On the Development tab, click the create icon.
3. In the New dialog box, enter a name, select Application(Batch) > PySpark from the Type drop-down list, and click OK.
4. In the upper-right corner, select a queue.
5. On the job configuration tab, set the following parameters and click Run.
| Parameter | Value |
|---|---|
| Main Python Resources | Select OSS and enter the OSS path of kmeans.py. Example: `oss://<yourBucketName>/kmeans.py` |
| Execution Parameters | Enter the OSS path of kmeans_data.txt followed by the number of clusters. Format: `oss://<yourBucketName>/kmeans_data.txt 2` |
| File Resources | Select OSS and enter the OSS path of spark331_pandas153.pex. Example: `oss://<yourBucketName>/spark331_pandas153.pex` |
| Spark Configuration | `spark.pyspark.driver.python ./spark331_pandas153.pex`<br>`spark.pyspark.python ./spark331_pandas153.pex` |
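A PEX file embeds a bootstrap script, so it can be invoked like a Python interpreter; pointing both interpreter settings at the distributed file makes the driver and every executor start Python from inside the bundle. Schematically:

```
File Resources:               oss://<yourBucketName>/spark331_pandas153.pex
                              # distributed to each node's working directory
spark.pyspark.driver.python   ./spark331_pandas153.pex
spark.pyspark.python          ./spark331_pandas153.pex
```

Unlike the Conda method, no archive extraction step is involved: the single .pex file is both the package and the interpreter entry point.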
6. After the job finishes, click Logs in the Actions column under Execution Records.
7. On the Log Exploration tab, open the Stdout tab under Driver Log to view the output. Expected output:
```
Final centers: [array([0.1, 0.1, 0.1]), array([9.1, 9.1, 9.1])]
Total Cost: 0.11999999999999958
```
What's next
This topic uses PySpark batch jobs as examples. To develop other job types, see Develop a batch or streaming job.