E-MapReduce: Introduction to Python environments in PySpark

Last Updated: Mar 25, 2026

E-MapReduce (EMR) DataLake and custom clusters support PySpark with Python 3. The Python version available depends on the EMR version and Spark version you use. This topic covers the version mappings and how to install third-party Python libraries on cluster nodes.

Python version mappings

The following table shows the Python version and executable path for each EMR and Spark version combination.

| EMR version | Spark version | Python version | Python path |
| --- | --- | --- | --- |
| EMR V3.46.0 or a later minor version, or EMR V5.12.0 or a later minor version | Spark 2 | Python 3.6 | /bin/python3.6 |
| EMR V3.46.0 or a later minor version, or EMR V5.12.0 or a later minor version | Spark 3 | Python 3.8 | /bin/python3.8 |
| EMR V3.43.0 to EMR V3.45.1, or EMR V5.9.0 to EMR V5.11.1 | Spark 2 | Python 3.7 | /usr/local/bin/python3.7 |
| EMR V3.43.0 to EMR V3.45.1, or EMR V5.9.0 to EMR V5.11.1 | Spark 3 | Python 3.7 | /usr/local/bin/python3.7 |
| EMR V3.42.0 or EMR V5.8.0 | Spark 2 | Python 3.6 | /bin/python3.6 |
| EMR V3.42.0 or EMR V5.8.0 | Spark 3 | Python 3.6 | /bin/python3.6 |
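
For scripts that need to pick the right interpreter, the mapping above can be encoded as a small lookup. The following is only a sketch: `parse_version` and `python_for` are hypothetical helper names, not part of EMR or PySpark, and the version ranges are copied directly from the table.

```python
from typing import Optional, Tuple


def parse_version(text: str) -> Tuple[int, int, int]:
    """Turn a string such as 'V3.46.0' or '5.12.1' into a tuple of ints.

    Raises ValueError if the string does not have exactly three numeric parts.
    """
    major, minor, patch = (int(p) for p in text.strip().lstrip("Vv").split("."))
    return major, minor, patch


def python_for(emr_version: str, spark_major: int) -> Optional[Tuple[str, str]]:
    """Return (python_version, python_path) per the table, or None if not listed."""
    if spark_major not in (2, 3):
        return None
    v = parse_version(emr_version)
    major = v[0]

    # EMR V3.46.0 or a later minor version, or EMR V5.12.0 or a later minor version
    if (major == 3 and v >= (3, 46, 0)) or (major == 5 and v >= (5, 12, 0)):
        if spark_major == 2:
            return ("3.6", "/bin/python3.6")
        return ("3.8", "/bin/python3.8")

    # EMR V3.43.0 to V3.45.1, or EMR V5.9.0 to V5.11.1 (same for Spark 2 and 3)
    if (3, 43, 0) <= v <= (3, 45, 1) or (5, 9, 0) <= v <= (5, 11, 1):
        return ("3.7", "/usr/local/bin/python3.7")

    # EMR V3.42.0 or EMR V5.8.0 (same for Spark 2 and 3)
    if v in ((3, 42, 0), (5, 8, 0)):
        return ("3.6", "/bin/python3.6")

    return None  # combination not covered by the table


if __name__ == "__main__":
    print(python_for("V5.12.0", 3))
```

The returned path could then be passed to Spark, for example through `PYSPARK_PYTHON`, the environment variable Spark reads to choose the Python executable.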

Install third-party Python libraries

The following steps use Python 3.8 as an example.

Step 1: Install pip3.8

In EMR V3.46.0 to EMR V3.48.0 and in EMR V5.12.0 to EMR V5.14.0, you do not need to install pip3.8 manually. On other versions, run the following command to install it:

sudo yum install -y python38-pip
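
If you are unsure whether a node already has pip3.8, a quick check before running the yum command may help. The sketch below assumes that `pip3.8` would be on the `PATH` when installed; `pip_available` is a hypothetical helper name.

```python
import shutil


def pip_available(command: str = "pip3.8") -> bool:
    """Return True if `command` resolves to an executable on PATH."""
    return shutil.which(command) is not None


if __name__ == "__main__":
    if pip_available():
        print("pip3.8 found; the yum step can be skipped")
    else:
        print("pip3.8 not found; install it with: sudo yum install -y python38-pip")
```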

Step 2: Install third-party libraries

Install third-party Python libraries such as NumPy and pandas:

pip3.8 install numpy pandas
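
After installation, it can be worth confirming that the libraries are importable from the same interpreter that PySpark uses (for example `/bin/python3.8`). The following sketch uses the standard library's `importlib.util.find_spec`; `missing_packages` is a hypothetical helper name, and it assumes the import name matches the package name, which holds for NumPy and pandas.

```python
import importlib.util
from typing import Iterable, List


def missing_packages(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found by this interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(["numpy", "pandas"])
    if missing:
        print("Not installed for this interpreter:", ", ".join(missing))
    else:
        print("All requested libraries are importable")
```

Because the check only reflects the node it runs on, it would need to be repeated on every node where the libraries are required.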

Step 3: Use a PyPI mirror for offline or accelerated installation

If a node cannot access the Internet, or if you want to speed up the installation, use a Python Package Index (PyPI) mirror provided by Alibaba Cloud.
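
As a rough illustration, pip can be pointed at a mirror with its `--index-url` option. The sketch below only builds the command line and does not run it. The mirror URL is an assumption used as a placeholder and is not taken from this topic; replace it with the mirror endpoint that is reachable from your nodes. `pip_install_command` is a hypothetical helper name.

```python
import sys
from typing import List, Sequence

# Assumption: placeholder mirror URL, not stated in this topic.
# Substitute the PyPI mirror endpoint that your cluster nodes can reach.
MIRROR_INDEX_URL = "https://mirrors.aliyun.com/pypi/simple/"


def pip_install_command(
    packages: Sequence[str],
    index_url: str = MIRROR_INDEX_URL,
    python: str = sys.executable,
) -> List[str]:
    """Build an argv list that installs `packages` from `index_url` using `python -m pip`."""
    return [python, "-m", "pip", "install", "--index-url", index_url, *packages]


if __name__ == "__main__":
    cmd = pip_install_command(["numpy", "pandas"], python="/bin/python3.8")
    print(" ".join(cmd))
    # To execute the command, pass it to subprocess.run(cmd, check=True).
```

Invoking pip as `python -m pip` ties the installation to a specific interpreter, which avoids installing into a different Python version than the one PySpark uses.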