E-MapReduce (EMR) DataLake and custom clusters support PySpark with Python 3. The Python version available depends on the EMR version and Spark version you use. This topic covers the version mappings and how to install third-party Python libraries on cluster nodes.
Python version mappings
The following table shows the Python version and executable path for each EMR and Spark version combination.
| EMR version | Spark version | Python version | Python path |
|---|---|---|---|
| EMR V3.46.0 or a later minor version, or EMR V5.12.0 or a later minor version | Spark 2 | Python 3.6 | /bin/python3.6 |
| EMR V3.46.0 or a later minor version, or EMR V5.12.0 or a later minor version | Spark 3 | Python 3.8 | /bin/python3.8 |
| EMR V3.43.0 to EMR V3.45.1, or EMR V5.9.0 to EMR V5.11.1 | Spark 2 | Python 3.7 | /usr/local/bin/python3.7 |
| EMR V3.43.0 to EMR V3.45.1, or EMR V5.9.0 to EMR V5.11.1 | Spark 3 | Python 3.7 | /usr/local/bin/python3.7 |
| EMR V3.42.0 or EMR V5.8.0 | Spark 2 | Python 3.6 | /bin/python3.6 |
| EMR V3.42.0 or EMR V5.8.0 | Spark 3 | Python 3.6 | /bin/python3.6 |
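To confirm which row of the table applies to a node, you can check whether the listed interpreter exists there. The following is a minimal sketch, not part of the official procedure; `check_python` is a hypothetical helper name, and `/bin/python3.8` is taken from the table row for Spark 3 on EMR V3.46.0 or later.

```shell
# Hypothetical helper: print the interpreter version if the given path is
# executable on this node, otherwise report that it is missing.
check_python() {
  if [ -x "$1" ]; then
    "$1" --version
  else
    echo "not found: $1"
  fi
}

# Path taken from the table above; replace it with the row that matches your cluster.
check_python /bin/python3.8
```

If the path is reported as not found, the cluster probably matches a different row of the table.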
Install third-party Python libraries
The following steps use Python 3.8 as an example.
Step 1: Install pip3.8
You do not need to manually install pip3.8 in EMR V3.46.0 to EMR V3.48.0 or in EMR V5.12.0 to EMR V5.14.0, so you can skip this step on those versions. On other versions, run the following command:
sudo yum install -y python38-pip
Step 2: Install third-party libraries
Install third-party Python libraries such as NumPy and pandas:
pip3.8 install numpy pandas
Step 3: Use a PyPI mirror for offline or accelerated installation
If a node cannot access the Internet, or if you want to speed up the installation, use a Python Package Index (PyPI) mirror provided by Alibaba Cloud.
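As a hedged sketch, pip's standard `--index-url` option can point an installation at a mirror. The URL below is an assumption, since this topic does not state the mirror address. A node without Internet access needs a mirror endpoint that is reachable from inside its network, so confirm the correct address before use. `install_from_mirror` is a hypothetical helper name.

```shell
# Assumed mirror URL (not given in this topic); verify it for your environment.
PYPI_MIRROR="https://mirrors.aliyun.com/pypi/simple/"

# Hypothetical helper: install the named packages from the mirror
# instead of the default PyPI index.
install_from_mirror() {
  pip3.8 install --index-url "$PYPI_MIRROR" "$@"
}
```

For example, `install_from_mirror numpy pandas` would install the same libraries as in Step 2, fetching them from the mirror.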