The Python environment that Spark requires in an E-MapReduce (EMR) DataLake or custom cluster varies based on the Spark version. This topic uses Python 3 as an example to describe the mappings between Spark versions and Python versions. This topic also describes how to install third-party Python libraries.
Mappings between Spark versions and Python versions
| EMR version | Spark version | Python version | Python path |
| --- | --- | --- | --- |
| EMR V3.46.0 or a later minor version, or EMR V5.12.0 or a later minor version | Spark 2 | Python 3.6 | /bin/python3.6 |
| | Spark 3 | Python 3.8 | /bin/python3.8 |
| EMR V3.43.0 to EMR V3.45.1, or EMR V5.9.0 to EMR V5.11.1 | Spark 2 | Python 3.7 | /usr/local/bin/python3.7 |
| | Spark 3 | Python 3.7 | /usr/local/bin/python3.7 |
| EMR V3.42.0 or EMR V5.8.0 | Spark 2 | Python 3.6 | /bin/python3.6 |
| | Spark 3 | Python 3.6 | /bin/python3.6 |
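If you want Spark to use a specific interpreter from the table above, you can point Spark at the corresponding Python path. The following is a sketch that uses the standard `spark.pyspark.python` configuration of Apache Spark; the path and the job file name (`your_job.py`) are placeholders that you must adjust to your EMR version:

```shell
# Submit a PySpark job with the interpreter that matches your Spark version.
# The path below assumes Spark 3 on EMR V3.46.0+ or EMR V5.12.0+ (see the table above).
spark-submit \
  --conf spark.pyspark.python=/bin/python3.8 \
  your_job.py
```

You can also set this property in spark-defaults.conf so that all jobs pick it up without a per-job flag.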
Install third-party Python libraries
Install pip3.8.
You do not need to manually install pip3.8 in EMR V3.46.0 to EMR V3.48.0 or in EMR V5.12.0 to EMR V5.14.0.
```shell
sudo yum install -y python38-pip
```
Install third-party Python libraries such as NumPy and pandas. Python 3.8 is used as an example.
```shell
pip3.8 install numpy pandas
```
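After the installation completes, you can verify that the libraries are importable. The following is a minimal check; run it with the interpreter that matches your Spark version, for example `python3.8`:

```python
# Confirm that NumPy and pandas are installed for this interpreter
# and print the installed versions.
import numpy
import pandas

print("numpy", numpy.__version__)
print("pandas", pandas.__version__)
```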
If an EMR node, such as a core or task node, cannot access the Internet, or if you want to accelerate the installation, you can use the Python Package Index (PyPI) mirror provided by Alibaba Cloud.
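For example, you can pass a mirror to pip with the standard `-i` (`--index-url`) option. The index URL below is the publicly available Alibaba Cloud PyPI mirror; confirm the correct URL for your region and network before relying on it:

```shell
# Install from the Alibaba Cloud PyPI mirror instead of the default index.
pip3.8 install -i https://mirrors.aliyun.com/pypi/simple/ numpy pandas
```

To avoid passing the option on every invocation, you can also persist it with `pip3.8 config set global.index-url <URL>`.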