The Python environment that Spark requires in an E-MapReduce (EMR) DataLake or custom cluster varies based on the Spark version. This topic uses Python 3 as an example to describe the mappings between Spark versions and Python versions. This topic also describes how to install third-party Python libraries.
Mappings between Spark versions and Python versions
| EMR version | Spark version | Python version | Python path |
| --- | --- | --- | --- |
| EMR V3.46.0 or a later minor version, or EMR V5.12.0 or a later minor version | Spark 2 | Python 3.6 | /bin/python3.6 |
| | Spark 3 | Python 3.8 | /bin/python3.8 |
| EMR V3.43.0 to EMR V3.45.1, or EMR V5.9.0 to EMR V5.11.1 | Spark 2 | Python 3.7 | /usr/local/bin/python3.7 |
| | Spark 3 | Python 3.7 | /usr/local/bin/python3.7 |
| EMR V3.42.0 or EMR V5.8.0 | Spark 2 | Python 3.6 | /bin/python3.6 |
| | Spark 3 | Python 3.6 | /bin/python3.6 |
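If you want Spark to use a specific interpreter from the table above, you can point Spark at the corresponding Python path. The following is a sketch that uses the standard `spark.pyspark.python` configuration of Apache Spark; the path and the job file name (`your_job.py`) are placeholders that you must adjust to your EMR version:

```shell
# Submit a PySpark job with the interpreter that matches your Spark version.
# The path below assumes Spark 3 on EMR V3.46.0+ or EMR V5.12.0+ (see the table above).
spark-submit \
  --conf spark.pyspark.python=/bin/python3.8 \
  your_job.py
```

You can also set this property in spark-defaults.conf so that all jobs pick it up without a per-job flag.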
Install third-party Python libraries
Install pip3.8.
You do not need to manually install pip3.8 in EMR V3.46.0 to EMR V3.48.0 or in EMR V5.12.0 to EMR V5.14.0.
```shell
sudo yum install -y python38-pip
```
Install third-party Python libraries such as NumPy and pandas. Python 3.8 is used as an example.
```shell
pip3.8 install numpy pandas
```
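After the installation completes, you can verify that the libraries are importable. The following is a minimal check; run it with the interpreter that matches your Spark version, for example `python3.8`:

```python
# Confirm that NumPy and pandas are installed for this interpreter
# and print the installed versions.
import numpy
import pandas

print("numpy", numpy.__version__)
print("pandas", pandas.__version__)
```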
If an EMR node, such as a core or task node, cannot access the Internet, or if you want to accelerate the installation, you can use the Python Package Index (PyPI) mirror provided by Alibaba Cloud.
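For example, you can pass a mirror to pip with the standard `-i` (`--index-url`) option. The index URL below is the publicly available Alibaba Cloud PyPI mirror; confirm the correct URL for your region and network before relying on it:

```shell
# Install from the Alibaba Cloud PyPI mirror instead of the default index.
pip3.8 install -i https://mirrors.aliyun.com/pypi/simple/ numpy pandas
```

To avoid passing the option on every invocation, you can also persist it with `pip3.8 config set global.index-url <URL>`.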