E-MapReduce:Introduction to the Python environment in PySpark

Last Updated: Oct 22, 2024

The Python environment that Spark depends on in an E-MapReduce (EMR) DataLake or custom cluster varies based on the Spark version. This topic uses Python 3 as an example to describe the mappings between Spark versions and Python versions. This topic also describes how to install third-party Python libraries.

Mappings between Spark versions and Python versions

| EMR version | Spark version | Python version | Python path |
| --- | --- | --- | --- |
| EMR V3.46.0 or a later minor version, or EMR V5.12.0 or a later minor version | Spark 2 | Python 3.6 | /bin/python3.6 |
| EMR V3.46.0 or a later minor version, or EMR V5.12.0 or a later minor version | Spark 3 | Python 3.8 | /bin/python3.8 |
| EMR V3.43.0 to EMR V3.45.1, or EMR V5.9.0 to EMR V5.11.1 | Spark 2 | Python 3.7 | /usr/local/bin/python3.7 |
| EMR V3.43.0 to EMR V3.45.1, or EMR V5.9.0 to EMR V5.11.1 | Spark 3 | Python 3.7 | /usr/local/bin/python3.7 |
| EMR V3.42.0 or EMR V5.8.0 | Spark 2 | Python 3.6 | /bin/python3.6 |
| EMR V3.42.0 or EMR V5.8.0 | Spark 3 | Python 3.6 | /bin/python3.6 |
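If you want to pin a PySpark job to a specific interpreter from the table above, you can set the standard Spark properties `spark.pyspark.python` and `spark.pyspark.driver.python` when you submit the job. The following is a minimal sketch; the application file name `example.py` is a placeholder, and the path assumes Spark 3 on EMR V3.46.0 or later (or EMR V5.12.0 or later):

```shell
# Submit a PySpark job with an explicit Python interpreter for both
# executors and the driver. Replace the paths with the values from the
# mapping table that match your cluster's EMR and Spark versions.
spark-submit \
  --conf spark.pyspark.python=/bin/python3.8 \
  --conf spark.pyspark.driver.python=/bin/python3.8 \
  example.py
```

Setting both properties keeps the driver and executors on the same Python version, which avoids serialization mismatches between mixed interpreter versions.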

Install third-party Python libraries

  • Install pip3.8.

    You do not need to manually install pip3.8 in EMR V3.46.0 to EMR V3.48.0 or in EMR V5.12.0 to EMR V5.14.0.

    sudo yum install -y python38-pip
  • Install third-party Python libraries such as NumPy and pandas. Python 3.8 is used as an example.

    pip3.8 install numpy pandas
  • If an EMR node, such as a core or task node, cannot access the Internet or you want to accelerate the installation process, you can use a Python Package Index (PyPI) image provided by Alibaba Cloud.
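    To use the mirror, pass it to pip with the `-i` option, or set it as the default index. The mirror URL below is the publicly documented Alibaba Cloud PyPI mirror; if your nodes run in a VPC without Internet access, substitute the internal endpoint for your region:

    ```shell
    # One-off install from the Alibaba Cloud PyPI mirror.
    pip3.8 install -i https://mirrors.aliyun.com/pypi/simple/ numpy pandas

    # Or make the mirror the default index for all subsequent installs.
    pip3.8 config set global.index-url https://mirrors.aliyun.com/pypi/simple/
    ```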