This topic provides answers to frequently asked questions about Spark.

How do I use Python 3 in PySpark jobs?

By default, Python 2 is used in PySpark jobs in E-MapReduce (EMR). This section describes how to change the Python version in PySpark jobs to Python 3. An EMR V3.35 cluster is used in the examples.

You can use one of the following methods to change the Python version:
  • Temporarily change the version (the change applies only to the current SSH session)
    1. Log on to your cluster in SSH mode. For more information, see Log on to a cluster.
    2. Run the following command to change the Python version:
      export PYSPARK_PYTHON=/usr/bin/python3
    3. Run the following command to start the PySpark shell and check the Python version in its startup output:
      pyspark
      If the output contains the following information, the Python version is changed to Python 3:
      Using Python version 3.6.8 (default, Apr 20 2020 14:49:33)
  • Permanently change the version
    Notice: If you use this method to change the Python version, the new Python version takes effect on the entire cluster and may cause exceptions in the cluster. Proceed with caution.
    1. Log on to your cluster in SSH mode. For more information, see Log on to a cluster.
    2. Modify the configuration file.
      1. Run the following command to open the profile file:
        vi /etc/profile
      2. Press the I key to enter edit mode.
      3. Add the following information to the end of the profile file to change the Python version.
        export PYSPARK_PYTHON=/usr/bin/python3
        export
      4. Press the Esc key to exit the edit mode. Then, enter :wq to save and close the file.
    3. Run the following command for the configuration to take effect:
      source /etc/profile
    4. Run the following command to start the PySpark shell and check the Python version in its startup output:
      pyspark
      If the output contains the following information, the Python version is changed to Python 3:
      Using Python version 3.6.8 (default, Apr 20 2020 14:49:33)
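
The manual edit in the permanent method can also be scripted. The following is a minimal sketch, not an EMR-provided tool: the function name set_pyspark_python is hypothetical, and it assumes the interpreter path and profile file are passed as arguments. It checks that the interpreter exists before writing anything and skips the append if the line is already present, so running it twice does not duplicate the entry.

```shell
# Hypothetical helper (not part of EMR): append the PYSPARK_PYTHON export to a
# profile file, but only if the interpreter exists and the line is not there yet.
set_pyspark_python() {
    interpreter="$1"   # for example, /usr/bin/python3
    profile="$2"       # for example, /etc/profile

    # Refuse to point PySpark at an interpreter that is missing or not executable.
    if [ ! -x "$interpreter" ]; then
        echo "interpreter not found: $interpreter" >&2
        return 1
    fi

    line="export PYSPARK_PYTHON=$interpreter"

    # grep -qxF: quiet, whole-line, fixed-string match, so reruns add nothing.
    if ! grep -qxF "$line" "$profile" 2>/dev/null; then
        printf '%s\n' "$line" >> "$profile"
    fi
}
```

A possible invocation, assuming the same paths as in the steps above, is set_pyspark_python /usr/bin/python3 /etc/profile followed by source /etc/profile. Writing to /etc/profile normally requires root permissions, and the cluster-wide caution in the notice above still applies.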