This topic describes how to develop a Spark on MaxCompute application by using PySpark.

If you want to access MaxCompute tables, you must compile the odps-spark-datasource package. For more information, see Set up a Spark on MaxCompute development environment.

Develop a Spark SQL application in Spark 1.6

  1. Compile the following code:
    from pyspark import SparkContext, SparkConf
    from pyspark.sql import OdpsContext
    if __name__ == '__main__':
        conf = SparkConf().setAppName("odps_pyspark")
        sc = SparkContext(conf=conf)
        sql_context = OdpsContext(sc)
        sql_context.sql("DROP TABLE IF EXISTS spark_sql_test_table")
        sql_context.sql("CREATE TABLE spark_sql_test_table(name STRING, num BIGINT)")
        sql_context.sql("INSERT INTO TABLE spark_sql_test_table SELECT 'abc', 100000")
        sql_context.sql("SELECT * FROM spark_sql_test_table").show()
        sql_context.sql("SELECT COUNT(*) FROM spark_sql_test_table").show()
  2. Run the following command to submit the code:
    ./bin/spark-submit \
    --jars cupid/odps-spark-datasource_xxx.jar \
    example.py
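The OdpsContext created above behaves like a standard Spark 1.6 SQLContext, so you can also work with the table through the DataFrame API instead of embedding every step in SQL. The following is a minimal sketch, assuming the spark_sql_test_table created in the previous example already exists and that OdpsContext exposes the standard SQLContext methods such as table and sql:
    from pyspark import SparkContext, SparkConf
    from pyspark.sql import OdpsContext
    if __name__ == '__main__':
        conf = SparkConf().setAppName("odps_pyspark_df")
        sc = SparkContext(conf=conf)
        sql_context = OdpsContext(sc)
        # Read the table created in the previous example as a DataFrame.
        df = sql_context.table("spark_sql_test_table")
        # Standard DataFrame operations work on the result.
        df.filter(df.num > 0).select("name", "num").show()
        print(df.count())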

Develop a Spark SQL application in Spark 2.3

  1. Compile the following code:
    from pyspark.sql import SparkSession
    if __name__ == '__main__':
        spark = SparkSession.builder.appName("spark sql").getOrCreate()
        spark.sql("DROP TABLE IF EXISTS spark_sql_test_table")
        spark.sql("CREATE TABLE spark_sql_test_table(name STRING, num BIGINT)")
        spark.sql("INSERT INTO spark_sql_test_table SELECT 'abc', 100000")
        spark.sql("SELECT * FROM spark_sql_test_table").show()
        spark.sql("SELECT COUNT(*) FROM spark_sql_test_table").show()
  2. Run one of the following commands based on which mode you use:
    • In cluster mode, run the following command to submit the code:
      spark-submit --master yarn-cluster \
      --jars cupid/odps-spark-datasource_xxx.jar \
      example.py
    • In local mode, run the following command to submit the code:
      cd $SPARK_HOME
      ./bin/spark-submit --master local[4] \
      --driver-class-path cupid/odps-spark-datasource_xxx.jar \
      /path/to/odps-spark-examples/spark-examples/src/main/python/spark_sql.py
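In Spark 2.3, the SparkSession also provides the DataFrame API, so you can read a MaxCompute table, transform it, and write results back without expressing every step in SQL. The following is a minimal sketch, assuming the spark_sql_test_table created above already exists; the aggregation is illustrative:
    from pyspark.sql import SparkSession
    if __name__ == '__main__':
        spark = SparkSession.builder.appName("spark sql df").getOrCreate()
        # Read the table created above as a DataFrame.
        df = spark.table("spark_sql_test_table")
        # Aggregate with the DataFrame API instead of SQL.
        df.groupBy("name").sum("num").show()
        # Append the original rows back into the same table.
        df.write.insertInto("spark_sql_test_table")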

Required packages

A Python library cannot be directly installed in the MaxCompute cluster. If PySpark depends on a Python library, plugin, or project, you must package it on your computer and upload the package when you run the spark-submit script. The Python version that you use for packaging must be the same as the Python version in which your Spark on MaxCompute jobs run. You can use one of the following two packaging methods: compile egg packages, or package a full Python environment.
  • Compile egg packages on your computer.
    For example, if MLlib requires the NumPy and Setuptools plugins, you must compile egg packages for them and then upload the packages by using the --py-files parameter. A minimal app.py that uses the uploaded NumPy egg is sketched after these steps. Detailed steps are as follows:
    Note Spark on MaxCompute jobs run in Python 2.7. Therefore, you must package the plugins by using Python 2.7 on your computer.
    1. To compile egg packages for the NumPy and Setuptools plugins, follow these steps:
      1. Download the NumPy and Setuptools source packages from the Python Package Index (PyPI).
      2. Enter the source code path of the Setuptools plugin and run the python setup.py bdist_egg script. An egg file is generated in the dist directory.
      3. Enter the source code path of the NumPy plugin and run the python setupegg.py bdist_egg script. An egg file is generated in the dist directory.
    2. Run the following command to submit your Spark on MaxCompute jobs:
      cd $SPARK_HOME
      ./bin/spark-submit --master yarn-cluster \
      --jars cupid/odps-spark-datasource_2.11-3.3.2-hotfix1.jar \
      --py-files /path/to/numpy-1.7.1-py2.7-linux-x86_64.egg,/path/to/setuptools-33.1.1-py2.7.egg \
      app.py
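    For reference, the following is a minimal app.py sketch that uses the NumPy egg shipped with --py-files; the application name and the computation are illustrative:
      from pyspark import SparkContext, SparkConf
      import numpy as np  # resolved from the egg uploaded with --py-files
      if __name__ == '__main__':
          conf = SparkConf().setAppName("numpy_egg_example")
          sc = SparkContext(conf=conf)
          # Use NumPy inside executor-side code to confirm that the egg is available on the executors.
          rdd = sc.parallelize(range(100), 4)
          print(rdd.map(lambda x: float(np.sqrt(x))).sum())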
  • Package a full Python environment on your computer.
    If the Spark on MaxCompute application depends on a large number of plugins, or if any of these plugins contain files such as .so files that cannot be imported by the zipimport module, you must download all the plugins, package them together with a Python environment, and then upload the package. Detailed steps are as follows:
    1. Add the following configuration information to the spark-defaults.conf file:
      spark.pyspark.python=./public.python-2.7-ucs4.zip/python-2.7-ucs4/bin/python2.7
    2. Run the following command to submit your Spark on MaxCompute jobs:
      cd $SPARK_HOME
      ./bin/spark-submit --master yarn-cluster \
      --jars cupid/odps-spark-datasource_2.11-3.3.2-hotfix1.jar \
      --archives ./python-2.7-ucs4.zip app.py
    If you do not want to upload the Python environment by using the --archives parameter each time, you can reference it as a public resource instead.
    1. Add the following configuration information to the spark-defaults.conf file:
      spark.hadoop.odps.cupid.resources=public.python-2.7-ucs4.zip
      spark.pyspark.python=./public.python-2.7-ucs4.zip/python-2.7-ucs4/bin/python2.7
    2. Run the following command to submit your Spark on MaxCompute jobs:
      cd $SPARK_HOME
      ./bin/spark-submit --master yarn-cluster \
      --jars cupid/odps-spark-datasource_2.11-3.3.2-hotfix1.jar app.py
    If the Spark on MaxCompute application depends on other plugins, you can package them by referring to the following script:
    work_root=`dirname $0`
    work_root=`cd ${work_root}; pwd`
    # Step 1 compile python
    # 1.1 python source code
    cd ${work_root}
    if [ ! -f Python-2.7.13.tgz ]; then
        wget https://www.python.org/ftp/python/2.7.13/Python-2.7.13.tgz
    fi
    # 1.2 configure && make && make install
    if [ ! -d ${work_root}/Python-2.7.13 ]; then
        cd ${work_root}
        tar xf ${work_root}/Python-2.7.13.tgz
    fi
    if [ -d ${work_root}/python-2.7-ucs4 ]; then
        rm -rf ${work_root}/python-2.7-ucs4
    fi
    cd ${work_root}/Python-2.7.13
    ./configure --prefix=${work_root}/python-2.7-ucs4 --enable-unicode=ucs4
    sed -i 's/#.*zlib zlibmodule.c/zlib zlibmodule.c/g' Modules/Setup
    make -j20
    make install
    # 1.3 install pip
    cd ${work_root}
    if [ ! -f get-pip.py ]; then
        curl -s https://bootstrap.pypa.io/get-pip.py -o ${work_root}/get-pip.py
    fi
    ${work_root}/python-2.7-ucs4/bin/python ${work_root}/get-pip.py
    # 1.4 install numpy
    ${work_root}/python-2.7-ucs4/bin/pip install numpy
    # 1.6 make python zip
    if [ -f ${work_root}/python-2.7-ucs4.zip ]; then
        rm -rf ${work_root}/python-2.7-ucs4.zip
    fi
    cd ${work_root}
    zip -r ${work_root}/python-2.7-ucs4.zip python-2.7-ucs4
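    After the archive is shipped with --archives (or referenced as a public resource) and spark.pyspark.python points into it, you can verify from inside a job that the driver and the executors run the packaged interpreter. The following is a minimal sketch; the application name is illustrative:
    import sys
    from pyspark import SparkContext, SparkConf
    import numpy as np  # installed into the packaged interpreter by the script above
    if __name__ == '__main__':
        conf = SparkConf().setAppName("python_env_check")
        sc = SparkContext(conf=conf)
        # Driver-side interpreter path, Python version, and NumPy version.
        print(sys.executable, sys.version, np.__version__)
        # Executor-side interpreter paths, collected from each partition.
        print(sc.parallelize(range(4), 4).map(lambda _: sys.executable).collect())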