This topic describes how to set up a Spark on MaxCompute development environment. To do so, you need to download the Spark on MaxCompute client, set environment variables, configure the spark-defaults.conf file, and configure dependencies.

Download the Spark on MaxCompute client

The Spark on MaxCompute software packages are integrated with the MaxCompute authentication feature. This allows Spark on MaxCompute to serve as a client that submits jobs by using the spark-submit script. Two Spark on MaxCompute software packages are provided to meet different needs.

Set environment variables

  • Set the JAVA_HOME environment variable as follows:
    # We recommend that you use JDK 1.8 or later.
    export JAVA_HOME=/path/to/jdk
    export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
    export PATH=$JAVA_HOME/bin:$PATH
  • Set the SPARK_HOME environment variable as follows:
    export SPARK_HOME=/path/to/spark_extracted_package
    export PATH=$SPARK_HOME/bin:$PATH
  • If you use PySpark, install Python 2.7 and set the PATH environment variable as follows:
    export PATH=/path/to/python/bin/:$PATH
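
After you set these variables, you can confirm that the expected JDK and Spark on MaxCompute client are picked up from your PATH. A quick check (the output depends on your installation):

# Print the JDK version, the Spark client location, and the Spark version.
java -version
echo $SPARK_HOME
spark-submit --version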

Configure the spark-defaults.conf file

You can use the spark-defaults.conf.template file in the $SPARK_HOME/conf directory as a template for your own spark-defaults.conf file. Before you submit Spark on MaxCompute jobs, you must set the required MaxCompute account and region information in this file.

In the spark-defaults.conf file, you can retain the default settings and enter the MaxCompute account information as follows:

# Set the MaxCompute project and account information:
spark.hadoop.odps.project.name =
spark.hadoop.odps.access.id =
spark.hadoop.odps.access.key =

# Configure the endpoint through which the Spark on MaxCompute client accesses MaxCompute projects (this endpoint varies depending on the network conditions and region):
spark.hadoop.odps.end.point = http://service.cn.maxcompute.aliyun.com/api
# Configure the endpoint of the environment in which Spark on MaxCompute jobs run (the VPC endpoint of MaxCompute in your region):
spark.hadoop.odps.runtime.end.point = http://service.cn.maxcompute.aliyun-inc.com/api

# Retain the following default settings:
spark.sql.catalogImplementation=odps
spark.hadoop.odps.task.major.version = cupid_v2
spark.hadoop.odps.cupid.container.image.enable = true
spark.hadoop.odps.cupid.container.vm.engine.type = hyper
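
After the account and endpoint information is in place, the Spark on MaxCompute client reads spark-defaults.conf from $SPARK_HOME/conf automatically when you submit a job. A minimal submission in cluster mode might look like the following; the main class and JAR path are placeholders for your own job:

# Submit a job in cluster mode; the configuration above is applied automatically.
cd $SPARK_HOME
bin/spark-submit --master yarn-cluster \
    --class com.example.YourSparkJob \
    /path/to/your-spark-job.jar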

Configure dependencies

  • Configure the dependencies for Spark on MaxCompute jobs to access MaxCompute tables.
    Spark on MaxCompute jobs use the odps-spark-datasource module to access MaxCompute tables. The Maven coordinates of this module are as follows (a usage sketch follows this list):
    <!-- Spark-2.x uses the following module:-->
    <dependency>
        <groupId>com.aliyun.odps</groupId>
        <artifactId>odps-spark-datasource_2.11</artifactId>
        <version>3.3.3-public</version>
    </dependency>
    
    <!-- Spark-1.x uses the following module:-->
    <dependency>
        <groupId>com.aliyun.odps</groupId>
        <artifactId>odps-spark-datasource_2.10</artifactId>
        <version>3.3.3-public</version>
    </dependency>
  • Configure the dependencies for Spark on MaxCompute jobs to access OSS.
    If Spark on MaxCompute jobs need to access OSS, add the following dependencies (an OSS read sketch also follows this list):
    <dependency>
        <groupId>com.aliyun.odps</groupId>
        <artifactId>hadoop-fs-oss</artifactId>
        <version>3.3.3-public</version>
    </dependency>
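
For reference, the following is a minimal Scala sketch of a Spark 2.x job that reads a MaxCompute table through the odps catalog configured in spark-defaults.conf. The table name my_mc_table is a placeholder; replace it with a table in your MaxCompute project.

import org.apache.spark.sql.SparkSession

object MaxComputeTableExample {
  def main(args: Array[String]): Unit = {
    // With spark.sql.catalogImplementation=odps, Spark SQL resolves table
    // names directly against the configured MaxCompute project.
    val spark = SparkSession.builder()
      .appName("MaxComputeTableExample")
      .getOrCreate()

    // my_mc_table is a placeholder table name.
    val df = spark.sql("SELECT * FROM my_mc_table LIMIT 10")
    df.show()

    spark.stop()
  }
}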
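
Similarly, once the hadoop-fs-oss dependency is packaged with your job and OSS access (credentials and endpoint) is configured in spark-defaults.conf, the job can read objects through an oss:// path. A sketch that reuses the SparkSession from the previous example; the bucket and path are placeholders:

// Assumes `spark` is the SparkSession built above and that OSS access is configured.
val ossLines = spark.read.textFile("oss://your-bucket/path/to/input")
println(ossLines.count())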