This topic describes how to develop a Spark on MaxCompute application by using Java or Scala.

Download an example project

You can run the following commands to download and build an example project:
git clone git@github.com:aliyun/aliyun-cupid-sdk.git
cd aliyun-cupid-sdk
git checkout 3.3.3-public
# Go to the example project for Spark-2.x.
cd spark/spark-2.x/spark-examples
# Alternatively, go to the example project for Spark-1.x.
cd spark/spark-1.x/spark-examples
# Build the project. The shaded JAR package is generated in the target directory.
mvn clean package
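
The submission steps in this topic all require you to configure the spark-defaults.conf file in the conf directory of the Spark on MaxCompute client. The following sketch is illustrative only: the property names follow the template that ships with the client, so verify them against your client version, and every value is a placeholder that you must replace with your own project name, AccessKey pair, and region endpoints.
# conf/spark-defaults.conf (all values are placeholders)
spark.hadoop.odps.project.name = your_project_name
spark.hadoop.odps.access.id = your_accesskey_id
spark.hadoop.odps.access.key = your_accesskey_secret
spark.hadoop.odps.end.point = <MaxCompute endpoint of your region>
spark.hadoop.odps.runtime.end.point = <runtime endpoint of your region>
# For Spark-2.x SQL jobs, the MaxCompute catalog is typically enabled as well:
spark.sql.catalogImplementation = odps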

Configure dependencies for Spark-1.x

If you want to submit your Spark-1.x application by using the Spark on MaxCompute client, you must add the following dependencies to the pom.xml file:
<properties>
    <spark.version>1.6.3</spark.version>
    <cupid.sdk.version>3.3.3-public</cupid.sdk.version>
    <scala.version>2.10.4</scala.version>
    <scala.binary.version>2.10</scala.binary.version>
</properties>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_${scala.binary.version}</artifactId>
    <version>${spark.version}</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_${scala.binary.version}</artifactId>
    <version>${spark.version}</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-mllib_${scala.binary.version}</artifactId>
    <version>${spark.version}</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_${scala.binary.version}</artifactId>
    <version>${spark.version}</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>com.aliyun.odps</groupId>
    <artifactId>cupid-sdk</artifactId>
    <version>${cupid.sdk.version}</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>com.aliyun.odps</groupId>
    <artifactId>hadoop-fs-oss</artifactId>
    <version>${cupid.sdk.version}</version>
</dependency>
<dependency>
    <groupId>com.aliyun.odps</groupId>
    <artifactId>odps-spark-datasource_${scala.binary.version}</artifactId>
    <version>${cupid.sdk.version}</version>
</dependency>
<dependency>
    <groupId>org.scala-lang</groupId>
    <artifactId>scala-library</artifactId>
    <version>${scala.version}</version>
</dependency>
<dependency>
    <groupId>org.scala-lang</groupId>
    <artifactId>scala-actors</artifactId>
    <version>${scala.version}</version>
</dependency>
Note You need to set the scope parameter as follows:
  • Set it to provided for all packages that are released in the Apache Spark community, such as spark-core and spark-sql. This way, the shaded JAR does not bundle the Spark classes that the MaxCompute cluster already provides.
  • Set it to compile for the odps-spark-datasource module. compile is the default Maven scope, so dependencies that declare no scope in the preceding listing, such as odps-spark-datasource and hadoop-fs-oss, already use this scope.

Develop a Spark-1.x application

  • Develop the WordCount application.

    The source code for this example is included in the aliyun-cupid-sdk repository that you downloaded. A minimal sketch of such a program also appears after this list.

    To submit the code, follow these steps:
    1. Build the aliyun-cupid-sdk module.
    2. Configure the spark-defaults.conf file.
    3. Run the following script:
      bin/spark-submit --master yarn-cluster --class \
      com.aliyun.odps.spark.examples.WordCount \
      ${path to aliyun-cupid-sdk}/spark/spark-1.x/spark-examples/target/spark-examples_2.10-version-shaded.jar
  • Develop the Spark SQL application on MaxCompute tables.

    The source code for this example is included in the aliyun-cupid-sdk repository that you downloaded.

    To submit the code, follow these steps:
    1. Build the aliyun-cupid-sdk module.
    2. Configure the spark-defaults.conf file.
    3. Run the following script:
      bin/spark-submit --master yarn-cluster --class \
      com.aliyun.odps.spark.examples.sparksql.SparkSQL \
      ${path to aliyun-cupid-sdk}/spark/spark-1.x/spark-examples/target/spark-examples_2.10-version-shaded.jar
    Note
    • If a "Table Not Found" error is returned, the table that you specify in the code does not exist in your MaxCompute project. Change the table name in the code to a table that exists in your project.
    • You can refer to the APIs that are demonstrated in the example code to develop a Spark SQL application for your own tables.
  • Develop the GraphX PageRank application.

    The source code for this example is included in the aliyun-cupid-sdk repository that you downloaded. A sketch of a minimal PageRank job appears after this list.

    To submit the code, follow these steps:
    1. Build the aliyun-cupid-sdk module.
    2. Configure the spark-defaults.conf file.
    3. Run the following script:
      bin/spark-submit --master yarn-cluster --class \
      com.aliyun.odps.spark.examples.graphx.PageRank \
      ${path to aliyun-cupid-sdk}/spark/spark-1.x/spark-examples/target/spark-examples_2.10-version-shaded.jar
  • Develop the MLlib Kmeans-ON-OSS application.

    The source code for this example is included in the aliyun-cupid-sdk repository that you downloaded. A training-and-save sketch appears after this list.

    Note Before you submit the code, make sure that you enter the following OSS account information in the code:
    conf.set("spark.hadoop.fs.oss.accessKeyId", "***")
    conf.set("spark.hadoop.fs.oss.accessKeySecret", "***")
    conf.set("spark.hadoop.fs.oss.endpoint", "oss-cn-hangzhou-zmf.aliyuncs.com")
    To submit the code, follow these steps:
    1. Build the aliyun-cupid-sdk module.
    2. Configure the spark-defaults.conf file.
    3. Run the following script:
      bin/spark-submit --master yarn-cluster --class \
      com.aliyun.odps.spark.examples.mllib.KmeansModelSaveToOss \
      ${path to aliyun-cupid-sdk}/spark/spark-1.x/spark-examples/target/spark-examples_2.10-version-shaded.jar
  • Develop the OSS UnstructuredData application.

    The source code for this example is included in the aliyun-cupid-sdk repository that you downloaded. A minimal sketch appears after this list.

    Note Before you submit the code, make sure that you enter the following OSS account information in the code:
    conf.set("spark.hadoop.fs.oss.accessKeyId", "***")
    conf.set("spark.hadoop.fs.oss.accessKeySecret", "***")
    conf.set("spark.hadoop.fs.oss.endpoint", "oss-cn-hangzhou-zmf.aliyuncs.com")
    To submit the code, follow these steps:
    1. Build the aliyun-cupid-sdk module.
    2. Configure the spark-defaults.conf file.
    3. Run the following script:
      bin/spark-submit --master yarn-cluster --class \
      com.aliyun.odps.spark.examples.oss.SparkUnstructuredDataCompute \
      ${path to aliyun-cupid-sdk}/spark/spark-1.x/spark-examples/target/spark-examples_2.10-version-shaded.jar
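
The following is a minimal sketch of a Spark-1.x WordCount program. It uses only the stock RDD API; the WordCount class that ships in spark-examples may differ in detail, and the toy input is a placeholder for real data.
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCount")
    val sc = new SparkContext(conf)
    // Toy input; a real job would read from a MaxCompute table or from OSS.
    val lines = sc.parallelize(Seq("hello maxcompute", "hello spark"))
    lines.flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .collect()
      .foreach(println)
    sc.stop()
  }
}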
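
The GraphX PageRank example can be approximated with the following sketch. The three-node graph is a toy placeholder, and the shipped PageRank class may build its graph differently.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

object PageRank {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("PageRank")
    val sc = new SparkContext(conf)
    // Toy directed graph: 1 -> 2, 2 -> 3, 3 -> 1.
    val edges = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 1L, 1)))
    val graph = Graph.fromEdges(edges, defaultValue = 1)
    // Iterate until the ranks converge within the given tolerance.
    graph.pageRank(0.001).vertices.collect().foreach(println)
    sc.stop()
  }
}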
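
For the MLlib Kmeans-ON-OSS example, the following sketch shows the overall shape of the job: set the OSS credentials on the SparkConf as described in the note above, train an MLlib KMeans model, and save it to an OSS path. The training data and the bucket name are placeholders.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object KmeansModelSaveToOss {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("KmeansModelSaveToOss")
    conf.set("spark.hadoop.fs.oss.accessKeyId", "***")
    conf.set("spark.hadoop.fs.oss.accessKeySecret", "***")
    conf.set("spark.hadoop.fs.oss.endpoint", "oss-cn-hangzhou-zmf.aliyuncs.com")
    val sc = new SparkContext(conf)
    // Toy training data: three 2-dimensional points.
    val points = sc.parallelize(Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(1.0, 1.0), Vectors.dense(9.0, 8.0)))
    val model = KMeans.train(points, 2, 20) // k = 2, 20 iterations
    // your-bucket is a placeholder for an OSS bucket that you own.
    model.save(sc, "oss://your-bucket/kmeans-model")
    sc.stop()
  }
}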
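
The OSS UnstructuredData example reads raw files from OSS. A minimal sketch follows; oss://your-bucket/your-path is a placeholder for real OSS data.
import org.apache.spark.{SparkConf, SparkContext}

object SparkUnstructuredDataCompute {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SparkUnstructuredDataCompute")
    conf.set("spark.hadoop.fs.oss.accessKeyId", "***")
    conf.set("spark.hadoop.fs.oss.accessKeySecret", "***")
    conf.set("spark.hadoop.fs.oss.endpoint", "oss-cn-hangzhou-zmf.aliyuncs.com")
    val sc = new SparkContext(conf)
    // Read unstructured text data directly from OSS and count the lines.
    val data = sc.textFile("oss://your-bucket/your-path")
    println(s"line count: ${data.count()}")
    sc.stop()
  }
}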

Configure dependencies for Spark-2.x

If you want to submit your Spark-2.x application by using the Spark on MaxCompute client, you must add the following dependencies to the pom.xml file:
<properties>
    <spark.version>2.3.0</spark.version>
    <cupid.sdk.version>3.3.3-public</cupid.sdk.version>
    <scala.version>2.11.8</scala.version>
    <scala.binary.version>2.11</scala.binary.version>
</properties>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_${scala.binary.version}</artifactId>
    <version>${spark.version}</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_${scala.binary.version}</artifactId>
    <version>${spark.version}</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-mllib_${scala.binary.version}</artifactId>
    <version>${spark.version}</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_${scala.binary.version}</artifactId>
    <version>${spark.version}</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>com.aliyun.odps</groupId>
    <artifactId>cupid-sdk</artifactId>
    <version>${cupid.sdk.version}</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>com.aliyun.odps</groupId>
    <artifactId>hadoop-fs-oss</artifactId>
    <version>${cupid.sdk.version}</version>
</dependency>
<dependency>
    <groupId>com.aliyun.odps</groupId>
    <artifactId>odps-spark-datasource_${scala.binary.version}</artifactId>
    <version>${cupid.sdk.version}</version>
</dependency>
<dependency>
    <groupId>org.scala-lang</groupId>
    <artifactId>scala-library</artifactId>
    <version>${scala.version}</version>
</dependency>
<dependency>
    <groupId>org.scala-lang</groupId>
    <artifactId>scala-actors</artifactId>
    <version>${scala.version}</version>
</dependency>
Note You need to set the scope parameter as follows:
  • Set it to provided for all packages that are released in the Apache Spark community, such as spark-core and spark-sql. This way, the shaded JAR does not bundle the Spark classes that the MaxCompute cluster already provides.
  • Set it to compile for the odps-spark-datasource module. compile is the default Maven scope, so dependencies that declare no scope in the preceding listing, such as odps-spark-datasource and hadoop-fs-oss, already use this scope.

Develop a Spark-2.x application

  • Develop the WordCount application.

    The source code for this example is included in the aliyun-cupid-sdk repository that you downloaded.

    To submit the code, follow these steps:
    1. Build the aliyun-cupid-sdk module.
    2. Configure the spark-defaults.conf file.
    3. Run the following script:
      bin/spark-submit --master yarn-cluster --class \
      com.aliyun.odps.spark.examples.WordCount \
      ${path to aliyun-cupid-sdk}/spark/spark-2.x/spark-examples/target/spark-examples_2.11-version-shaded.jar
  • Develop the Spark SQL application on MaxCompute tables.

    The source code for this example is included in the aliyun-cupid-sdk repository that you downloaded. A minimal Spark SQL sketch appears after this list.

    Note
    • If a "Table Not Found" error is returned, the table that you specify in the code does not exist in your MaxCompute project. Change the table name in the code to a table that exists in your project.
    • You can refer to the APIs that are demonstrated in the example code to develop a Spark SQL application for your own tables.
    To submit the code, follow these steps:
    1. Build the aliyun-cupid-sdk module.
    2. Configure the spark-defaults.conf file.
    3. Run the following script:
      bin/spark-submit --master yarn-cluster --class \
      com.aliyun.odps.spark.examples.sparksql.SparkSQL \
      ${path to aliyun-cupid-sdk}/spark/spark-2.x/spark-examples/target/spark-examples_2.11-version-shaded.jar
  • Develop the GraphX PageRank application.

    The source code for this example is included in the aliyun-cupid-sdk repository that you downloaded.

    To submit the code, follow these steps:
    1. Build the aliyun-cupid-sdk module.
    2. Configure the spark-defaults.conf file.
    3. Run the following script:
      bin/spark-submit --master yarn-cluster --class \
      com.aliyun.odps.spark.examples.graphx.PageRank \
      ${path to aliyun-cupid-sdk}/spark/spark-2.x/spark-examples/target/spark-examples_2.11-version-shaded.jar
  • Develop the MLlib Kmeans-ON-OSS application.

    The source code for this example is included in the aliyun-cupid-sdk repository that you downloaded. A training-and-save sketch appears after this list.

    Note Before you submit the code, make sure that you enter the following OSS account information in the code:
    val spark = SparkSession
          .builder()
          .config("spark.hadoop.fs.oss.accessKeyId", "***")
          .config("spark.hadoop.fs.oss.accessKeySecret", "***")
          .config("spark.hadoop.fs.oss.endpoint", "oss-cn-hangzhou-zmf.aliyuncs.com")
          .appName("KmeansModelSaveToOss")
          .getOrCreate()
    To submit the code, follow these steps:
    1. Build the aliyun-cupid-sdk module.
    2. Configure the spark-defaults.conf file.
    3. Run the following script:
      bin/spark-submit --master yarn-cluster --class \
      com.aliyun.odps.spark.examples.mllib.KmeansModelSaveToOss \
      ${path to aliyun-cupid-sdk}/spark/spark-2.x/spark-examples/target/spark-examples_2.11-version-shaded.jar
  • Develop the OSS UnstructuredData application.

    The source code for this example is included in the aliyun-cupid-sdk repository that you downloaded. A minimal sketch appears after this list.

    Note Before you submit the code, make sure that you enter the following OSS account information in the code:
    val spark = SparkSession
          .builder()
          .config("spark.hadoop.fs.oss.accessKeyId", "***")
          .config("spark.hadoop.fs.oss.accessKeySecret", "***")
          .config("spark.hadoop.fs.oss.endpoint", "oss-cn-hangzhou-zmf.aliyuncs.com")
          .appName("SparkUnstructuredDataCompute")
          .getOrCreate()
    To submit the code, follow these steps:
    1. Build the aliyun-cupid-sdk module.
    2. Configure the spark-defaults.conf file.
    3. Run the following script:
      bin/spark-submit --master yarn-cluster --class \
      com.aliyun.odps.spark.examples.oss.SparkUnstructuredDataCompute \
      ${path to aliyun-cupid-sdk}/spark/spark-2.x/spark-examples/target/spark-examples_2.11-version-shaded.jar
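
The Spark SQL example can be approximated with the following Spark-2.x sketch. It assumes that the MaxCompute catalog is enabled in spark-defaults.conf, as in the configuration sketch at the beginning of this topic, and mc_test_table is a placeholder for a table in your project.
import org.apache.spark.sql.SparkSession

object SparkSQL {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("SparkSQL")
      .getOrCreate()
    // mc_test_table is a placeholder; replace it with a table in your project.
    val df = spark.sql("SELECT * FROM mc_test_table LIMIT 10")
    df.show()
    spark.stop()
  }
}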
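
Building on the SparkSession shown in the Kmeans-ON-OSS note above, the following sketch trains an MLlib KMeans model and saves it to OSS. The training data and the bucket name are placeholders.
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.SparkSession

object KmeansModelSaveToOss {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .config("spark.hadoop.fs.oss.accessKeyId", "***")
      .config("spark.hadoop.fs.oss.accessKeySecret", "***")
      .config("spark.hadoop.fs.oss.endpoint", "oss-cn-hangzhou-zmf.aliyuncs.com")
      .appName("KmeansModelSaveToOss")
      .getOrCreate()
    val sc = spark.sparkContext
    // Toy training data: three 2-dimensional points.
    val points = sc.parallelize(Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(1.0, 1.0), Vectors.dense(9.0, 8.0)))
    val model = KMeans.train(points, 2, 20) // k = 2, 20 iterations
    // your-bucket is a placeholder for an OSS bucket that you own.
    model.save(sc, "oss://your-bucket/kmeans-model")
    spark.stop()
  }
}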
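
Similarly, the OSS UnstructuredData example reads raw files from OSS. A minimal Spark-2.x sketch follows; oss://your-bucket/your-path is a placeholder for real OSS data.
import org.apache.spark.sql.SparkSession

object SparkUnstructuredDataCompute {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .config("spark.hadoop.fs.oss.accessKeyId", "***")
      .config("spark.hadoop.fs.oss.accessKeySecret", "***")
      .config("spark.hadoop.fs.oss.endpoint", "oss-cn-hangzhou-zmf.aliyuncs.com")
      .appName("SparkUnstructuredDataCompute")
      .getOrCreate()
    // Read unstructured text data directly from OSS and count the lines.
    val data = spark.sparkContext.textFile("oss://your-bucket/your-path")
    println(s"line count: ${data.count()}")
    spark.stop()
  }
}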