This topic describes how to configure Spark-2.x dependencies and provides examples.

Configure dependencies for Spark-2.x

If you want to submit your Spark-2.x application by using the Spark on MaxCompute client, you must add the following dependencies to the pom.xml file. For a complete example of the pom.xml file, see the MaxCompute-Spark project.
<properties>
    <spark.version>2.3.0</spark.version>
    <cupid.sdk.version>3.3.8-public</cupid.sdk.version>
    <scala.version>2.11.8</scala.version>
    <scala.binary.version>2.11</scala.binary.version>
</properties>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_${scala.binary.version}</artifactId>
    <version>${spark.version}</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_${scala.binary.version}</artifactId>
    <version>${spark.version}</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-mllib_${scala.binary.version}</artifactId>
    <version>${spark.version}</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_${scala.binary.version}</artifactId>
    <version>${spark.version}</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>com.aliyun.odps</groupId>
    <artifactId>cupid-sdk</artifactId>
    <version>${cupid.sdk.version}</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>com.aliyun.odps</groupId>
    <artifactId>hadoop-fs-oss</artifactId>
    <version>${cupid.sdk.version}</version>
</dependency>
<dependency>
    <groupId>com.aliyun.odps</groupId>
    <artifactId>odps-spark-datasource_${scala.binary.version}</artifactId>
    <version>${cupid.sdk.version}</version>
</dependency>
<dependency>
    <groupId>org.scala-lang</groupId>
    <artifactId>scala-library</artifactId>
    <version>${scala.version}</version>
</dependency>
<dependency>
    <groupId>org.scala-lang</groupId>
    <artifactId>scala-actors</artifactId>
    <version>${scala.version}</version>
</dependency>
In the preceding code, set the scope parameter as follows:
  • Set it to provided for all packages released by the Apache Spark community, such as spark-core and spark-sql. This keeps them out of the job JAR, because the MaxCompute cluster supplies them at runtime.
  • Set it to compile for the odps-spark-datasource module.
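The submit commands in the following examples reference a *-shaded.jar, so the project also needs a shading step in its build section. The sketch below is a minimal maven-shade-plugin configuration that would produce such a JAR; the plugin version is illustrative, and the pom.xml in the MaxCompute-Spark project is the authoritative reference:

```xml
<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <version>3.2.4</version>
      <executions>
        <execution>
          <phase>package</phase>
          <goals><goal>shade</goal></goals>
          <configuration>
            <!-- Attach the shaded JAR with the "-shaded" classifier used in the examples -->
            <shadedArtifactAttached>true</shadedArtifactAttached>
            <shadedClassifierName>shaded</shadedClassifierName>
          </configuration>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
```

Because the Spark packages above use the provided scope, they are excluded from the shaded JAR automatically; only compile-scoped dependencies such as odps-spark-datasource are bundled.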

WordCount example

  • Detailed code
  • How to submit
    cd /path/to/MaxCompute-Spark/spark-2.x
    mvn clean package
    
    # For the configuration of the environment variables in the spark-defaults.conf file, see Set up a Spark on MaxCompute development environment.
    cd $SPARK_HOME
    bin/spark-submit --master yarn-cluster --class \
        com.aliyun.odps.spark.examples.WordCount \
        /path/to/MaxCompute-Spark/spark-2.x/target/spark-examples_2.11-1.0.0-SNAPSHOT-shaded.jar
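The spark-submit step above reads MaxCompute account settings from the spark-defaults.conf file. A minimal sketch is shown below; the property names follow the Spark on MaxCompute setup guide, and all values are placeholders that you must replace with your own project name, AccessKey pair, and endpoint:

```
# $SPARK_HOME/conf/spark-defaults.conf -- placeholder values
spark.hadoop.odps.project.name = your_project_name
spark.hadoop.odps.access.id    = your_access_key_id
spark.hadoop.odps.access.key   = your_access_key_secret
# Example public endpoint; use the endpoint of your own region.
spark.hadoop.odps.end.point    = http://service.cn.maxcompute.aliyun.com/api
```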

Example of Spark SQL on a MaxCompute table

  • Detailed code
  • How to submit
    # If the table you specify in the code cannot be found in the MaxCompute project, a "Table Not Found" error will be returned.
    # You can develop a Spark SQL application for the target table with reference to various APIs in the code.
    # Step 1: Build aliyun-cupid-sdk.
    # Step 2: Configure spark-defaults.conf.
    # Step 3: Submit the job.
    bin/spark-submit --master yarn-cluster --class \
        com.aliyun.odps.spark.examples.sparksql.SparkSQL \
        ${path to aliyun-cupid-sdk}/spark/spark-2.x/spark-examples/target/spark-examples_2.11-version-shaded.jar

GraphX PageRank example

  • Detailed code
  • How to submit
    cd /path/to/MaxCompute-Spark/spark-2.x
    mvn clean package
    # For more information about the configuration of the environment variables in the spark-defaults.conf file, see Set up a Spark on MaxCompute development environment.
    cd $SPARK_HOME
    bin/spark-submit --master yarn-cluster --class \
        com.aliyun.odps.spark.examples.graphx.PageRank \
        /path/to/MaxCompute-Spark/spark-2.x/target/spark-examples_2.11-1.0.0-SNAPSHOT-shaded.jar

MLlib KMeans-on-OSS example

For more information about how to configure spark.hadoop.fs.oss.ststoken.roleArn and spark.hadoop.fs.oss.endpoint, see OSS access notes.

  • Detailed code
  • How to submit
    # Edit code
    val modelOssDir = "oss://bucket/kmeans-model" // Enter the specific OSS bucket path.
    val spark = SparkSession
      .builder()
      .config("spark.hadoop.fs.oss.credentials.provider", "org.apache.hadoop.fs.aliyun.oss.AliyunStsTokenCredentialsProvider")
      .config("spark.hadoop.fs.oss.ststoken.roleArn", "acs:ram::****:role/aliyunodpsdefaultrole")
      .config("spark.hadoop.fs.oss.endpoint", "oss-cn-hangzhou-zmf.aliyuncs.com")
      .appName("KmeansModelSaveToOss")
      .getOrCreate()
    
    cd /path/to/MaxCompute-Spark/spark-2.x
    mvn clean package
    # For more information about configuration of the environment variables in the spark-defaults.conf file, see Set up a Spark on MaxCompute development environment.
    cd $SPARK_HOME
    bin/spark-submit --master yarn-cluster --class \
        com.aliyun.odps.spark.examples.mllib.KmeansModelSaveToOss \
        /path/to/MaxCompute-Spark/spark-2.x/target/spark-examples_2.11-1.0.0-SNAPSHOT-shaded.jar
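The example trains a KMeans model and saves it to OSS. The clustering algorithm itself, Lloyd's k-means, can be sketched in plain Python as follows; the seeding strategy and data here are illustrative (MLlib uses smarter initialization such as k-means||):

```python
# Plain-Python sketch of Lloyd's k-means, the algorithm behind MLlib's KMeans;
# the naive first-k seeding and the fixed iteration count are illustrative.
def kmeans(points, k, iters=10):
    """points: list of equal-length tuples; returns k centroid tuples."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    def mean(cluster):
        return tuple(sum(c) / len(cluster) for c in zip(*cluster))

    centroids = list(points[:k])  # naive seeding: the first k points
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist2(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return centroids
```

MLlib distributes the assignment step across executors and aggregates the cluster sums to update centroids, which is why the example is submitted as a cluster job rather than run locally.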

OSS UnstructuredData example

For more information about how to configure spark.hadoop.fs.oss.ststoken.roleArn and spark.hadoop.fs.oss.endpoint, see OSS access notes.

  • Detailed code
  • How to submit
    # Edit code
    val pathIn = "oss://bucket/inputdata/" // Enter the specific OSS bucket path.
    val spark = SparkSession
      .builder()
      .config("spark.hadoop.fs.oss.credentials.provider", "org.apache.hadoop.fs.aliyun.oss.AliyunStsTokenCredentialsProvider")
      .config("spark.hadoop.fs.oss.ststoken.roleArn", "acs:ram::****:role/aliyunodpsdefaultrole")
      .config("spark.hadoop.fs.oss.endpoint", "oss-cn-hangzhou-zmf.aliyuncs.com")
      .appName("SparkUnstructuredDataCompute")
      .getOrCreate()
    
    cd /path/to/MaxCompute-Spark/spark-2.x
    mvn clean package
    # For more information about the configuration of the environment variables in the spark-defaults.conf file, see Set up a Spark on MaxCompute development environment.
    cd $SPARK_HOME
    bin/spark-submit --master yarn-cluster --class \
        com.aliyun.odps.spark.examples.oss.SparkUnstructuredDataCompute \
        /path/to/MaxCompute-Spark/spark-2.x/target/spark-examples_2.11-1.0.0-SNAPSHOT-shaded.jar

MaxCompute table I/O example

  • Detailed code
  • How to submit
    cd /path/to/MaxCompute-Spark/spark-2.x
    mvn clean package
    # For more information about the configuration of the environment variables in the spark-defaults.conf file, see Set up a Spark on MaxCompute development environment.
    cd $SPARK_HOME
    bin/spark-submit --master yarn-cluster --class \
        com.aliyun.odps.spark.examples.sparksql.SparkSQL \
        /path/to/MaxCompute-Spark/spark-2.x/target/spark-examples_2.11-1.0.0-SNAPSHOT-shaded.jar

Example of PySpark I/O on MaxCompute tables

  • Detailed code
  • How to submit
    # For more information about the configuration of the environment variables in the spark-defaults.conf file, see Set up a Spark on MaxCompute development environment.
    cd $SPARK_HOME
    bin/spark-submit --master yarn-cluster --jars /path/to/odps-spark-datasource_2.11-3.3.8-public.jar \
        /path/to/MaxCompute-Spark/spark-2.x/src/main/python/spark_sql.py

PySpark writing to OSS example

  • Detailed code
  • How to submit
    # For more information about the configuration of the environment variables in the spark-defaults.conf file, see Set up a Spark on MaxCompute development environment.
    # For more information about OSS configuration, see OSS access notes.
    
    cd $SPARK_HOME
    # Compile the Spark-2.x project first to obtain spark-examples_2.11-1.0.0-SNAPSHOT-shaded.jar.
    bin/spark-submit --master yarn-cluster --jars /path/to/spark-examples_2.11-1.0.0-SNAPSHOT-shaded.jar \
        /path/to/MaxCompute-Spark/spark-2.x/src/main/python/spark_oss.py

Example of Spark Streaming with LogHub

  • Detailed code
  • How to submit
    # For more information about the configuration of the environment variables in the spark-defaults.conf file, see Set up a Spark on MaxCompute development environment.
    cd $SPARK_HOME
    bin/spark-submit --master yarn-cluster --class \
        com.aliyun.odps.spark.examples.streaming.loghub.LogHubStreamingDemo \
        /path/to/MaxCompute-Spark/spark-2.x/target/spark-examples_2.11-1.0.0-SNAPSHOT-shaded.jar

Example of Spark Streaming with DataHub

  • Detailed code
  • How to submit
    # For more information about the configuration of the environment variables in the spark-defaults.conf file, see Set up a Spark on MaxCompute development environment.
    cd $SPARK_HOME
    bin/spark-submit --master yarn-cluster --class \
        com.aliyun.odps.spark.examples.streaming.datahub.DataHubStreamingDemo \
        /path/to/MaxCompute-Spark/spark-2.x/target/spark-examples_2.11-1.0.0-SNAPSHOT-shaded.jar

Example of Spark Streaming with Kafka

  • Detailed code
  • How to submit
    # For more information about the configuration of the environment variables in the spark-defaults.conf file, see Set up a Spark on MaxCompute development environment.
    cd $SPARK_HOME
    bin/spark-submit --master yarn-cluster --class \
        com.aliyun.odps.spark.examples.streaming.kafka.KafkaStreamingDemo \
        /path/to/MaxCompute-Spark/spark-2.x/target/spark-examples_2.11-1.0.0-SNAPSHOT-shaded.jar
Note: For more information, see the MaxCompute-Spark project.