This topic describes how to prepare for Spark development.

Install EMR SDK

You can use one of the following methods to install EMR SDK:
  • Use the JAR package in Eclipse. Follow these steps:
    1. Download the dependencies required by EMR from the Maven repository.
    2. Copy the required JAR packages to your project directory. Select an SDK version based on the Spark version of the cluster in which your jobs run: the SDK built for Scala 2.10 is recommended for Spark 1.x, and the SDK built for Scala 2.11 is recommended for Spark 2.x.
    3. Right-click the project name in Eclipse and select Properties. In the dialog box that appears, click Java Build Path and then click Add JARs.
    4. Select the JAR package you downloaded and click OK.

      After you complete the preceding steps, you can perform read/write operations on Object Storage Service (OSS), Log Service, Message Service (MNS), Message Queue (MQ), Tablestore, and MaxCompute data.

  • Create a Maven project and add the following dependencies:
        <dependency>
            <groupId>com.aliyun.emr</groupId>
            <artifactId>emr-tablestore</artifactId>
            <version>1.9.0</version>
        </dependency>
        <dependency>
            <groupId>com.aliyun.emr</groupId>
            <artifactId>emr-mns_2.11</artifactId>
            <version>1.9.0</version>
        </dependency>
        <dependency>
            <groupId>com.aliyun.emr</groupId>
            <artifactId>emr-logservice_2.11</artifactId>
            <version>1.9.0</version>
        </dependency>
        <dependency>
            <groupId>com.aliyun.emr</groupId>
            <artifactId>emr-maxcompute_2.11</artifactId>
            <version>1.9.0</version>
        </dependency>
        <dependency>
            <groupId>com.aliyun.emr</groupId>
            <artifactId>emr-ons_2.11</artifactId>
            <version>1.9.0</version>
        </dependency>
        <dependency>
            <groupId>com.aliyun.emr</groupId>
            <artifactId>emr-datahub_2.11</artifactId>
            <version>1.9.0</version>
        </dependency>

Debug Spark code locally

When you debug the Spark code that is used to read or write OSS data, you must configure a SparkConf object. Set spark.hadoop.mapreduce.job.run-local to true and retain the default values of other parameters. Example:
    import org.apache.spark.{SparkConf, SparkContext}

    // Run locally with four threads and access OSS through the JindoOssFileSystem implementation.
    val conf = new SparkConf().setAppName(getAppName).setMaster("local[4]")
    conf.set("spark.hadoop.fs.oss.impl", "com.aliyun.emr.fs.oss.JindoOssFileSystem")
    conf.set("spark.hadoop.mapreduce.job.run-local", "true")
    val sc = new SparkContext(conf)
    val data = sc.textFile("oss://...")
    println(s"count: ${data.count()}")
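
Writing OSS data during local debugging works the same way. The following is a minimal sketch that assumes the conf and sc defined above; the output path is a placeholder, and local writes also require the fs.oss.buffer.dirs setting described in the usage notes below.

    // Placeholder output path; replace it with your own bucket and prefix.
    data.filter(_.nonEmpty).saveAsTextFile("oss://your-bucket/spark-output")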

Usage notes

  • Configure third-party dependencies

    To allow EMR to access Alibaba Cloud data sources such as OSS and MaxCompute, you need to configure the required third-party dependencies for your Spark job.

    For more information about how to add or remove third-party dependencies, see the pom file.

  • Specify an OSS output directory

    Add the fs.oss.buffer.dirs parameter to the local Hadoop configuration file. Set this parameter to a local directory. If you do not specify this parameter, a null pointer exception occurs when you run a Spark job to write OSS data.
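
    For local debugging, one way to set this parameter is on the SparkConf by using the spark.hadoop. prefix, which Spark copies into the Hadoop configuration. The following is a sketch that assumes the conf from the debugging example above; the directory is a placeholder:

        // Placeholder local directory used to buffer OSS writes; it must exist and be writable.
        conf.set("spark.hadoop.fs.oss.buffer.dirs", "/tmp/oss-buffer")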

  • Clean up junk data

    When a Spark job fails, the system does not automatically delete the data that the job has generated. Check the OSS output directory and delete any files that the failed job wrote. Then, log on to the OSS console, click the target bucket, and delete any remaining parts on the Parts tab.
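
    To remove a stale output directory before rerunning a job, you can also use the Hadoop FileSystem API instead of the console. The following is a minimal sketch that assumes the OSS file system is configured as in the debugging example and uses a placeholder output path:

        import java.net.URI
        import org.apache.hadoop.fs.{FileSystem, Path}

        // Placeholder output path; replace it with your actual OSS output directory.
        val outputPath = "oss://your-bucket/spark-output"
        val fs = FileSystem.get(new URI(outputPath), sc.hadoopConfiguration)
        // Recursively delete the directory if it exists.
        if (fs.exists(new Path(outputPath))) {
          fs.delete(new Path(outputPath), true)
        }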

  • Use PySpark

    For information about how to create a PySpark job, see aliyun-emapreduce-sdk.