Development preparation

Last Updated: Apr 19, 2017

Install E-MapReduce SDK

You can install the E-MapReduce SDK using one of the following methods.

Method 1: Use the JAR packages directly in Eclipse. The steps are as follows:

  1. Download the E-MapReduce SDK from the Alibaba Cloud official website.

  2. Unzip the package and copy “emr-sdk_2.10-1.1.3.1.jar” and “emr-core-1.1.3.1.jar” to your project folder.

  3. In Eclipse, right-click the project name and select Properties > Java Build Path > Add JARs.

  4. Select the SDK you downloaded.

  5. You can then read and write OSS, LogService, MNS, ONS, and ODPS data in the project; a minimal classpath check is sketched after these steps.
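The following is a minimal sketch of such a check, assuming the emr-core JAR is on the build path; the object name SdkCheck is made up for illustration, and the imported class is the OSS file system implementation referenced in the Spark configuration later in this article:

  import com.aliyun.fs.oss.nat.NativeOssFileSystem

  // This compiles and prints the class name only if emr-core is on the build path.
  object SdkCheck {
    def main(args: Array[String]): Unit = {
      println(classOf[NativeOssFileSystem].getName)
    }
  }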

Method 2: Use Maven and add the following dependencies to your project:

  <!-- Support OSS data source -->
  <dependency>
      <groupId>com.aliyun.emr</groupId>
      <artifactId>emr-core</artifactId>
      <version>1.1.3.1</version>
  </dependency>
  <!-- Support MNS, ONS, LogService and ODPS data sources -->
  <dependency>
      <groupId>com.aliyun.emr</groupId>
      <artifactId>emr-sdk_2.10</artifactId>
      <version>1.1.3.1</version>
  </dependency>

Local debugging of Spark code

Note: The “spark.hadoop.mapreduce.job.run-local” configuration item applies only when you need to debug Spark code that reads and writes OSS data locally. In other scenarios, keep the default settings.

To debug Spark code that reads and writes OSS data locally, configure the SparkConf and set “spark.hadoop.mapreduce.job.run-local” to “true”, as in the code below:

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf().setAppName(getAppName).setMaster("local[4]")
  // Use the E-MapReduce OSS file system implementation.
  conf.set("spark.hadoop.fs.oss.impl", "com.aliyun.fs.oss.nat.NativeOssFileSystem")
  // Enable local reading and writing of OSS data for debugging.
  conf.set("spark.hadoop.mapreduce.job.run-local", "true")
  val sc = new SparkContext(conf)
  val data = sc.textFile("oss://...")
  println(s"count: ${data.count()}")

Instructions for third-party dependencies

To operate on Alibaba Cloud data sources (including OSS and ODPS) on E-MapReduce, your job may need to depend on some third-party packages.

You can refer to this pom file to add or delete the required third-party dependency packages.

Garbage cleaning

If a Spark job fails, the data it has already generated cannot be cleared automatically. After a Spark job fails, check whether the OSS output directory contains leftover files, and also check OSS Fragment Management for uncommitted fragments. If any exist, clear them in a timely manner.
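The following is a minimal sketch of checking for leftover output files from a local debugging session, reusing the OSS configuration from the example above; the bucket and output path are hypothetical placeholders:

  import org.apache.hadoop.fs.Path
  import org.apache.spark.{SparkConf, SparkContext}

  // List any files left in the job's OSS output directory after a failure.
  val conf = new SparkConf().setAppName("CleanupCheck").setMaster("local[4]")
  conf.set("spark.hadoop.fs.oss.impl", "com.aliyun.fs.oss.nat.NativeOssFileSystem")
  conf.set("spark.hadoop.mapreduce.job.run-local", "true")
  val sc = new SparkContext(conf)
  val outputDir = new Path("oss://your-bucket/output/") // hypothetical path
  val fs = outputDir.getFileSystem(sc.hadoopConfiguration)
  if (fs.exists(outputDir)) {
    fs.listStatus(outputDir).foreach(status => println(status.getPath))
  }
  sc.stop()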
