This sample project is a complete, compilable, and runnable project that includes sample code for MapReduce, Pig, Hive, and Spark.

Sample project

See the open source project. The samples are as follows:
  • MapReduce

    WordCount: Counts the number of words.

  • Hive

sample.hive: A simple query against a table.

  • Pig

sample.pig: Processes OSS data with Pig.

  • Spark
    • SparkPi: Calculates Pi.
    • SparkWordCount: Counts the number of words.
    • LinearRegression: Linear regression.
    • OSSSample: OSS sample.
    • ODPSSample: ODPS sample.
    • MNSSample: MNS sample.
    • LoghubSample: Loghub sample.

Dependencies

  • Test data (under the data directory)
    • The_Sorrows_of_Young_Werther.txt: Used as the input data of WordCount (MapReduce/Spark).
    • patterns.txt: Characters to be filtered out by the WordCount (MapReduce) job.
    • u.data: Test data for the table used by the sample.hive script.
    • abalone: Test data for the linear regression algorithm.
  • The JAR dependencies (under the lib directory)

tutorial.jar: The JAR dependency required by the sample.pig job.

Preparations

This project provides test data that you can upload to OSS and use directly. For the other samples, such as MaxCompute, MNS, ONS, and Log Service, you must prepare the corresponding data yourself.

Concepts

  • OSSURI: oss://accessKeyId:accessKeySecret@bucket.endpoint/a/b/c.txt. Specifies an input or output data source on OSS, similar to an hdfs:// URL (see the example below).

  • The combination of an AccessKey ID and AccessKey Secret is your credential for accessing Alibaba Cloud APIs. You can obtain it in the Alibaba Cloud console.
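
An OSSURI for the WordCount input file might look like the following; the bucket name mybucket, the Hangzhou-region endpoint, and the AccessKey values are placeholders for illustration:

  oss://<yourAccessKeyId>:<yourAccessKeySecret>@mybucket.oss-cn-hangzhou.aliyuncs.com/data/The_Sorrows_of_Young_Werther.txt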

Run jobs in clusters

  • Spark

    • SparkWordCount:
      spark-submit --class SparkWordCount examples-1.0-SNAPSHOT-shaded.jar \
        <inputPath> <outputPath> <numPartition>

      Parameters are described as follows:

      • inputPath: Data input path.

      • outputPath: Data output path.

      • numPartition: The number of RDD partitions of the input data.
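
      For example, with the test data uploaded to a hypothetical bucket mybucket (the bucket, endpoint, and AccessKey values below are placeholders):

      spark-submit --class SparkWordCount examples-1.0-SNAPSHOT-shaded.jar \
        oss://<accessKeyId>:<accessKeySecret>@mybucket.oss-cn-hangzhou.aliyuncs.com/data/The_Sorrows_of_Young_Werther.txt \
        oss://<accessKeyId>:<accessKeySecret>@mybucket.oss-cn-hangzhou.aliyuncs.com/output \
        10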

    • SparkPi: spark-submit --class SparkPi examples-1.0-SNAPSHOT-shaded.jar

    • OSSSample:
      spark-submit --class OSSSample examples-1.0-SNAPSHOT-shaded.jar \
        <inputPath> <numPartition>

      Parameters are described as follows:

      • inputPath: Data input path.

      • numPartition: The number of RDD partitions of the input data.
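
      For example, reading the WordCount test file from a hypothetical bucket mybucket (all bucket, endpoint, and AccessKey values are placeholders):

      spark-submit --class OSSSample examples-1.0-SNAPSHOT-shaded.jar \
        oss://<accessKeyId>:<accessKeySecret>@mybucket.oss-cn-hangzhou.aliyuncs.com/data/The_Sorrows_of_Young_Werther.txt 10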

    • ONSSample:
      spark-submit --class ONSSample examples-1.0-SNAPSHOT-shaded.jar \
        <accessKeyId> <accessKeySecret> <consumerId> <topic> <subExpression> <parallelism>

      Parameters are described as follows:

      • accessKeyId: Alibaba Cloud AccessKey ID.

      • accessKeySecret: Alibaba Cloud AccessKey Secret.

      • consumerId: See Consumer ID description.

      • topic: The topic to subscribe to. Each message queue has a topic.

      • subExpression: See Message filtering.

      • parallelism: Specifies the number of receivers to consume messages in the queue.
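
      For example (all values are placeholders; replace them with your own ONS configuration, where a subExpression of '*' subscribes to all message tags):

      spark-submit --class ONSSample examples-1.0-SNAPSHOT-shaded.jar \
        <yourAccessKeyId> <yourAccessKeySecret> CID_SAMPLE SampleTopic '*' 2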

    • ODPSSample:
      spark-submit --class ODPSSample examples-1.0-SNAPSHOT-shaded.jar \
        <accessKeyId> <accessKeySecret> <envType> <project> <table> <numPartitions>

      Parameters are described as follows:

      • accessKeyId: Alibaba Cloud AccessKey ID.

      • accessKeySecret: Alibaba Cloud AccessKey Secret.

      • envType: 0 indicates the public network, and 1 indicates the private network. Select 0 for local debugging, and select 1 for execution on E-MapReduce.

      • project: The name of the ODPS project. See ODPS Quick Start.

      • table: The name of the table in the ODPS project.

      • numPartitions: The number of RDD partitions of the input data.
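
      For example, reading a hypothetical table sample_table in a hypothetical project sample_project over the public network (envType 0) for local debugging:

      spark-submit --class ODPSSample examples-1.0-SNAPSHOT-shaded.jar \
        <yourAccessKeyId> <yourAccessKeySecret> 0 sample_project sample_table 10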

    • MNSSample:
      spark-submit --class MNSSample examples-1.0-SNAPSHOT-shaded.jar \
        <queueName> <accessKeyId> <accessKeySecret> <endpoint>

      Parameters are described as follows:

      • queueName: Queue name. See MNS glossary.

      • accessKeyId: Alibaba Cloud AccessKey ID.

      • accessKeySecret: Alibaba Cloud AccessKey Secret.

      • endpoint: The endpoint for accessing the queue data.
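
      For example, with a hypothetical queue sample-queue in the China (Hangzhou) region (the account ID in the endpoint is a placeholder; use the endpoint shown in your MNS console):

      spark-submit --class MNSSample examples-1.0-SNAPSHOT-shaded.jar \
        sample-queue <yourAccessKeyId> <yourAccessKeySecret> http://<yourAccountId>.mns.cn-hangzhou.aliyuncs.com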

    • LoghubSample:
      spark-submit --class LoghubSample examples-1.0-SNAPSHOT-shaded.jar \
        <sls project> <sls logstore> <loghub group name> <sls endpoint> \
        <access key id> <access key secret> <batch interval seconds>

      Parameters are described as follows:

      • sls project: The name of the Log Service project.

      • sls logstore: Logstore name.

      • loghub group name: The name of the consumer group that the job uses to consume Logstore data. You can specify any name. Jobs that share the same sls project, sls logstore, and group name consume the Logstore data collaboratively; jobs with different group names consume it independently.

      • sls endpoint: See Service endpoint.

      • accessKeyId: Alibaba Cloud AccessKey ID.

      • accessKeySecret: Alibaba Cloud AccessKey Secret.

      • batch interval seconds: The batch interval of the Spark Streaming job, in seconds.
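
      For example, with a hypothetical project sample-project and Logstore sample-logstore in the China (Hangzhou) region, consumed by a group named sample-group at a 5-second batch interval (credentials are placeholders):

      spark-submit --class LoghubSample examples-1.0-SNAPSHOT-shaded.jar \
        sample-project sample-logstore sample-group cn-hangzhou.log.aliyuncs.com \
        <yourAccessKeyId> <yourAccessKeySecret> 5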

    • LinearRegression:
      spark-submit --class LinearRegression examples-1.0-SNAPSHOT-shaded.jar \
        <inputPath> <numPartitions>

      Parameters are described as follows:

      • inputPath: Data input path.

      • numPartitions: The number of RDD partitions of the input data.
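
      For example, using the abalone test data uploaded to a hypothetical bucket mybucket (all bucket, endpoint, and AccessKey values are placeholders):

      spark-submit --class LinearRegression examples-1.0-SNAPSHOT-shaded.jar \
        oss://<accessKeyId>:<accessKeySecret>@mybucket.oss-cn-hangzhou.aliyuncs.com/data/abalone 10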

  • MapReduce

    • WordCount:
      hadoop jar examples-1.0-SNAPSHOT-shaded.jar WordCount \
        -Dwordcount.case.sensitive=true <inputPath> <outputPath> -skip <patternPath>

      Parameters are described as follows:

      • inputPath: Data input path.

      • outputPath: Data output path.

      • patternPath: The file that contains characters to be filtered. You can use data/patterns.txt.
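
      For example, using the provided test data and pattern file from a hypothetical bucket mybucket (all bucket, endpoint, and AccessKey values are placeholders):

      hadoop jar examples-1.0-SNAPSHOT-shaded.jar WordCount -Dwordcount.case.sensitive=true \
        oss://<accessKeyId>:<accessKeySecret>@mybucket.oss-cn-hangzhou.aliyuncs.com/data/The_Sorrows_of_Young_Werther.txt \
        oss://<accessKeyId>:<accessKeySecret>@mybucket.oss-cn-hangzhou.aliyuncs.com/output \
        -skip oss://<accessKeyId>:<accessKeySecret>@mybucket.oss-cn-hangzhou.aliyuncs.com/data/patterns.txt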

  • Hive

    • hive -f sample.hive -hiveconf inputPath=<inputPath>

      Parameters are described as follows:

      • inputPath: Data input path.
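
      For example, assuming the u.data file was uploaded to a hypothetical bucket mybucket (all values are placeholders; point inputPath at wherever you uploaded the test data):

      hive -f sample.hive -hiveconf inputPath=oss://<accessKeyId>:<accessKeySecret>@mybucket.oss-cn-hangzhou.aliyuncs.com/data/u.data
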
  • Pig

      pig -x mapreduce -f sample.pig -param tutorial=<tutorialJarPath> \
        -param input=<inputPath> -param result=<resultPath>

      Parameters are described as follows:

      • tutorialJarPath: The JAR dependency. You can use lib/tutorial.jar.

      • inputPath: Data input path.

      • resultPath: Data output path.
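
      For example, with tutorial.jar and the input data uploaded to a hypothetical bucket mybucket (all bucket, endpoint, AccessKey, and path values are placeholders):

      pig -x mapreduce -f sample.pig \
        -param tutorial=oss://<accessKeyId>:<accessKeySecret>@mybucket.oss-cn-hangzhou.aliyuncs.com/lib/tutorial.jar \
        -param input=oss://<accessKeyId>:<accessKeySecret>@mybucket.oss-cn-hangzhou.aliyuncs.com/data/input \
        -param result=oss://<accessKeyId>:<accessKeySecret>@mybucket.oss-cn-hangzhou.aliyuncs.com/result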

Notice
  • If you run the samples on E-MapReduce, upload the test data and JAR dependencies to OSS. Paths must follow the OSSURI format defined above.
  • If you run the samples in your own cluster, you can store the test data and JAR dependencies locally in the cluster.

Run Locally

Here we describe how to run a Spark program locally against Alibaba Cloud data sources such as OSS. To debug and run a program locally, we recommend using a development tool such as IntelliJ IDEA or Eclipse, especially on Windows; otherwise, you must configure the Hadoop and Spark runtime environments on the Windows machine yourself.
  • IntelliJ IDEA
    • Preparations

Install IntelliJ IDEA, Maven, the Maven plugin for IntelliJ IDEA, Scala, and the Scala plugin for IntelliJ IDEA.

    • Development process
      1. Double-click to enter SparkWordCount.scala.

      2. Open the job configuration page, as shown in the following figure.

      3. Select SparkWordCount and enter the required job parameters in the job parameter box (see the example parameters after this list).

      4. Click OK.
      5. Click Run to run the job.

      6. View the job logs.
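
        For step 3, the job parameters for SparkWordCount could be OSSURIs (as defined in Concepts) followed by a partition count; all bucket, endpoint, and AccessKey values below are placeholders:

        oss://<accessKeyId>:<accessKeySecret>@mybucket.oss-cn-hangzhou.aliyuncs.com/data/The_Sorrows_of_Young_Werther.txt oss://<accessKeyId>:<accessKeySecret>@mybucket.oss-cn-hangzhou.aliyuncs.com/output 10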

  • Scala IDE for Eclipse
    • Preparations

Install the Scala IDE for Eclipse, Maven, and the Maven plugin for Eclipse.

    • Development process
      1. Import a project as described in the figure below:

      2. The shortcut for Run as Maven build is Alt + Shift + X, M. You can also right-click the project name and choose Run As > Maven build.
      3. After the job compiles, right-click it and select Run Configuration to open the configuration page.
      4. On the configuration page, select Scala Application and configure the Main Class and parameters of the job, as shown in the following figure:

      5. Click Run.
      6. View the output log of the console, as shown in the following figure: