Usage instructions for demo projects

Last Updated: Jul 04, 2017

Demo projects

This is a complete project that you can compile and run, including sample code for MapReduce, Pig, Hive, and Spark. Review it at demo project. Details are as follows:

MapReduce

  • WordCount: Word count calculation.

Hive

  • sample.hive: Simple query of tables.

Pig

  • sample.pig: OSS data instances processed by Pig.

Spark

  • SparkPi: Pi calculation.

  • SparkWordCount: Word count calculation.

  • LinearRegression: Linear regression.

  • OSSSample: OSS usage sample.

  • ONSSample: ONS usage sample.

  • ODPSSample: ODPS usage sample.

  • MNSSample: MNS usage sample.

  • LoghubSample: Loghub usage sample.

Dependency resources

Testing data (in data directory):

  • The_Sorrows_of_Young_Werther.txt: Input data for the WordCount samples (MapReduce/Spark).

  • patterns.txt: Filter patterns for the WordCount (MapReduce) job.

  • u.data: Test table data for the sample.hive script.

  • abalone: Test data for the linear regression sample.

Dependent jar package (in lib directory)

  • tutorial.jar: The dependent jar package for the sample.pig job.

Preparation

This project provides some testing data that you can upload to OSS for use. For other samples, such as LogService, you can refer to Create a project.

Basic concepts

  • OSSURI: oss://accessKeyId:accessKeySecret@bucket.endpoint/a/b/c.txt. Used for specifying input/output data sources, similar to hdfs://.

  • Alibaba Cloud AccessKeyId/AccessKeySecret: The API credentials that you use to access Alibaba Cloud. You can obtain them from here.
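
A quick way to check that an OSSURI is assembled correctly is to build it from its parts. The sketch below uses hypothetical credential, bucket, and endpoint values; substitute your own.

```shell
# All values below are hypothetical placeholders.
ACCESS_KEY_ID="MyAccessKeyId"
ACCESS_KEY_SECRET="MyAccessKeySecret"
BUCKET="my-bucket"
ENDPOINT="oss-cn-hangzhou.aliyuncs.com"
OBJECT_PATH="data/The_Sorrows_of_Young_Werther.txt"

# Assemble the URI in the oss://accessKeyId:accessKeySecret@bucket.endpoint/path form.
INPUT_URI="oss://${ACCESS_KEY_ID}:${ACCESS_KEY_SECRET}@${BUCKET}.${ENDPOINT}/${OBJECT_PATH}"
echo "$INPUT_URI"
```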

Run the cluster

  • Spark

    • SparkWordCount: spark-submit --class SparkWordCount examples-1.0-SNAPSHOT-shaded.jar <inputPath> <outputPath> <numPartition>

      Parameters are described as follows:

      • inputPath: Input data path.

      • outputPath: Output path.

      • numPartition: The number of RDD partitions of input data.
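
As a concrete illustration, the sketch below assembles a SparkWordCount submission using hypothetical OSS input/output paths and 10 partitions. It echoes the command rather than executing it, so nothing is actually submitted.

```shell
# Hypothetical OSS paths -- replace with your own bucket and credentials.
INPUT="oss://MyAccessKeyId:MyAccessKeySecret@my-bucket.oss-cn-hangzhou.aliyuncs.com/data/The_Sorrows_of_Young_Werther.txt"
OUTPUT="oss://MyAccessKeyId:MyAccessKeySecret@my-bucket.oss-cn-hangzhou.aliyuncs.com/output/wordcount"
NUM_PARTITION=10

# Assemble the spark-submit invocation; drop the echo to actually submit.
CMD="spark-submit --class SparkWordCount examples-1.0-SNAPSHOT-shaded.jar $INPUT $OUTPUT $NUM_PARTITION"
echo "$CMD"
```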

    • SparkPi: spark-submit --class SparkPi examples-1.0-SNAPSHOT-shaded.jar

    • OSSSample: spark-submit --class OSSSample examples-1.0-SNAPSHOT-shaded.jar <inputPath> <numPartition>

      Parameters are described as follows:

      • inputPath: Input data path.

      • numPartition: The number of RDD partitions of input data.

    • ONSSample: spark-submit --class ONSSample examples-1.0-SNAPSHOT-shaded.jar <accessKeyId> <accessKeySecret> <consumerId> <topic> <subExpression> <parallelism>

      Parameters are described as follows:

      • accessKeyId: Alibaba Cloud AccessKeyId.

      • accessKeySecret: Alibaba Cloud AccessKeySecret.

      • consumerId: ID of the consumer.

      • topic: Every message queue has a topic.

      • subExpression: Subexpression of the message.

      • parallelism: Specifies how many receivers are used to consume the queue message.

    • ODPSSample: spark-submit --class ODPSSample examples-1.0-SNAPSHOT-shaded.jar <accessKeyId> <accessKeySecret> <envType> <project> <table> <numPartitions>

      Parameters are described as follows:

      • accessKeyId: Alibaba Cloud AccessKeyId.

      • accessKeySecret: Alibaba Cloud AccessKeySecret.

      • envType: 0 indicates the Internet environment, and 1 indicates the Intranet environment. Select 0 for local debugging, and select 1 for execution on E-MapReduce.

      • project: See Spark + ODPS.

      • table: See Table.

      • numPartitions: The number of RDD partitions of input data.

    • MNSSample: spark-submit --class MNSSample examples-1.0-SNAPSHOT-shaded.jar <queueName> <accessKeyId> <accessKeySecret> <endpoint>

      Parameters are described as follows:

      • queueName: The queue name.

      • accessKeyId: Alibaba Cloud AccessKeyId.

      • accessKeySecret: Alibaba Cloud AccessKeySecret.

      • endpoint: The access address of queue data.

    • LoghubSample: spark-submit --class LoghubSample examples-1.0-SNAPSHOT-shaded.jar <sls project> <sls logstore> <loghub group name> <sls endpoint> <access key id> <access key secret> <batch interval seconds>

      Parameters are described as follows:

      • sls project: LogService project name.

      • sls logstore: Logstore name.

      • loghub group name: The name of the consumer group that the job uses to consume log data. The name itself is unrestricted. For the same project and Logstore, jobs that use the same group name consume the data collaboratively, while jobs with different group names consume the data independently of each other.

      • sls endpoint: See Service endpoint.

      • accessKeyId: Alibaba Cloud AccessKeyId.

      • accessKeySecret: Alibaba Cloud AccessKeySecret.

      • batch interval seconds: Intervals between Spark Streaming job batches. Unit: second.
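
The sketch below assembles a LoghubSample submission with all seven arguments in order. Every value is hypothetical, and the endpoint must match your LogService region; the command is echoed for inspection rather than executed.

```shell
# Hypothetical LogService values -- replace with your own.
SLS_PROJECT="my-sls-project"
SLS_LOGSTORE="my-logstore"
GROUP_NAME="my-consumer-group"
SLS_ENDPOINT="http://cn-hangzhou.log.aliyuncs.com"
ACCESS_KEY_ID="MyAccessKeyId"
ACCESS_KEY_SECRET="MyAccessKeySecret"
BATCH_INTERVAL=5

# Arguments follow the order given above: project, logstore, group,
# endpoint, credentials, then the batch interval in seconds.
CMD="spark-submit --class LoghubSample examples-1.0-SNAPSHOT-shaded.jar $SLS_PROJECT $SLS_LOGSTORE $GROUP_NAME $SLS_ENDPOINT $ACCESS_KEY_ID $ACCESS_KEY_SECRET $BATCH_INTERVAL"
echo "$CMD"
```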

    • LinearRegression: spark-submit --class LinearRegression examples-1.0-SNAPSHOT-shaded.jar <inputPath> <numPartitions>

      Parameters are described as follows:

      • inputPath: Input data.

      • numPartitions: The number of RDD partitions of input data.

  • MapReduce

    • WordCount: hadoop jar examples-1.0-SNAPSHOT-shaded.jar WordCount -Dwordcount.case.sensitive=true <inputPath> <outputPath> -skip <patternPath>

      Parameters are described as follows:

      • inputPath: Input data path.

      • outputPath: Output data path.

      • patternPath: Filtering character file. You can use data/patterns.txt.
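
The sketch below assembles the WordCount invocation with hypothetical OSS paths. Note the argument order: the generic -D option precedes the positional paths, and the job-specific -skip flag follows the output path. The command is echoed rather than executed.

```shell
# Hypothetical OSS paths -- replace with your own bucket and credentials.
INPUT="oss://MyAccessKeyId:MyAccessKeySecret@my-bucket.oss-cn-hangzhou.aliyuncs.com/data/The_Sorrows_of_Young_Werther.txt"
OUTPUT="oss://MyAccessKeyId:MyAccessKeySecret@my-bucket.oss-cn-hangzhou.aliyuncs.com/output/mr-wordcount"
PATTERNS="oss://MyAccessKeyId:MyAccessKeySecret@my-bucket.oss-cn-hangzhou.aliyuncs.com/data/patterns.txt"

# -D options come before positional arguments; -skip comes after the output path.
CMD="hadoop jar examples-1.0-SNAPSHOT-shaded.jar WordCount -Dwordcount.case.sensitive=true $INPUT $OUTPUT -skip $PATTERNS"
echo "$CMD"
```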

  • Hive

    • hive -f sample.hive -hiveconf inputPath=<inputPath>

      Parameters are described as follows:

      • inputPath: Input data path.
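
The sketch below assembles the Hive invocation with a hypothetical OSS path for the u.data test table, assuming sample.hive reads the value via the inputPath Hive variable. The command is echoed rather than executed.

```shell
# Hypothetical OSS path -- replace with your own bucket and credentials.
INPUT="oss://MyAccessKeyId:MyAccessKeySecret@my-bucket.oss-cn-hangzhou.aliyuncs.com/data/u.data"

# Pass the input path to the script as a Hive variable via -hiveconf.
CMD="hive -f sample.hive -hiveconf inputPath=$INPUT"
echo "$CMD"
```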

  • Pig

    • pig -x mapreduce -f sample.pig -param tutorial=<tutorialJarPath> -param input=<inputPath> -param result=<resultPath>

      Parameters are described as follows:

      • tutorialJarPath: Dependent jar package. You can use lib/tutorial.jar.

      • inputPath: Input data.

      • resultPath: Output path.
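
The sketch below assembles the Pig invocation with the three -param values filled in. The input and result paths are hypothetical; lib/tutorial.jar is the dependency shipped with the project (use its OSS path when running on E-MapReduce). The command is echoed rather than executed.

```shell
# Hypothetical paths -- replace the OSS values with your own.
TUTORIAL_JAR="lib/tutorial.jar"
INPUT="oss://MyAccessKeyId:MyAccessKeySecret@my-bucket.oss-cn-hangzhou.aliyuncs.com/data/input.log"
RESULT="oss://MyAccessKeyId:MyAccessKeySecret@my-bucket.oss-cn-hangzhou.aliyuncs.com/output/pig-result"

# Each -param value is substituted into sample.pig at run time.
CMD="pig -x mapreduce -f sample.pig -param tutorial=$TUTORIAL_JAR -param input=$INPUT -param result=$RESULT"
echo "$CMD"
```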

Note:

  1. If you run the samples on E-MapReduce, upload the testing data and the dependent jar package to OSS. The paths must follow the OSSURI format defined above.
  2. If you run the samples inside the cluster, you can store them on the local machine.

Local running

You can run the Spark program locally to access Alibaba Cloud data sources, such as OSS. To debug locally, we recommend using a development tool such as Intellij IDEA or Eclipse, especially on Windows; otherwise, you need to configure Hadoop and Spark runtime environments on the Windows machine.

Intellij IDEA

Preparation

Install Intellij IDEA, Maven, the Intellij IDEA Maven plug-in, Scala, and the Intellij IDEA Scala plug-in.

Development process

  1. Double click to enter SparkWordCount.scala.

  2. Enter the job configuration interface at the arrow in the following figure.

  3. Select SparkWordCount and input the job parameters in the job parameter box as needed.

  4. Click OK.

  5. Click Run.

  6. View the job execution log.

Scala IDE for Eclipse

Preparation

Install Scala IDE for Eclipse, Maven and Eclipse Maven plug-in.

Development process

  1. Import project as instructed in the following figure.

  2. Run As Maven build. The shortcut is “Alt + Shift + X, M”. You can also right-click the project name and select Run As > Maven build.

  3. After the compilation is completed, right click the job you want to run, and select Run Configuration to enter the configuration page.

  4. In the configuration page, select Scala Application and configure the Main Class and parameters of the job, as shown in the figure below:

  5. Click Run.

  6. View the output log of the Console, as shown in the following figure:
