This topic describes how to configure a Spark job.

Prerequisites

A project is created. For more information, see Manage projects.

Procedure

  1. Go to the Data Platform tab.
    1. Log on to the Alibaba Cloud EMR console by using your Alibaba Cloud account.
    2. In the top navigation bar, select the region where your cluster resides and select a resource group based on your business requirements.
    3. Click the Data Platform tab.
  2. In the Projects section, find your project and click Edit Job in the Actions column.
  3. Create a Spark job.
    1. In the Edit Job pane on the left, right-click the folder in which you want to create the job and select Create Job.
    2. In the Create Job dialog box, specify Name and Description, and select Spark from the Job Type drop-down list.
      This option indicates that a Spark job will be created. You can use the following command syntax to submit a Spark job:
      spark-submit [options] --class [MainClass] xxx.jar args
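      For reference, [MainClass] is the fully qualified name of the class or object that contains the main method of your application, and everything after the JAR path is passed to that method as args. The following is a minimal, hypothetical Scala sketch of such an entry point; the package, object name, and logic are placeholders and are not part of the EMR samples:
      package com.example.demo

      import org.apache.spark.{SparkConf, SparkContext}

      // Hypothetical entry point. Submit it with, for example:
      //   spark-submit [options] --class com.example.demo.Main demo.jar arg1 arg2
      object Main {
        def main(args: Array[String]): Unit = {
          // spark-submit passes everything after the JAR path to this method.
          val conf = new SparkConf().setAppName("Demo")
          val sc = new SparkContext(conf)
          println(s"Received ${args.length} arguments: ${args.mkString(", ")}")
          sc.stop()
        }
      }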
    3. Click OK.
  4. Edit job content.
    1. In the Content field, specify the command-line parameters required to submit the job.
      Enter only the parameters that follow spark-submit. Do not include the spark-submit command itself.
      The following examples demonstrate how to specify the parameters required to submit Spark and PySpark jobs.
      • Create a Spark job.
        Create a Spark job named Wordcount. A hypothetical sketch of the main class that this example submits appears after these examples. Parameter configuration example:
        • If you submit the job from the command line, enter the following command:
          spark-submit --master yarn-client --driver-memory 7G --executor-memory 5G --executor-cores 1 --num-executors 32 --class com.aliyun.emr.checklist.benchmark.SparkWordCount emr-checklist_2.10-0.1.0.jar oss://emr/checklist/data/wc oss://emr/checklist/data/wc-counts 32
        • In the Content field of the job, enter the following parameters:
          --master yarn-client --driver-memory 7G --executor-memory 5G --executor-cores 1 --num-executors 32 --class com.aliyun.emr.checklist.benchmark.SparkWordCount ossref://emr/checklist/jars/emr-checklist_2.10-0.1.0.jar oss://emr/checklist/data/wc oss://emr/checklist/data/wc-counts 32
          Notice If the JAR package of a job is stored in OSS, you can reference the package by using a path in the ossref protocol, such as ossref://emr/checklist/jars/emr-checklist_2.10-0.1.0.jar. Click + Enter an OSS path in the lower part of the page. In the OSS File dialog box, set File Prefix to OSSREF and specify File Path. The system automatically completes the OSS path of the job file.
      • Create a PySpark job.
        In addition to Scala and Java Spark jobs, you can create Python Spark jobs in EMR. Create a PySpark job named Python-Kmeans. Parameter configuration example:
        --master yarn-client --driver-memory 7g --num-executors 10 --executor-memory 5g --executor-cores 1 ossref://emr/checklist/python/kmeans.py oss://emr/checklist/data/kddb 5 32
        Notice
        • You can reference Python script resources by using the ossref protocol.
        • You cannot install Python toolkits by using a PySpark job.
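      For reference, the following Scala sketch shows what a word count main class such as the SparkWordCount class in the preceding Spark job example might look like. This is a hypothetical reconstruction based on the three sample arguments (input path, output path, and number of partitions); it is not the actual source code of the emr-checklist JAR package:
        package com.example.benchmark

        import org.apache.spark.{SparkConf, SparkContext}

        // Hypothetical word count job that accepts <inputPath> <outputPath> <numPartitions>,
        // matching the sample arguments oss://emr/checklist/data/wc
        // oss://emr/checklist/data/wc-counts 32.
        object SparkWordCount {
          def main(args: Array[String]): Unit = {
            val inputPath = args(0)
            val outputPath = args(1)
            val numPartitions = args(2).toInt

            val conf = new SparkConf().setAppName("SparkWordCount")
            val sc = new SparkContext(conf)

            sc.textFile(inputPath, numPartitions)   // read the input with the requested parallelism
              .flatMap(_.split("\\s+"))             // split each line into words
              .map(word => (word, 1))               // pair each word with a count of 1
              .reduceByKey(_ + _)                   // sum the counts for each word
              .saveAsTextFile(outputPath)           // write the (word, count) pairs to the output path

            sc.stop()
          }
        }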
    2. Click Save.