You can create jobs in a project to develop data processing tasks. E-MapReduce (EMR) supports the following job types in data development: Shell, Hive, Hive SQL, Spark, Spark SQL, Spark Shell, Spark Streaming, MapReduce, Sqoop, Pig, Flink, Streaming SQL, Presto SQL, and Impala SQL.

Prerequisites

A project is created. For more information, see Manage projects.

Create a job

  1. Log on to the Alibaba Cloud EMR console with your Alibaba Cloud account.
  2. In the top navigation bar, select the region where your cluster resides and select a resource group based on your business requirements.
  3. Click the Data Platform tab.
  4. In the Projects section of the page that appears, find the project that you want to edit and click Edit Job in the Actions column.
  5. In the Edit Job pane on the left, right-click the folder in which you want to create the job and select Create Job.
  6. In the Create Job dialog box, specify Name and Description, and select a specific job type from the Job Type drop-down list.
    Note The job type cannot be changed after the job is created.
  7. Click OK.
    Note You can also right-click the folder and select Create Subfolder, Rename Folder, or Delete Folder to perform the corresponding operation.

Configure a job

For more information about how to develop and configure each type of job, see Jobs. This section describes how to configure parameters on the Basic Settings, Advanced Settings, Shared Libraries, and Alert Settings tabs in the Job Settings panel for a job.

  1. On the job editing page, click Job Settings in the upper-right corner.
  2. In the Job Settings panel, configure parameters on the Basic Settings tab.
    The Basic Settings tab contains the following sections:
    • Job Overview
      • Name: the name of the job.
      • Job Type: the type of the job.
      • Description: the description of the job. You can modify the description.
    • Resources: the resources required to run the job, such as JAR packages and user-defined functions (UDFs). Click the plus sign (+) on the right to add resources. You must upload the resources to Object Storage Service (OSS) before you can add them to the job here.
    • Configuration Parameters: the variables that you want to reference in the job script. You can reference a variable in the job script in the format of ${Variable name}. Click the plus sign (+) on the right to add a variable as a key-value pair. The key is the name of the variable, and the value is the value of the variable. You can also configure a time variable based on the start time of scheduling. For more information, see Configure job time and date.
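    For example, suppose that you add a configuration parameter whose key is dt and whose value is a date string such as 2023-01-01. (The name dt and its value are hypothetical, used only for illustration.) A Shell job script can then reference the parameter as follows:
      # The ${dt} placeholder is replaced with the value of the dt configuration parameter.
      echo "Processing data for partition: ${dt}"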

  3. Click the Advanced Settings tab and configure parameters.
    The Advanced Settings tab contains the following sections:
    • Mode
      • Job Submission: the job submission mode. For more information, see Edit jobs. Valid values:
        • Worker Node: The job is submitted to YARN by using a launcher, and YARN allocates resources to run the job.
        • Header/Gateway Node: The job runs as a process on the node to which it is allocated.
      • Estimated Maximum Duration: the estimated maximum running duration of the job. Valid values: 0 to 10800. Unit: seconds.
    • Environment Variables: the environment variables used to run the job. You can also export environment variables in the job script.
      • Example 1: Configure a Shell job with the code echo ${ENV_ABC}. Assume that you set the ENV_ABC variable to 12345. When the echo command runs, it returns 12345.
      • Example 2: Configure a Shell job with the code java -jar abc.jar, where the abc.jar package contains the following code:
        public static void main(String[] args) { System.out.println(System.getenv("ENV_ABC")); }
        Assume that you set the ENV_ABC variable to 12345. When the job runs, it returns 12345. Setting the ENV_ABC variable in the Environment Variables section is equivalent to running the following script:
        export ENV_ABC=12345
        java -jar abc.jar
    • Scheduling Parameters: the parameters for scheduling the job, including Queue, Memory (MB), vCores, Priority, and Run By. If you do not specify these parameters, the default settings of the Hadoop cluster are used.
      Note Memory (MB) specifies the memory quota for the launcher.
  4. Click the Shared Libraries tab.
    In the Dependent Libraries section, specify Libraries.
  5. Click the Alert Settings tab and configure the alert parameters.
    • Execution Failed: specifies whether to send a notification to a specific alert contact group or DingTalk alert group if the job fails.
    • Action on Startup Timeout: specifies whether to send a notification to a specific alert contact group or DingTalk alert group if the job startup times out.
    • Execution Timeout: specifies whether to send a notification to a specific alert contact group or DingTalk alert group if the job execution times out.

Run a job

In the Edit Job pane, click a job. In the upper-right corner of the page, click Run.

View logs

After you run a job, you can view its operational logs on the Records tab in the lower part of the job page.

Click Details in the Actions column of a job instance to go to the Scheduling Center tab. On this tab, you can view detailed information about the job instance.

Available operations on jobs

In the Edit Job pane, you can right-click a job name and perform the following operations:
  • Clone Job: Clone an existing job, together with its configuration, within the same folder.
  • Rename Job: Rename a job.
  • Delete Job: Delete a job. You can delete a job only if it is not associated with a workflow, or if the associated workflow is neither running nor being scheduled.

Job submission modes

In the data development module, Spark jobs are submitted by a spark-submit process, which serves as the launcher. This process typically occupies more than 600 MiB of memory. The Memory (MB) parameter in the Job Settings panel specifies the amount of memory allocated to the launcher.

The latest version of EMR supports the following job submission modes:
  • Header/Gateway Node: In this mode, the spark-submit process runs on the master node and is not monitored by YARN. Because each spark-submit process requests a large amount of memory, running a large number of jobs consumes substantial resources on the master node and can destabilize the cluster.
  • Worker Node: In this mode, the spark-submit process runs on a core node, occupies a YARN container, and is monitored by YARN. This mode reduces the resource usage on the master node.
In an EMR cluster, the memory consumed by a job instance consists of the memory consumed by the launcher and the memory consumed by a specific job. For a Spark job, the memory consumed by a job is further divided into the memory consumed by the spark-submit module (not the process), the memory consumed by the driver, and the memory consumed by the executor. The process where the driver runs varies based on the mode in which Spark applications are launched in YARN.
  • If Spark applications are launched in yarn-client mode, the driver runs in the same process as spark-submit. If the job is submitted in Header/Gateway Node mode, this process runs on the master node and is not monitored by YARN. If the job is submitted in Worker Node mode, this process runs on a core node, occupies a YARN container, and is monitored by YARN.
  • If Spark applications are launched in yarn-cluster mode, the driver runs in a separate process and occupies a YARN container. In this case, the driver and spark-submit run in different processes.

To sum up, the job submission mode determines whether the spark-submit process runs on the master node or a core node, and whether it is monitored by YARN. Whether the driver and spark-submit run in the same process depends on the launch mode of the Spark application, which can be yarn-client or yarn-cluster.
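As a rough sketch of the difference (the class name com.example.App and the JAR path app.jar are placeholders, not values from this document), the yarn-client and yarn-cluster launch modes correspond to the --deploy-mode option of the standard spark-submit command:

  # yarn-client mode: the driver runs inside the spark-submit process itself.
  spark-submit --master yarn --deploy-mode client --class com.example.App app.jar

  # yarn-cluster mode: the driver runs in a separate process inside a YARN container,
  # and the spark-submit process only monitors the application status.
  spark-submit --master yarn --deploy-mode cluster --class com.example.App app.jar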

Add annotations

You can add annotations to job scripts to configure job parameters in data development. Add an annotation in the following format:
!!! @<Annotation name>: <Annotation content>
Note Do not indent the three exclamation points (!!!) that start an annotation. Each annotation must be on its own line.
The following annotations are supported:
  • rem: Adds a comment. Example:
    !!! @rem: This is a comment.
  • env: Adds an environment variable. Example:
    !!! @env: ENV_1=ABC
  • var: Adds a custom variable. Examples:
    !!! @var: var1="value1 and \"one string end with 3 spaces\"   "
    !!! @var: var2=${yyyy-MM-dd}
  • resource: Adds a resource file. Example:
    !!! @resource: oss://bucket1/dir1/file.jar
  • sharedlibs: Adds dependency libraries. This annotation is valid only for Streaming SQL jobs. Separate multiple dependency libraries with commas (,). Example:
    !!! @sharedlibs: sharedlibs:streamingsql:datasources-bundle:1.7.0,...
  • scheduler.queue: Specifies the queue to which the job is submitted. Example:
    !!! @scheduler.queue: default
  • scheduler.vmem: Specifies the memory required to run the job. Unit: MiB. Example:
    !!! @scheduler.vmem: 1024
  • scheduler.vcores: Specifies the number of vCores required to run the job. Example:
    !!! @scheduler.vcores: 1
  • scheduler.priority: Specifies the priority of the job. Valid values: 1 to 100. Example:
    !!! @scheduler.priority: 1
  • scheduler.user: Specifies the user who submits the job. Example:
    !!! @scheduler.user: root
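For example, the header of a Shell job script can combine several of these annotations. The following sketch uses placeholder values; the queue name, memory size, and variable names are illustrative only:
  !!! @rem: Configure job parameters through annotations.
  !!! @env: ENV_ABC=12345
  !!! @var: dt=${yyyy-MM-dd}
  !!! @scheduler.queue: default
  !!! @scheduler.vmem: 1024
  echo "Running job for ${dt} with ENV_ABC=${ENV_ABC}"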
Notice When you add annotations, take note of the following points:
  • Invalid annotations are automatically skipped. For example, an unknown annotation or an annotation whose content is in an invalid format will be skipped.
  • Job parameters specified in annotations take precedence over job parameters specified in the Job Settings panel. If a parameter is specified both in an annotation and in the Job Settings panel, the parameter setting specified in the annotation takes effect.