You can create jobs in a project to develop data processing tasks. This topic describes job-related operations, such as how to create, configure, and run a job.

Prerequisites

A project is created. For more information, see Manage projects.

Create a job

  1. Go to the Data Platform tab.
    1. Log on to the Alibaba Cloud EMR console by using your Alibaba Cloud account.
    2. In the top navigation bar, select the region where your cluster resides and select a resource group based on your business requirements.
    3. Click the Data Platform tab.
  2. In the Projects section of the page that appears, find the project that you want to manage and click Edit Job in the Actions column.
  3. Create a job.
    1. In the Edit Job pane on the left side of the page that appears, right-click the folder on which you want to perform operations and select Create Job.
      Note You can also right-click the folder and select Create Subfolder, Rename Folder, or Delete Folder to perform the corresponding operation.
    2. In the Create Job dialog box, specify Name and Description, and then select a specific job type from the Job Type drop-down list.

      E-MapReduce (EMR) supports the following types of jobs in data development: Shell, Hive, Hive SQL, Spark, Spark SQL, Spark Shell, Spark Streaming, MapReduce, Sqoop, Pig, Flink, Streaming SQL, Presto SQL, and Impala SQL.

      Note After the job is created, you cannot change the type of the job.
    3. Click OK.
      After a job is created, you can configure and edit the job.

Configure a job

For more information about how to develop and configure each type of job, see Jobs. This section describes how to configure the parameters of a job on the Basic Settings, Advanced Settings, Shared Libraries, and Alert Settings tabs in the Job Settings panel.

  1. In the upper-right corner of the job page, click Job Settings.
  2. In the Job Settings panel, configure the parameters on the Basic Settings tab.
    Job Overview section:
    • Name: The name of the job.
    • Job Type: The type of the job.
    • Retries: The number of retries that are allowed if the job fails. Valid values: 0 to 5.
    • Actions on Failures: The action to take if the job fails. Valid values:
      • Pause: Suspends the current workflow if the job fails.
      • Run Next Job: Continues to run the next job even if the job fails.
    • Use Latest Job Content and Parameters: You can determine whether to turn on this switch based on your business requirements.
      • If you turn off this switch, a job instance is generated based on the original job content and parameters when you rerun a failed job.
      • If you turn on this switch, a job instance is generated based on the latest job content and parameters when you rerun a failed job.
    • Description: The description of the job. To modify the description, click Edit on the right side of this parameter.
    Resources section:
    • The resources that are required to run the job, such as JAR packages and user-defined functions (UDFs). Click the + icon on the right side to add resources.
      Upload the resources to Object Storage Service (OSS) first. Then, you can add them to the job.
    Configuration Parameters section:
    • The variables that you want to reference in the job script. You can reference a variable in your job script in the format of ${Variable name}, as shown in the example after this list.
      Click the + icon on the right side to add a variable as a key-value pair. The key is the name of the variable, and the value is the value of the variable. You can select Password to hide the value based on your business requirements. You can also configure a time variable based on the start time of scheduling. For more information, see Configure job time and date.
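    For example, suppose that you add a configuration parameter whose key is myDate and whose value is 2023-01-01. The parameter name and value are hypothetical and used only for illustration. A Shell job script can then reference the variable as follows:
      # ${myDate} is replaced with the value of the configuration parameter at run time.
      echo "Processing data for ${myDate}"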

  3. Click the Advanced Settings tab and configure the parameters.
    Mode section:
    • Job Submission Node: the mode in which the job is submitted. For more information, see Job submission modes. Valid values:
      • Worker Node: The job is submitted to YARN by using a launcher, and YARN allocates resources to run the job.
      • Header/Gateway Node: The job runs as a process on the allocated node.
    • Estimated Maximum Duration: the estimated maximum running duration of the job. Valid values: 0 to 10800. Unit: seconds.
    Environment Variables section:
    The environment variables that are used to run the job. You can also export environment variables from the job script.
    • Example 1: Configure a Shell job with the code echo ${ENV_ABC}. If you set the ENV_ABC variable to 12345, a value of 12345 is returned after you run the echo command.
    • Example 2: Configure a Shell job with the code java -jar abc.jar. The main method in the abc.jar package contains the following code:
      public static void main(String[] args) { System.out.println(System.getenv("ENV_ABC")); }
      If you set the ENV_ABC variable to 12345, a value of 12345 is returned after you run the job. Setting the ENV_ABC variable in the Environment Variables section is equivalent to running the following script:
      export ENV_ABC=12345
      java -jar abc.jar
    Scheduling Parameters section:
    The parameters that are used to schedule the job, including Queue, Memory (MB), vCores, Priority, and Run By. If you do not configure these parameters, the default settings of the Hadoop cluster are used.
    Note The Memory (MB) parameter specifies the memory quota for the launcher.
  4. Click the Shared Libraries tab.
    In the Dependent Libraries section, specify Libraries.

    Job execution depends on library files that are related to data sources. EMR publishes these libraries to the repository of the scheduling center as dependency libraries. You must specify the required dependency libraries when you create a job. To specify a dependency library, enter its reference string, such as sharedlibs:streamingsql:datasources-bundle:2.0.0.

  5. Click the Alert Settings tab and configure the alert parameters.
    • Execution Failed: Specifies whether to send a notification to an alert contact group or a DingTalk alert group if the job fails.
    • Action on Startup Timeout: Specifies whether to send a notification to an alert contact group or a DingTalk alert group if the job startup times out.
    • Execution Timeout: Specifies whether to send a notification to an alert contact group or a DingTalk alert group if the job execution times out.

Add annotations

You can add annotations to job scripts to configure job parameters in data development. Add an annotation in the following format:
!!! @<Annotation name>: <Annotation content>
Note Do not indent the three exclamation points (!!!) that start an annotation. Add only one annotation per line.
The following annotations are supported:
• rem: Adds a comment.
  Example: !!! @rem: This is a comment.
• env: Adds an environment variable.
  Example: !!! @env: ENV_1=ABC
• var: Adds a custom variable.
  Examples:
  !!! @var: var1="value1 and \"one string end with 3 spaces\"   "
  !!! @var: var2=${yyyy-MM-dd}
• resource: Adds a resource file.
  Example: !!! @resource: oss://bucket1/dir1/file.jar
• sharedlibs: Adds dependency libraries. This annotation is valid only in Streaming SQL jobs. Separate multiple dependency libraries with commas (,).
  Example: !!! @sharedlibs: sharedlibs:streamingsql:datasources-bundle:1.7.0,...
• scheduler.queue: Specifies the queue to which the job is submitted.
  Example: !!! @scheduler.queue: default
• scheduler.vmem: Specifies the memory that is required to run the job. Unit: MiB.
  Example: !!! @scheduler.vmem: 1024
• scheduler.vcores: Specifies the number of vCores that are required to run the job.
  Example: !!! @scheduler.vcores: 1
• scheduler.priority: Specifies the priority of the job. Valid values: 1 to 100.
  Example: !!! @scheduler.priority: 1
• scheduler.user: Specifies the user who submits the job.
  Example: !!! @scheduler.user: root
Notice
When you add annotations, take note of the following points:
  • Invalid annotations are automatically skipped. For example, an unknown annotation or an annotation whose content is in an invalid format will be skipped.
  • Job parameters specified in annotations take precedence over job parameters specified in the Job Settings panel. If a parameter is specified both in an annotation and in the Job Settings panel, the parameter setting specified in the annotation takes effect.
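The following sketch shows how several of these annotations can be combined at the top of a Shell job script. The queue name, memory value, and script content are illustrative only:
!!! @rem: Configure scheduling parameters for this job through annotations.
!!! @scheduler.queue: default
!!! @scheduler.vmem: 2048
!!! @scheduler.vcores: 1
!!! @env: ENV_ABC=12345
# The remaining lines are regular Shell code.
echo ${ENV_ABC}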

Run a job

  1. Run the job that you created.
    1. On the job page, click Run in the upper-right corner to run the job.
    2. In the Run Job dialog box, select a resource group and the cluster that you created.
    3. Click OK.
  2. View running details.
    1. Click the Log tab in the lower part of the job page to view the operational logs.
    2. Click the Records tab to view the execution records of the job instance.
    3. Click Details in the Action column of a job instance to go to the Scheduling Center tab. On this tab, you can view the details about the job instance.

Operations that you can perform on jobs

In the Edit Job pane, you can right-click a job and perform the operations that are described in the following table.
• Clone Job: Clones the configuration of a job to generate a new job in the same folder.
• Rename Job: Renames a job.
• Delete Job: Deletes a job. You can delete a job only if the job is not associated with a workflow, or if the associated workflow is not running or being scheduled.

Job submission modes

The spark-submit process, which serves as the launcher in the data development module, is used to submit Spark jobs. In most cases, this process occupies more than 600 MiB of memory. The Memory (MB) parameter in the Job Settings panel specifies the size of the memory that is allocated to the launcher.

The following table describes the modes in which jobs can be submitted in the latest version of EMR.
• Header/Gateway Node: In this mode, the spark-submit process runs on the master node and is not monitored by YARN. The spark-submit process requests a large amount of memory, so a large number of jobs consume many resources of the master node, which undermines cluster stability.
• Worker Node: In this mode, the spark-submit process runs on a core node, occupies a YARN container, and is monitored by YARN. This mode reduces the resource usage on the master node.
In an EMR cluster, the memory consumed by a job instance is calculated by using the following formula:
Memory consumed by a job instance = Memory consumed by the launcher + Memory consumed by a job that corresponds to the job instance
For a Spark job, the memory consumed by a job is calculated by using the following formula:
Memory consumed by a job = Memory consumed by the spark-submit logical module (not the process) + Memory consumed by the driver + Memory consumed by the executor
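For example, assume that the launcher consumes about 600 MiB, the driver consumes 1,024 MiB, and two executors consume 2,048 MiB each. These figures are illustrative only, and the memory of the spark-submit logical module is ignored for simplicity. In this case, the job instance consumes approximately:
Memory consumed by the job instance ≈ 600 + 1,024 + 2 × 2,048 = 5,720 MiB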
The process in which the driver runs varies based on the mode in which Spark applications are launched in YARN.
• yarn-client mode: The driver runs in the same process as spark-submit.
  • If the job is submitted in LOCAL mode, the process that is used to submit the job runs on the master node and is not monitored by YARN.
  • If the job is submitted in YARN mode, the process that is used to submit the job runs on a core node, occupies a YARN container, and is monitored by YARN.
• yarn-cluster mode: The driver runs in a different process from spark-submit. The driver occupies a YARN container.
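The following spark-submit commands are a minimal sketch of how the two launch modes are selected. The class name and JAR path are hypothetical:
  # yarn-client mode: the driver runs in the same process as spark-submit.
  spark-submit --master yarn --deploy-mode client --class com.example.Main app.jar
  # yarn-cluster mode: the driver runs in a YARN container on a cluster node.
  spark-submit --master yarn --deploy-mode cluster --class com.example.Main app.jar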