How to create, set up, and run a job - E-MapReduce - Alibaba Cloud Documentation Center

Background information

This topic covers the following:

Create a job
Configure a job
Add annotations to a job
Run a job
Operations available for jobs
Job submission modes

Prerequisites

You have created a project or have been added to a project. For more information, see Project management.

Create a job

Go to the Projects page in Data Platform.
1. Log in to the E-MapReduce console with your Alibaba Cloud account.
2. In the top navigation bar, select the region and resource group.
3. Click the Data Development tab.
In the row of the project you want to edit, click Edit Job.
Create a job.
1. In the left-side pane, right-click the target folder and select New job.
  
  Note
  You can also right-click a folder to perform other actions, such as New subfolder, Rename folder, and Delete Folder.
2. In the New job dialog box, enter the Job Name and Description, and select a job type from the Job Type drop-down list.
  
  Data development in Alibaba Cloud E-MapReduce supports the following job types: Shell, Hive, Hive SQL, Spark, SparkSQL, Spark Shell, Spark Streaming, MR, Sqoop, Pig, Flink, Streaming SQL, Presto SQL, and Impala SQL.
  
  Note
  The Job Type cannot be changed after the job is created.
3. Click OK.
  
  After you create the job, you can configure and edit it.

Configure a job

For more information about how to develop and configure each type of job, see Jobs. This section describes the Basic Settings, Advanced Settings, Shared library, and Alert Settings tabs of a job.

On the Edit Job page, click Job Settings in the upper-right corner.

In the Job Settings panel, configure the basic information.

Parameter		Description
Job overview	Job Name	The name of the job you created.
	Job Type	The type of job you created.
	Retries After Failure	The number of times to retry a job after it fails. You can select a value from 0 to 5.
	Policy upon Failure	The policy to apply if a job fails. Valid values: Pause current workflow (pauses the current workflow if the job fails) and Continue to run the next job (continues to the next job in the workflow if the current job fails). You can also turn the Use latest job content and parameters switch on or off based on your business requirements. Off: Retries use the original job content and parameters. On: Retries use the latest job content and parameters.
	Description	Click Edit on the right to modify the job description.
Runtime resources		Click the icon to add resources the job depends on, such as JAR packages or UDFs. You must first upload the resources to OSS and then add them in the Runtime resources section.
Configuration parameters		Specify values for variables that are referenced in the job code. You can reference variables in your code by using the ${variable_name} format. Click the icon to add a key and a value. You can choose whether to encrypt the value. The key is the variable name, and the value is the variable's value. You can also configure time variables based on the scheduled start time. For more information, see Configure job dates.
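
The `${variable_name}` substitution described in the Configuration parameters row can be pictured with the following minimal Python sketch. It is an illustration only, not the platform's implementation: the function name `render_job_content`, the sample keys, and the choice to leave unknown placeholders untouched are all assumptions.

```python
import re

# Matches ${name} where name looks like an identifier (assumed naming rule).
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def render_job_content(content, params):
    """Replace each ${key} with params[key]; leave unknown keys as written."""
    def _substitute(match):
        key = match.group(1)
        return params.get(key, match.group(0))

    return _PLACEHOLDER.sub(_substitute, content)


# Hypothetical job content and configuration parameters.
job_content = "hadoop fs -ls ${input_dir} && echo ${run_label}"
params = {"input_dir": "oss://bucket1/dir1", "run_label": "daily"}
print(render_job_content(job_content, params))
```

In this sketch, a time expression such as `${yyyy-MM-dd}` is not matched by the identifier pattern and is passed through unchanged; how the platform resolves time variables is covered in Configure job dates.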

In the Job Settings panel, click the Advanced Settings tab.

Parameter	Description
Mode	Submission node: The node from which the job is submitted. For more information, see Job submission modes. Valid values: Submit on worker node (a launcher submits the job from a worker node after YARN allocates resources) and Submit on header/gateway node (the job runs directly on the header/gateway node). Maximum expected runtime: Valid values are 0 to 10,800 seconds.
Environment variables	Add environment variables for job execution. You can also export environment variables directly in the job script. Example 1: For a Shell job with the content `echo ${ENV_ABC}`, if you set the environment variable `ENV_ABC=12345`, the output of the `echo` command is `12345`. Example 2: For a Shell job with the content `java -jar abc.jar`, where abc.jar contains the following main method: `public static void main(String[] args) {System.out.println(System.getenv("ENV_ABC"));}` If you set the environment variable `ENV_ABC=12345`, the output is `12345`. Setting the environment variable here is equivalent to running the following script: `export ENV_ABC=12345; java -jar abc.jar`
Scheduling Parameters	Configure scheduling information for the job, such as the YARN queue, memory, vCores, priority, and submitting user. If you do not set these parameters, they use the default values from the Hadoop cluster. Note The memory setting configures the memory quota for the launcher.
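
The effect of the Environment variables setting can be reproduced locally with a short Python sketch: a variable added to the environment of a child process is visible to that process, just as `ENV_ABC` is visible to the job in the examples above. The helper name `run_with_env` is hypothetical, and the child here is a Python one-liner rather than the Shell or Java job from the table.

```python
import os
import subprocess
import sys


def run_with_env(extra_env, argv):
    """Run argv with extra_env added to the current environment; return its stdout."""
    env = dict(os.environ)
    env.update(extra_env)
    result = subprocess.run(argv, env=env, capture_output=True, text=True, check=True)
    return result.stdout.strip()


# The child process prints the variable it inherits, like `echo ${ENV_ABC}`.
child = [sys.executable, "-c", "import os; print(os.environ['ENV_ABC'])"]
print(run_with_env({"ENV_ABC": "12345"}, child))
```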

In the Job Settings panel, click the Shared library tab.

In the Dependent library area, specify the libraries that the job depends on.

Job execution may depend on certain data source-related library files. Alibaba Cloud E-MapReduce provides these as a shared library in the scheduling service repository. When you create a job, you must specify the version of the dependent library to use. For example: sharedlibs:streamingsql:datasources-bundle:2.0.0.

In the Job Settings panel, click the Alert Settings tab.

Parameter	Description
Failed	Sends a notification to a user alert group or a DingTalk alert group if the job fails.
Startup Timed Out	Sends a notification to a user alert group or a DingTalk alert group if the job startup times out.
On execution timeout	Sends a notification to a user alert group or a DingTalk alert group if the job execution times out.

Add annotations to a job

During data development, you can set job parameters by adding annotations to the job content. The annotation format is as follows:

!!! @<AnnotationName>: <AnnotationContent>

Note

The !!! characters must be at the beginning of the line with no indentation. Use one annotation per line.

The following annotations are supported.

Annotation name	Description	Example
rem	Indicates a line of comment.	`!!! @rem: This is a comment.`
env	Adds an environment variable.	`!!! @env: ENV_1=ABC`
var	Adds a custom variable.	`!!! @var: var1="value1 and \"one string end with 3 spaces\" "` and `!!! @var: var2=${yyyy-MM-dd}` (each on its own line)
resource	Adds a resource file.	`!!! @resource: oss://bucket1/dir1/file.jar`
sharedlibs	Adds a dependent library. This annotation is valid only for Streaming SQL jobs. If you specify multiple libraries, separate them with commas (,).	`!!! @sharedlibs: sharedlibs:streamingsql:datasources-bundle:1.7.0,...`
scheduler.queue	Sets the submission queue.	`!!! @scheduler.queue: default`
scheduler.vmem	Sets the requested memory, in MB.	`!!! @scheduler.vmem: 1024`
scheduler.vcores	Sets the requested number of vCores.	`!!! @scheduler.vcores: 1`
scheduler.priority	Sets the request priority. The value can range from 1 to 100.	`!!! @scheduler.priority: 1`
scheduler.user	Sets the submission username.	`!!! @scheduler.user: root`

Important

When using annotations, note the following:

The system automatically ignores invalid annotations.
Job parameters specified in annotations override parameters configured in the job settings.
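
The annotation rules above can be summarized in a minimal Python sketch. It is not the platform's parser: the function names, the regular expression, and the treatment of indented `!!!` lines as ordinary job content are assumptions made for illustration.

```python
import re

# Annotation names listed in the table above.
SUPPORTED = {
    "rem", "env", "var", "resource", "sharedlibs",
    "scheduler.queue", "scheduler.vmem", "scheduler.vcores",
    "scheduler.priority", "scheduler.user",
}

# "!!!" must start the line with no indentation; one annotation per line.
_ANNOTATION = re.compile(r"^!!! @([A-Za-z.]+):\s*(.*)$")


def parse_annotations(job_content):
    """Split job content into (annotations, body); skip invalid annotations."""
    annotations, body = [], []
    for line in job_content.splitlines():
        match = _ANNOTATION.match(line)
        if match and match.group(1) in SUPPORTED:
            if match.group(1) != "rem":  # rem lines are comments only
                annotations.append((match.group(1), match.group(2)))
        elif line.startswith("!!!"):
            continue  # invalid annotation: ignored
        else:
            body.append(line)
    return annotations, "\n".join(body)


def effective_scheduler_settings(panel_settings, annotations):
    """Annotation values override the values configured in the job settings."""
    merged = dict(panel_settings)
    for name, value in annotations:
        if name.startswith("scheduler."):
            merged[name.split(".", 1)[1]] = value
    return merged


# Hypothetical Shell job content.
job = "\n".join([
    "!!! @rem: nightly cleanup",
    "!!! @scheduler.queue: etl",
    "!!! @unknown: ignored",
    "echo done",
])
annotations, body = parse_annotations(job)
print(effective_scheduler_settings({"queue": "default", "vcores": "1"}, annotations))
```

In this example, the queue from the annotation replaces the queue from the job settings, while the vCores value, which has no annotation, keeps its configured value.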

Run a job

Run the job.
1. On the job page, click Run in the upper-right corner.
2. In the Run Job dialog box, select a resource group and an execution cluster.
3. Click OK.

View the job execution log.

After the job starts running, you can view its execution log on the Log tab.

2021-09-02 16:37:54.653 [main] INFO  c.a.e.f.a.j.l.impl.CommonShellJobLauncherImpl - [COMMAND][FJI-xxx] submit user: hadoop
2021-09-02 16:37:54.654 [main] INFO  c.a.e.f.a.j.l.impl.CommonShellJobLauncherImpl - [COMMAND][FJI-3xxx] envs(override): {EMR_FLOW_AGENT_JOB_ID=FJI-xxx, PATH=/mnt/disk3/yarn/usercache/hadoop/appcache/application_1630311773326_0022/container_1630311773326_0022_01_000001, EMR_FLOW_CLUSTER_ID=C-xxx, FLOW_SKIP_SQL_ANALYZE=false, EMR_FLOW_JOB_INSTANCE_ID="xxx", EMR_FLOW_NODE_INSTANCE_ID="xxx", EMR_FLOW_JOB_ID="xxx"}
2021-09-02 16:37:54.654 [main] INFO  c.a.e.f.a.j.l.impl.CommonShellJobLauncherImpl - [COMMAND][xxx] Executing command line: [bash, -c, echo 234]
2021-09-02 16:37:54.655 [main] INFO  c.a.e.f.a.j.l.impl.CommonShellJobLauncherImpl - [COMMAND][FJI-3xxx] Shell Executor type: com.aliyun.emr.flow.agent.common.shell.JavaShellExecutor.
===================JOB OUTPUT BEGIN===================
234
===================JOB OUTPUT END=====================
2021-09-02 16:37:55.159 [main] INFO  c.a.e.f.a.j.l.impl.CommonShellJobLauncherImpl - [COMMAND][FJI-3E78xxx] Finished command line, exit code=0.
Thu Sep 02 16:37:55 CST 2021 [JobLauncherRunner] INFO Closing job launcher ...
2021-09-02 16:37:55.161 [main] INFO  c.a.emr.flow.agent.jobs.launcher.JobLauncherBase - [FJI-3E788xxx] Closing ...
2021-09-02 16:37:55.162 [main] INFO  c.a.e.f.a.j.l.impl.CommonShellJobLauncherImpl - [FJI-3E788F6xxx] Stopping command executor ...
Thu Sep 02 16:37:55 CST 2021 [YarnJobLauncherAM] INFO Closing launcher am ...
2021-09-02 16:37:55.167 [main] INFO  o.a.hadoop.yarn.client.api.impl.AMRMClientImpl - Waiting for application to be successfully unregistered.
Thu Sep 02 16:37:55 CST 2021 [YarnJobLauncherAM] INFO Emr flow launcher is quit.
2021-09-02 16:37:55.385 [Shutdown-FJI-3E788F67373BAECD_0] INFO  c.a.emr.flow.agent.jobs.launcher.JobLauncherBase - [FJI-3E78xxx] Call shutdown hook.
2021-09-02 16:37:55.385 [Shutdown-FJI-3E788F67373BAECD_0] INFO  c.a.emr.flow.agent.jobs.launcher.JobLauncherBase - [FJI-378xxx] Closing ...
2021-09-02 16:37:55.385 [Shutdown-FJI-3E788F67373BAECD_0] INFO  c.a.emr.flow.agent.jobs.launcher.JobLauncherBase - [FJI-3E78xxx] This launcher is closed already, skip.
######END_OF_LOG######

Click the Execution Records tab to view the execution records of the job.
Click Details for a record to go to the O&M Center, where you can view detailed information about the job instance.

Operations available for jobs

In the Edit Job area, you can right-click a job name to perform the following operations.

Actions	Description
Clone Job	Clones the configuration of the current job to create a new job in the same folder.
Rename Job	Renames the job.
Delete job	You can delete a job only if it is not associated with any workflow, or if its associated workflow is not currently running or scheduled.

Job submission modes

In the data development module, the launcher for a Spark job is the spark-submit process, which is the Spark command used to submit Spark jobs. It typically consumes at least 600 MB of memory. The memory setting in the Job Settings panel configures the memory quota for the launcher.

The following two job submission modes are available:

Job submission mode	Description
Submit on header/gateway node	The spark-submit process runs on the header/gateway node and is not monitored by YARN. Because spark-submit consumes a large amount of memory, running too many jobs in this mode can strain the header/gateway node's resources and risk cluster-wide instability.
Submit on worker node	The spark-submit process runs on a worker node, occupies a YARN container, and is monitored by YARN. This mode can alleviate the resource load on the header/gateway node.

In an Alibaba Cloud E-MapReduce cluster, the memory consumed by a job instance is calculated as follows:

Memory consumed by a job instance = Memory consumed by the launcher + Memory consumed by the user job

For a Spark job, the memory consumed by the user job can be further broken down as follows:

Memory consumed by the user job = Memory consumed by spark-submit (the logical module, not the process) + Memory consumed by the driver + Memory consumed by the executor
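
As a worked example of the two formulas above, the following Python sketch adds up the terms for a hypothetical Spark job. All figures are made up for illustration, and treating executor memory as the total across all executors is an assumption.

```python
def spark_user_job_memory_mb(spark_submit_mb, driver_mb, executor_total_mb):
    """User job memory = spark-submit (logical module) + driver + executors."""
    return spark_submit_mb + driver_mb + executor_total_mb


def job_instance_memory_mb(launcher_mb, user_job_mb):
    """Job instance memory = launcher + user job."""
    return launcher_mb + user_job_mb


# Hypothetical figures: two executors at 2,048 MB each.
user_job_mb = spark_user_job_memory_mb(
    spark_submit_mb=600, driver_mb=1024, executor_total_mb=2 * 2048
)
total_mb = job_instance_memory_mb(launcher_mb=512, user_job_mb=user_job_mb)
print(user_job_mb, total_mb)
```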

The physical memory location of the driver varies depending on the job configuration, as described in the following table.

Spark execution mode	Job submission process	spark-submit and driver	Process details
Yarn-Client mode	Job submission process uses LOCAL mode	The spark-submit and driver run in the same process.	The job submission process runs on the header/gateway node and is not monitored by YARN.
	Job submission process uses YARN mode	The spark-submit and driver run in the same process.	The job submission process runs on a worker node, occupies a YARN container, and is monitored by YARN.
Yarn-Cluster mode		The driver runs in a separate process from spark-submit.	The driver occupies a YARN container.
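
The table above can be read as a small lookup, sketched here in Python. The function name, the mode strings, and the returned field names are hypothetical. The `yarn_managed` flag is a simplification that is true whenever the table says the process occupies a YARN container.

```python
def driver_placement(spark_mode, submit_mode=None):
    """Describe where the Spark driver runs for a given mode combination."""
    if spark_mode == "yarn-cluster":
        # The driver is separate from spark-submit and occupies a YARN container.
        return {"same_process_as_spark_submit": False,
                "location": "YARN container", "yarn_managed": True}
    if spark_mode == "yarn-client" and submit_mode == "LOCAL":
        # spark-submit and the driver share one process on the header/gateway node.
        return {"same_process_as_spark_submit": True,
                "location": "header/gateway node", "yarn_managed": False}
    if spark_mode == "yarn-client" and submit_mode == "YARN":
        # spark-submit and the driver share one process in a YARN container.
        return {"same_process_as_spark_submit": True,
                "location": "worker node", "yarn_managed": True}
    raise ValueError(f"unsupported combination: {spark_mode!r}, {submit_mode!r}")


print(driver_placement("yarn-client", "LOCAL")["location"])
```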