In a project, you can create Shell, Hive, Hive SQL, Spark, Spark SQL, Spark Shell, Spark Streaming, MapReduce, Sqoop, Pig, Flink, and Streaming SQL jobs.

Create a job

  1. Log on to the E-MapReduce console by using an Alibaba Cloud account.
  2. Click the Data Platform tab. The project list appears.
  3. Find the target project and click Edit Job in the Actions column. The Edit Job page appears.
  4. Right-click the folder where you want to create a job, and select Create Job.
  5. In the Create Job dialog box, set the Name and Description parameters, and select a type from the Job Type drop-down list.

    The job type cannot be changed after the job is created.

  6. Click OK.
    Note You can also right-click a folder and select Create Subfolder, Rename Folder, or Delete Folder to perform the corresponding operation.

Configure a job

For more information about how to configure different types of jobs, see Jobs. This section provides general guidance on how to set basic, advanced, and alert parameters for a job.
Note When you insert an Object Storage Service (OSS) path and select ossref as the path prefix, E-MapReduce downloads the OSS file to your cluster and adds the file to the specified classpath.
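For example, the following is a minimal sketch of Spark job content that references a JAR file stored in OSS (the bucket name, class name, and JAR name are hypothetical):
    --master yarn --deploy-mode client --class com.example.WordCount ossref://your-bucket/jars/wordcount-1.0.jar
When the job runs, E-MapReduce downloads wordcount-1.0.jar from the OSS bucket to the cluster before it submits the job.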
  1. In the upper-right corner, click Job Settings. The Job Settings dialog box appears.
  2. On the Basic Settings tab, set basic parameters.
  3. After basic parameters are set, click the Advanced Settings tab to set advanced parameters.
  4. After advanced parameters are set, click the Alert Settings tab to set alert parameters.

Modes of submitting jobs

The spark-submit process, which is the launcher in the data development module, is used to submit Spark jobs. This process typically occupies more than 600 MB of memory. The Memory (MB) parameter in the Job Settings dialog box specifies the amount of memory allocated to the launcher.

The latest version of E-MapReduce supports the following modes of submitting jobs:
  • Header/Gateway Node: In this mode, the spark-submit process runs on the header node and is not monitored by YARN. The spark-submit process consumes a large amount of memory, so running a large number of jobs uses up resources on the header node and can make the cluster unstable.
  • Worker Node: In this mode, the spark-submit process runs on a worker node, occupies a YARN container, and is monitored by YARN. This reduces resource usage on the header node.

In an E-MapReduce cluster, the memory consumed by a job instance includes the memory consumed by the launcher and the memory consumed by the job. For a Spark job, the memory consumed by the job is further divided into the memory consumed by the spark-submit module (not the process), the memory consumed by the driver, and the memory consumed by the executors. The process in which the driver runs depends on the mode in which the Spark application is launched on YARN.

  • If the Spark application is launched in yarn-client mode, the driver runs in the same process as spark-submit. If you submit the job in LOCAL mode (Header/Gateway Node), this process runs on the header node and is not monitored by YARN. If you submit the job in YARN mode (Worker Node), this process runs on a worker node, occupies a YARN container, and is monitored by YARN.
  • If the Spark application is launched in yarn-cluster mode, the driver runs in a separate process that occupies its own YARN container. In this case, the driver and spark-submit run in different processes.
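
For example, assume a hypothetical Spark job with a 600 MB launcher, a 2 GB driver, and two 4 GB executors. In yarn-client mode, the launcher and driver share a single process that consumes about 2.6 GB of memory, on the header node if the job is submitted in LOCAL mode or in a YARN container if the job is submitted in YARN mode, while the executors consume 8 GB in YARN containers. In yarn-cluster mode, the launcher process consumes only about 600 MB, and the driver occupies a separate 2 GB YARN container.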

To sum up, the job submission mode determines whether the spark-submit process runs on the header node or a worker node, and whether it is monitored by YARN. Whether the driver and spark-submit run in the same process depends on the launch mode of the Spark application: yarn-client or yarn-cluster, as sketched below.
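
The following commands are a minimal sketch of how the two launch modes map to spark-submit parameters (the class name, JAR name, and memory sizes are hypothetical):

    spark-submit --master yarn --deploy-mode client --driver-memory 2g --executor-memory 4g --class com.example.WordCount wordcount-1.0.jar
    spark-submit --master yarn --deploy-mode cluster --driver-memory 2g --executor-memory 4g --class com.example.WordCount wordcount-1.0.jar

With --deploy-mode client (yarn-client), the driver runs inside the spark-submit process. With --deploy-mode cluster (yarn-cluster), the driver runs in a separate YARN container on the cluster.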

Run a job

After a job is configured, you can click Run in the upper-right corner to run the job.

View operational logs

After you run the job, you can view operational logs on the Records tab at the bottom of the Temporary Queries page. Click Details for a record to go to the details page. On this page, you can view the job submission logs and YARN container logs.

FAQ

Q: What can I do if the disk capacity becomes insufficient because streaming jobs generate too many logs?

A: For streaming jobs such as Spark Streaming jobs, we recommend that you enable log rotation to prevent the logs of long-running jobs from using up disk space. To enable log rotation for a job, follow these steps:

  1. Log on to the E-MapReduce console by using an Alibaba Cloud account.
  2. Click the Data Platform tab. The project list appears.
  3. Find the target project and click Edit Job in the Actions column. The Edit Job page appears.
  4. In the upper-right corner, click Job Settings. The Job Settings dialog box appears.
  5. Click the Advanced Settings tab.
  6. Click Add in the Environment Variables section. Add the following environment variable:
    FLOW_ENABLE_LOG_ROLLING = true
  7. After the environment variable is added, restart the job.
    Note If a job has generated too many logs and you do not want to restart the job, run the echo > /path/to/log/dir/stderr command to clear the logs.