This topic describes how to log on to the Alibaba Cloud E-MapReduce (EMR) console by using your Alibaba Cloud account, create a Hadoop cluster by using the Quick Purchase method, and then create and run a job in the cluster.

Prerequisites

  • An Alibaba Cloud account is created, and real-name verification is complete.
  • The permissions of the default EMR and Elastic Compute Service (ECS) roles are granted to the EMR service. For more information, see Authorize roles.

Procedure

  1. Step 1: Create a cluster
    Create a Hadoop cluster by using the Quick Purchase method.
  2. Step 2: Create and run a job
    Create and run a Spark job in the Hadoop cluster by using the EMR console or the CLI.
  3. Step 3: View the details of the job
    View the details of the job in the Data Platform module of the EMR console or on the YARN web UI.
  4. Step 4: (Optional) Release the cluster
    If you no longer need to use the cluster, release the cluster to reduce costs.

Step 1: Create a cluster

  1. Go to the Cluster Management page.
    1. Log on to the Alibaba Cloud EMR console.
    2. In the top navigation bar, select the region in which you want to create a cluster and select a resource group based on your business requirements.
    3. Click the Cluster Management tab.
  2. On the Cluster Management page, click Quick Purchase in the upper-right corner.
  3. On the Quick Purchase page, configure the parameters.
    The following table describes the parameters used in this topic. If you use the Cluster Wizard method to create a cluster, you can view information about the parameters in Create a cluster.
    Section | Parameter | Example | Description
    Basic Information | Billing Method | Pay-As-You-Go | When you perform a test, we recommend that you use the pay-as-you-go billing method. After the test is complete, you can release the cluster and create a subscription cluster in the production environment.
    Basic Information | Cluster Type | Hadoop | The cluster type. Use the default value Hadoop.
    Basic Information | EMR Version | EMR-5.3.1 | The version of EMR. Select the latest version.
    Basic Information | Cluster Name | Emr-Hadoop | The name of the cluster. The name must be 1 to 64 characters in length and can contain only letters, digits, hyphens (-), and underscores (_).
    Basic Information | Password | Custom password | Remember the password. It is required when you log on to the cluster.
      Note If you use the Cluster Wizard method to create a cluster, we recommend that you use a key pair. Key pairs are more secure and easier to use than passwords. For more information about how to use key pairs, see Log on to the cluster by using an SSH key pair.
    Network Settings | Zone | China (Hangzhou) Zone I | You cannot change the region or zone after the cluster is created. Exercise caution when you configure this parameter.
    Network Settings | Network Type | VPC | The value of this parameter can only be VPC.
    Network Settings | VPC | VPC_Hangzhou(192.168.xx.xx/16)(ID: vpc-bp1f4epmkvncimpgs****) | You must select a VPC in the specified region. If no VPC is available in the region, click Create VPC/VSwitch to create a VPC. After the VPC is created, click Refresh and select the created VPC.
      Note For more information about how to create a VPC, see Create a VPC. Make sure that the zone of the vSwitch is the same as the zone of the cluster.
    Network Settings | VSwitch | vsw_i(192.168.xx.xx/24)(ID: vsw-bp1e2f5fhaplp0g6p****) | The vSwitch of the cluster. Select a vSwitch in the specified zone. If no vSwitch is available in the zone, create a vSwitch.
    Network Settings | Security Group Name | sg-bp1ddw7sm2risw****(ID:sg-bp1ddw7sm2riswb1****) | The security group to which you want to add your cluster. If this is the first time you use EMR, no security group exists. Enter a name to create a security group. If you have created security groups in EMR, you can select a security group based on your business requirements.
      Notice Do not use an advanced security group that is created in the ECS console.
    High Availability | High Availability | Disabled | High availability is disabled by default. For a Hadoop cluster, if high availability is enabled, two or three master nodes are created in the cluster to ensure the availability of the ResourceManager and NameNode processes.
    Instance | Learn More | Default configurations | Configure Master Instance, Core Instance, and Task Instance based on your business requirements. For more information, see Select configurations.
  4. Read the terms of service, select E-MapReduce Service Terms, and then click Create.
    After you click Create, the cluster appears on the Cluster Management page in the Initializing state. To view the progress, click the Details icon to the right of Initializing. Refresh the page to view the current status of the cluster. The cluster is created when its status changes to Idle.
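
The naming rule and the VPC and vSwitch relationship in the preceding table can be checked before you fill in the form. The following Python code is a minimal sketch and not part of EMR. The helper names are hypothetical, "letters" is assumed to mean ASCII letters, and the CIDR blocks are made-up stand-ins for the masked 192.168.xx.xx values in the examples. The check that the vSwitch CIDR block falls inside the VPC CIDR block is also an assumption based on how VPC subnets generally work.

```python
import ipaddress
import re

# Rule from the parameter table: 1 to 64 characters; letters, digits,
# hyphens (-), and underscores (_) only. ASCII letters are an assumption.
CLUSTER_NAME_PATTERN = re.compile(r"[A-Za-z0-9_-]{1,64}")


def is_valid_cluster_name(name: str) -> bool:
    """Return True if the name satisfies the length and character rule."""
    return CLUSTER_NAME_PATTERN.fullmatch(name) is not None


def vswitch_in_vpc(vswitch_cidr: str, vpc_cidr: str) -> bool:
    """Return True if the vSwitch CIDR block lies inside the VPC CIDR block."""
    vswitch = ipaddress.ip_network(vswitch_cidr)
    vpc = ipaddress.ip_network(vpc_cidr)
    return vswitch.subnet_of(vpc)


# Hypothetical values for illustration only.
print(is_valid_cluster_name("Emr-Hadoop"))
print(is_valid_cluster_name("emr hadoop"))
print(vswitch_in_vpc("192.168.1.0/24", "192.168.0.0/16"))
```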

Step 2: Create and run a job

After the cluster is created, you can create and run a job in the cluster by using the EMR console or the CLI. Choose either method based on your business requirements.

Use the EMR console

  1. Create a project.
    1. Click the Data Platform tab.
    2. In the Projects section, click Create Project.
    3. In the Create Project dialog box, configure the parameters.
      Parameter | Example | Description
      Project Name | project-dev | The name of the project.
      Project Description | Test project | The description of the project.
      Select Resource Group | default resource group | Select an existing resource group from the Select Resource Group drop-down list.
    4. Click Create.
      In the Projects section, you can view and manage the created project.
  2. Create a job in the project.
    1. In the Projects section, find the project that you created, and click Edit Job in the Actions column.
    2. In the Edit Job pane on the left, right-click the folder on which you want to perform operations and select Create Job.
      Note You can also right-click the folder to create a subfolder, rename the folder, or delete the folder.
    3. In the Create Job dialog box, configure the parameters.
      Parameter | Example | Description
      Name | spark-demo | The name of the job. We recommend that you customize a name that has a specific meaning to facilitate job management.
      Description | Spark demo job | The description of the job. We recommend that you describe the job based on the specific use scenario to facilitate job management.
      Job Type | Spark | Various job types are supported. In this example, Spark is used.
  3. Edit job content.
    1. In the Content field, enter the command that is used to submit the Spark job.
      In this example, Spark 3.1.1 is used, and the following command is used as the job content:
      --class org.apache.spark.examples.SparkPi --master yarn-client --driver-memory 512m --num-executors 1 --executor-memory 1g --executor-cores 2 /usr/lib/spark-current/examples/jars/spark-examples_2.12-3.1.1.jar 10
      Note spark-examples_2.12-3.1.1.jar is the JAR package in your cluster. You can log on to the cluster and obtain the package in the /usr/lib/spark-current/examples/jars directory.
    2. In the upper-right corner, click Save.
  4. Run the job.
    1. In the upper-right corner of the page, click Run.
    2. In the Run Job dialog box, select a resource group and the cluster that you created.
    3. Click OK.
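
To make the job content easier to read, the following Python code is a rough single-process sketch of the Monte Carlo estimate that the SparkPi example performs. It is based on my understanding of the example that ships with Spark, not on the EMR documentation: the trailing argument 10 is taken to be the number of partitions, and 100,000 samples per partition is assumed. The function name and the seed parameter are hypothetical.

```python
import random


def estimate_pi(slices: int, samples_per_slice: int = 100_000, seed: int = 0) -> float:
    """Estimate Pi by sampling random points in the square [-1, 1] x [-1, 1].

    Spark distributes the samples across `slices` partitions; this sketch
    runs the same arithmetic sequentially in one process.
    """
    rng = random.Random(seed)
    total = slices * samples_per_slice
    inside = 0
    for _ in range(total):
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        # Count the points that fall inside the unit circle.
        if x * x + y * y <= 1.0:
            inside += 1
    # The circle-to-square area ratio is Pi / 4.
    return 4.0 * inside / total


print(f"Pi is roughly {estimate_pi(10)}")
```

A larger trailing argument means more samples and, on a cluster, more tasks to schedule, so the estimate becomes more precise at the cost of a longer run.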

Use the CLI

  1. Log on to your cluster over SSH. For more information, see Log on to a cluster.
  2. Run a command in the CLI to submit and run a job.
    In this example, Spark 3.1.1 is used and the following command is used to submit and run a job:
    spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode client --driver-memory 512m --num-executors 1 --executor-memory 1g --executor-cores 2 /usr/lib/spark-current/examples/jars/spark-examples_2.12-3.1.1.jar 10
    Note spark-examples_2.12-3.1.1.jar is the JAR package in your cluster. You can log on to the cluster and obtain the package in the /usr/lib/spark-current/examples/jars directory.
    The following figure shows the information that is returned after you submit the job.
    [Figure: information returned in the CLI after the job is submitted]
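
If you submit the job from a script instead of typing the command, you can assemble the same arguments as a list. The following Python code is a minimal sketch. The function name is hypothetical, and the JAR path is the one used in this example and may differ in your cluster. The sketch only prints the command. Running it requires spark-submit on the PATH of a cluster node.

```python
import shlex
from typing import List

# JAR path taken from the example in this topic; adjust it for your cluster.
SPARK_EXAMPLES_JAR = "/usr/lib/spark-current/examples/jars/spark-examples_2.12-3.1.1.jar"


def build_spark_pi_command(slices: int = 10) -> List[str]:
    """Return the spark-submit argument list for the SparkPi example."""
    return [
        "spark-submit",
        "--class", "org.apache.spark.examples.SparkPi",
        "--master", "yarn",
        "--deploy-mode", "client",
        "--driver-memory", "512m",
        "--num-executors", "1",
        "--executor-memory", "1g",
        "--executor-cores", "2",
        SPARK_EXAMPLES_JAR,
        str(slices),
    ]


command = build_spark_pi_command()
# Print the command in a form that can be pasted into a shell.
print(shlex.join(command))
# On a cluster node, the list can be passed to subprocess.run(command, check=True).
```

Passing the arguments as a list avoids shell quoting issues if a path or argument ever contains spaces.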

Step 3: View the details of the job

After you submit the job, you can view the details of the job in the Data Platform module of the EMR console or on the YARN web UI.

Use the Data Platform module of the EMR console

This method is suitable for jobs that are created and submitted in the EMR console.

  1. After you submit the job, view the operational logs of the job on the Log tab.
  2. Click the Records tab to view the execution records of the job instance.
  3. Find the record whose details you want to view and click Details in the Action column. On the Scheduling Center page, view the information about the job instance, job submission logs, and YARN containers.

Use the YARN web UI

This method is suitable for jobs that are created and submitted either in the EMR console or by using the CLI.

  1. Configure security group rules. For more information, see Configure security group rules.
  2. In the left-side navigation pane of the Clusters and Services page for the cluster, click Connect Strings.
  3. On the Public Connect Strings page, click the link for YARN UI.

    To access the YARN web UI by using your Knox account, you must obtain the username and password of the Knox account. For more information, see Manage user accounts.

  4. In the Hadoop console, click the ID of the job to view the details of the job.
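
The job details shown on the YARN web UI are also exposed by the YARN ResourceManager REST API, which can be convenient for scripted checks. The following Python code is a sketch under several assumptions: the ResourceManager is reachable directly on its default web port 8088 (access through Knox uses a different address and authentication, which is not covered here), the host name is hypothetical, and the sample response is made up for illustration and contains only a few of the fields that the API returns.

```python
import json


def app_url(rm_host: str, app_id: str, port: int = 8088) -> str:
    """Build the ResourceManager REST URL for one application."""
    return f"http://{rm_host}:{port}/ws/v1/cluster/apps/{app_id}"


def summarize_app(payload: str) -> dict:
    """Extract a few commonly used fields from an application response."""
    app = json.loads(payload)["app"]
    return {key: app.get(key) for key in ("id", "name", "state", "finalStatus")}


# Made-up sample response; a real one is returned by an HTTP GET on app_url(...).
SAMPLE_RESPONSE = json.dumps({
    "app": {
        "id": "application_0000000000000_0001",
        "name": "Spark Pi",
        "state": "FINISHED",
        "finalStatus": "SUCCEEDED",
    }
})

print(app_url("rm-host.example", "application_0000000000000_0001"))
print(summarize_app(SAMPLE_RESPONSE))
```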

Step 4: (Optional) Release the cluster

If you no longer need to use the cluster, you can release it to reduce costs. After you confirm the release of a cluster, the system performs the following operations on the cluster:

  1. Forcibly terminates all jobs in the cluster.
  2. Terminates and releases all ECS instances that are created for the cluster.

The time required to release a cluster depends on the size of the cluster. Most clusters are released within seconds, and even a large cluster is released within 5 minutes.

Notice
  • A pay-as-you-go cluster can be released at any time. A subscription cluster can be released only after the cluster expires.
  • Before you release a cluster, make sure that the cluster is in the Initializing, Running, or Idle state.
  1. On the Edit Job page, choose Back to EMR console > Cluster Management in the upper-right corner.
  2. On the Cluster Management page, find the cluster and choose More > Release in the Actions column.
    Alternatively, find the cluster and click Details in the Actions column. On the Cluster Overview page, choose Instance Status > Release in the upper-right corner.
  3. In the Cluster Management-Release message, click Release.
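
The release conditions in the preceding notice can be expressed as a small check, for example in a cleanup script. The following Python code is a minimal sketch. The function name and the string values for the billing method and cluster state are hypothetical labels modeled on the names used in this topic, not identifiers from an EMR API.

```python
# States in which a cluster can be released, as listed in the notice.
RELEASABLE_STATES = {"Initializing", "Running", "Idle"}


def can_release(billing_method: str, state: str, expired: bool = False) -> bool:
    """Return True if the cluster meets the release conditions in this topic."""
    if state not in RELEASABLE_STATES:
        return False
    if billing_method == "Pay-As-You-Go":
        # A pay-as-you-go cluster can be released at any time.
        return True
    if billing_method == "Subscription":
        # A subscription cluster can be released only after it expires.
        return expired
    raise ValueError(f"Unknown billing method: {billing_method}")


print(can_release("Pay-As-You-Go", "Idle"))
print(can_release("Subscription", "Running", expired=False))
```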