This topic describes how to create an E-MapReduce (EMR) cluster.

Prerequisites

Authorization is completed in Resource Access Management (RAM). For more information, see Role authorization.

Procedure

  1. Go to the cluster creation page.
    1. Log on to the Alibaba Cloud E-MapReduce console.
    2. In the top navigation bar, select the region where you want to create a cluster. The region of a cluster cannot be changed after the cluster is created.
    3. Click Cluster Wizard in the Clusters section.
  2. Configure the cluster.

    To create a cluster, configure software, hardware, and basic parameters as guided by the wizard.

    Notice After a cluster is created, you cannot modify its parameters except for the cluster name. Make sure that all parameters are correct when you create a cluster.
    1. Configure software parameters.
      Parameter Description
      EMR Version

      The major version of EMR. The latest version is selected by default.

      Cluster Type The type of the cluster you want to create. EMR supports the following types of clusters:
      • Hadoop: Hadoop clusters provide multiple ecosystem components, such as Hadoop, Hive, Spark, Spark Streaming, Flink, Storm, Presto, Impala, Oozie, and Pig. Hadoop, Hive, and Spark are semi-hosted services and are used to store and compute large-scale distributed data offline. Spark Streaming, Flink, and Storm provide stream computing. Presto and Impala are used for interactive queries. For information about these components, see the Services section of the Status tab on the Clusters and Services page.
      • Kafka: Kafka clusters serve as a semi-hosted, distributed message system with high throughput and scalability. Kafka clusters provide a comprehensive service monitoring system that maintains cluster stability. Kafka clusters are professional, reliable, and secure. You do not need to deploy or maintain these clusters. These clusters are used in scenarios such as log collection and monitoring data aggregation. They can also be used for offline data processing, stream computing, and real-time data analysis.
      • ZooKeeper: ZooKeeper clusters provide a distributed and consistent lock service that facilitates coordination among large-scale Hadoop, HBase, and Kafka clusters.
      • Druid: Druid clusters provide a semi-hosted, real-time, and interactive analytic service. These clusters can query big data within milliseconds and ingest data in multiple ways. You can use Druid clusters with services such as EMR Hadoop, EMR Spark, Object Storage Service (OSS), and ApsaraDB for RDS to build a flexible and stable system for real-time queries.
      • Flink: Flink clusters support all features of the open-source Flink ecosystem and can be used with Alibaba Cloud OSS and other services.
      Required Services Default components required for a specific cluster type. After a cluster is created, you can add, start, or stop services on the cluster management page.
      Optional Services Other components you can specify as required. The relevant service processes for the components you specify are started by default.
      Note The more components you specify, the higher the instance specifications the cluster requires. When you configure the hardware, select an instance type that matches the number of components you specified. Otherwise, the cluster may not have sufficient resources to run the components.
      Advanced Settings
      • Kerberos Mode: specifies whether to enable Kerberos authentication for clusters. This mode is disabled by default because it is not required by clusters created for common users.
      • Custom Software Settings: customizes software settings. You can use a JSON file to customize the parameters of basic components required for a cluster, such as Hadoop, Spark, and Hive. For more information, see Software configuration. This feature is disabled by default.
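
      A minimal sketch of a custom software settings file is shown below. The JSON field names (ServiceName, FileName, ConfigKey, and ConfigValue) and the parameter used here are assumptions for illustration only; see Software configuration for the exact schema that your EMR version accepts.

        # Hypothetical custom software settings (the JSON field names are
        # assumptions for illustration; see Software configuration for the
        # authoritative schema). Each entry overrides one parameter in one
        # configuration file of a component. Save the JSON to a file, or paste
        # it into the Custom Software Settings field in the console.
        echo '[
          {
            "ServiceName": "YARN",
            "FileName": "yarn-site",
            "ConfigKey": "yarn.nodemanager.resource.memory-mb",
            "ConfigValue": "8192"
          }
        ]' > software-settings.json
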
    2. Configure hardware parameters.
      Section Parameter Description
      Billing Method Billing Method The billing method. Valid values are as follows:
      • Pay-As-You-Go: The system charges you for a cluster based on the hours the cluster is actually used. The fee is deducted on an hourly basis. The unit price of a pay-as-you-go cluster is higher than that of a subscription cluster of the same specification. We recommend that you use pay-as-you-go clusters for short-term test jobs or dynamically scheduled jobs.
      • Subscription: The system charges you for a cluster only once per subscription period. The unit price of a subscription cluster is lower than that of a pay-as-you-go cluster of the same specification. Subscription clusters are offered discounts, and longer subscription periods offer larger discounts.

        Subscription Period: the validity period of your subscription. You can get a 15% discount if you select a 12-month subscription period.

        Note We recommend that you create a pay-as-you-go cluster for a test run. If the cluster passes the test, you can create a subscription cluster for production.

      Network Settings Zone The zone where the cluster is created. Zones are physical locations within the same region and are interconnected over an internal network. In most cases, you can use the zone that is selected by default.
      Network Type

      The network type of the cluster. The Virtual Private Cloud (VPC) network type is selected by default. If you have not created a VPC, go to the VPC console to create one.

      VPC The VPC in the region where your cluster resides. If no VPC is available in the region, click Create VPC/VSwitch to create one.
      VSwitch The VSwitch in the selected zone of your VPC. If no VSwitch is available in the zone, create one.
      Security Group Name

      The security group to which you want to add your cluster. If this is the first time you use EMR, no security group exists, and you can enter a name to create one. If you have already created security groups in EMR, select one as required.

      Note The name must be 2 to 64 characters in length and can contain letters, digits, underscores (_), and hyphens (-). It must start with a letter.
      High Availability High Availability

      If the high availability mode is enabled, two master nodes are created in a Hadoop cluster to ensure the availability of the ResourceManager and NameNode processes. HBase clusters also support high availability. By default, an HBase cluster uses a core node as the backup of the master node. If you enable the high availability mode described here, the HBase cluster uses two master nodes instead, which is more secure and reliable.

      Instance
      • Master Instance: runs control processes, such as ResourceManager and NameNode.

        You can select an instance type as required. For more information, see Instance families.

        • System Disk Type: You can select an SSD, ESSD, or ultra disk based on your needs.
        • Disk Size: You can resize a disk based on your needs. The recommended minimum disk size is 120 GB.
        • Data Disk Type: You can select an SSD, ESSD, or ultra disk based on your needs.
        • Disk Size: You can resize a disk based on your needs. The recommended minimum disk size is 80 GB.
        • Master Nodes: One master node is configured by default.
      • Core Instance: stores all the data of a cluster. You can add core nodes as needed after a cluster is created.
        • System Disk Type: You can select an SSD, ESSD, or ultra disk based on your needs.
        • Disk Size: You can resize a disk based on your needs. The recommended minimum disk size is 120 GB.
        • Data Disk Type: You can select an SSD, ESSD, or ultra disk based on your needs.
        • Disk Size: You can resize a disk based on your needs. The recommended minimum disk size is 80 GB.
        • Core Nodes: Two core nodes are configured by default. You can change the number of core nodes as required.
      • Task Instance: stores no data and is used to adjust the computing capacity of a cluster. No task node is configured by default. You can add task nodes as required.
    3. Configure basic parameters.
      Section Parameter Description
      Basic Information Cluster Name The name of the cluster. The name must be 1 to 64 characters in length and can contain only letters, digits, hyphens (-), and underscores (_).
      Type The type of metadata storage for the cluster. Use User-created RDS is recommended. Valid values are as follows:
      • Built-in MySQL: Metadata is stored in the local MySQL database of a cluster.
      • Unified Metabases: Metadata is stored in an external metadatabase and is retained even after a cluster is released. For more information, see Manage Hive metadata in a unified manner.
      • Use User-created RDS: Metadata is stored in a user-created ApsaraDB for RDS database.
      Assign Public Network IP Specifies whether an EIP address is associated with a cluster.
      Note When Type is set to Unified Metabases, Assign Public Network IP is disabled by default. This feature is available only for Kafka clusters. If no public IP address is assigned, you can access the cluster only over the internal network after it is created. To access the cluster over the Internet, apply for an EIP address in the ECS console. For more information, see Elastic IP Addresses.
      Remote Logon Specifies whether to enable port 22 of a security group. Port 22 is disabled by default.
      Key Pair The key pair used to log on to the master node. For information about how to use a key pair, see SSH key pair overview.
      Password The password used to log on to a master node. The password must be 8 to 30 characters in length and contain uppercase letters, lowercase letters, digits, and special characters. The following special characters are supported:

      ! @ # $ % ^ & *

      Advanced Settings Add Knox User The user added to access the Web UIs of open-source big data software.
      Permission Settings The RAM roles that allow applications running in a cluster to access other Alibaba Cloud services. You can use the default RAM roles.
      • EMR Role: The value is fixed to AliyunEMRDefaultRole and cannot be changed. This RAM role authorizes a cluster to access other Alibaba Cloud services, such as ECS and OSS.
      • ECS Role: You can also assign an application role to a cluster. Then, EMR applies for a temporary AccessKey pair when applications running on the compute nodes of that cluster access other Alibaba Cloud services, such as OSS. This way, you do not need to manually enter an AccessKey pair. You can grant the application role access permissions on specific Alibaba Cloud services as required.
      Bootstrap Actions Optional. You can configure bootstrap actions to run custom scripts before a cluster starts Hadoop. For more information, see Bootstrap actions. A sample script is provided after this table.
      Tag Optional. You can bind a tag when you create a cluster or bind a tag on the cluster details page after a cluster is created. For more information, see Cluster tags.
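
      For reference, the following is a minimal sketch of a bootstrap action script. The script name, the OSS path, and the installed package are hypothetical examples; point the Bootstrap Actions setting at the OSS path where you upload your own script.

        #!/bin/bash
        # Hypothetical bootstrap action script (install-tools.sh). Upload it to an
        # OSS path such as oss://your-bucket/emr-scripts/install-tools.sh (example
        # path only) and reference that path in the Bootstrap Actions setting. The
        # script runs on each node before the cluster starts Hadoop.
        set -e

        # Install an extra OS package that your jobs depend on (example package).
        sudo yum install -y nmap-ncat
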
      Note The cluster configuration appears on the right side when you configure parameters. After you complete the preceding configuration and click Next: Confirm, you are directed to the Confirm step, in which you can confirm the configuration and the fee for the creation of your cluster. The fee varies with the billing method. For a pay-as-you-go cluster, the price per hour appears. For a subscription cluster, the total fee appears.
    4. Verify that the configuration is correct and click Create.
      Notice
      • Pay-as-you-go clusters: Creation starts immediately after you click Create. On the Cluster Management tab that appears, the cluster is displayed in the cluster list in the Initializing state. It takes several minutes to create a cluster. After the cluster is created, its status changes to Idle.
      • Subscription clusters: An order is generated after you click Create. The cluster is created after you pay the fee.

      If a cluster fails to be created, Creation Failed appears in the Status column of the cluster. Move the pointer over the red exclamation point (!) to view the cause.

      You do not need to handle creation failures. If a cluster fails to be created, no computing resources are created, and the failed cluster is automatically removed from the cluster list after three days.

  3. Optional: Log on to a target core node.
    1. On the master node, switch to the hadoop user by running the following command:
      su hadoop
    2. Log on to the target core node by using password-free SSH authentication.
      ssh emr-worker-1
    3. Run commands with sudo to obtain root permissions. For example:
      sudo vi /etc/hosts
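
    The following sketch strings these steps together and adds a quick check. The hostname emr-worker-1 follows the preceding example; the actual node hostnames of your cluster are listed in /etc/hosts on the master node.

      # On the master node, switch to the hadoop user, which is configured for
      # password-free SSH access to the other nodes of the cluster.
      su hadoop

      # List the node hostnames registered on the master node (for example,
      # emr-worker-1, emr-worker-2, and so on; naming assumed from the example above).
      grep emr-worker /etc/hosts

      # Confirm that password-free logon works by running a command on a core node.
      ssh emr-worker-1 hostname

      # Log on to the core node, and prefix commands with sudo to run them with
      # root permissions.
      ssh emr-worker-1
      sudo vi /etc/hosts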