This topic describes how to create an E-MapReduce (EMR) cluster.

Prerequisites

Authorization is completed in Resource Access Management (RAM). For more information, see Authorize roles.

Procedure

  1. Go to the cluster creation page.
    1. Log on to the Alibaba Cloud EMR console.
    2. In the top navigation bar, select the region where you want to create a cluster and select a resource group based on your business requirements.
      • The region of a cluster cannot be changed after the cluster is created.
      • All resources under your account are displayed by default.
    3. Click Cluster Wizard in the Clusters section.
  2. Configure the cluster.
    To create a cluster, you must configure software, hardware, and basic parameters as guided by the wizard.
    Notice After a cluster is created, you cannot modify its parameters except for the cluster name. Make sure that all parameters are correct when you create a cluster.
    1. Configure software parameters.
      Parameter Description
      Cluster Type The type of the cluster you want to create. EMR supports the following types of clusters:
      • Hadoop:
        • Provides Hadoop, Hive, and Spark components as semi-hosted services for the storage and offline computing of large-scale distributed data.
        • Provides Spark Streaming, Flink, and Storm components for stream computing.
        • Provides Presto and Impala components for interactive queries.
        • Provides other Hadoop ecosystem components, such as Oozie and Pig.
      • Kafka: Kafka clusters serve as a semi-hosted, distributed message system with high throughput and scalability. Kafka clusters provide a comprehensive service monitoring system and a metadata management mechanism. These clusters are used in scenarios such as log collection and monitoring data aggregation. They can also be used for offline data processing, stream computing, and real-time data analysis.
      • Druid: Druid clusters provide a semi-hosted, real-time, and interactive analytic service. These clusters can query big data within milliseconds and ingest data in multiple ways. You can use Druid clusters with services such as EMR Hadoop, EMR Spark, Object Storage Service (OSS), and ApsaraDB RDS to build a flexible and stable system for real-time queries.
      • Data Science: Data Science clusters are commonly used in big data and AI scenarios. Data Science clusters support the offline extract, transform, load (ETL) of big data based on Hive and Spark, as well as TensorFlow model training. You can use a CPU+GPU heterogeneous computing framework and deep learning algorithms accelerated by NVIDIA GPUs to run computing jobs more efficiently.
      • Dataflow: Dataflow clusters provide an enterprise-level big data processing platform. The platform is developed based on EMR Hadoop and the Ververica Platform powered by Apache Flink, is compatible with open source Flink APIs, and provides additional business value-added capabilities.
      Cloud Native Option The option On ECS is selected by default.
      EMR Version The major version of EMR. The latest version is selected by default.
      Required Services The default components required for a specific cluster type. After a cluster is created, you can start or stop components on the cluster management page.
      Optional Services The other components that you can specify as required. The relevant service processes for the components you specify are started by default.
      Note The more components you specify, the higher the instance specifications your cluster requires. When you configure the hardware, select instance types that match the number of components you specified. Otherwise, the cluster may have insufficient resources to run the components.
      Advanced Settings
      • Kerberos Mode: specifies whether to enable Kerberos authentication for the cluster. This feature is disabled by default. In most cases, common users do not need to enable it.
      • Custom Software Settings: customizes software settings. You can use a JSON file to customize the parameters of the basic components required for a cluster, such as Hadoop, Spark, and Hive. For more information, see Software configuration. This feature is disabled by default.
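      For example, if you enable Custom Software Settings, you provide a JSON file such as the following. This is a minimal sketch: the field names follow the Software configuration topic, and the YARN key and value are placeholders that you replace with your own settings.

      [
        {
          "ServiceName": "YARN",
          "FileName": "yarn-site",
          "ConfigKey": "yarn.nodemanager.resource.memory-mb",
          "ConfigValue": "8192"
        }
      ]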
    2. Configure hardware parameters.
      Section Parameter Description
      Billing Method Billing Method Subscription is selected by default. EMR supports the following billing methods:
      • Pay-As-You-Go: a billing method that allows you to pay for an instance after you use it. You are charged for a cluster on an hourly basis, based on the hours the cluster is actually used. We recommend that you use pay-as-you-go clusters for short-term test jobs or dynamically scheduled jobs.
      • Subscription: a billing method that allows you to use an instance only after you pay for the instance.
        Note

        We recommend that you create a pay-as-you-go cluster for a test run. If the cluster passes the test, you can create a subscription cluster for production.

      Network Settings Zone The zone where you want to create a cluster. Zones are different geographical areas located in the same region. They are interconnected by an internal network. In most cases, you can use the zone selected by default.
      Network Type The network type of the cluster. The VPC network type is selected by default.
      VPC The VPC where you want to deploy the cluster. Select a VPC in the same region as the zone. If no VPC is available in the region, click Create VPC/VSwitch to create one.
      VSwitch The vSwitch of the cluster. Select a vSwitch in the specified zone. If no vSwitch is available in the zone, create one.
      Security Group Name The security group to which you want to add your cluster. If you are using EMR for the first time, no security group exists. Enter a name to create one. If you have already created security groups in EMR, select one as required.
      Notice Do not use an advanced security group that was created in the ECS console.
      High Availability High Availability This feature is disabled by default. For a Hadoop cluster, if high availability is enabled, two or three master nodes are created in the cluster to ensure the availability of the ResourceManager and NameNode processes.

      HBase clusters always work in high availability mode. If you do not enable this feature, only one master node is created and a core node is used to support high availability. If you enable this feature, two master nodes are created for higher security and reliability.

      Instance
      • Master Instance: runs control processes, such as ResourceManager and NameNode.
        You can select an instance type as required. For more information, see Instance families.
        • System Disk Type: You can select an SSD, ESSD, or ultra disk based on your needs.
        • Disk Size: You can resize a disk based on your needs. The recommended minimum disk size is 120 GB. Valid values: 40 to 2048. Unit: GB.
        • Data Disk Type: You can select an SSD, ESSD, or ultra disk based on your needs.
        • Disk Size: You can resize a disk based on your needs. The recommended minimum disk size is 80 GB. Valid values: 40 to 32768. Unit: GB.
        • Master Nodes: One master node is configured by default. If high availability is enabled, two or three master nodes are configured.
      • Core Instance: stores all the data of a cluster. You can add core nodes as needed after a cluster is created.
        • System Disk Type: You can select an SSD, ESSD, or ultra disk based on your needs.
        • Disk Size: You can resize a disk based on your needs. The recommended minimum disk size is 120 GB.
        • Data Disk Type: You can select an SSD, ESSD, or ultra disk based on your needs.
        • Disk Size: You can resize a disk based on your needs. The recommended minimum disk size is 80 GB.
        • Core Nodes: Two core nodes are configured by default. You can change the number of core nodes as required.
      • Task Instance: stores no data. It is used to adjust the computing capabilities of clusters. No task node is configured by default. You can add task nodes as required.
    3. Configure basic parameters.
      Section Parameter Description
      Basic Information Cluster Name The name of the cluster. The name must be 1 to 64 characters in length and can contain only letters, digits, hyphens (-), and underscores (_).
      Type
      • Data Lake Metadata: Metadata is stored in a data lake. If you have not activated Data Lake Formation, activate the service first.
      • Built-in MySQL: Metadata is stored in the local MySQL database of a cluster.
      • Unified Metabases: Metadata is stored in an external metadatabase and is retained even after a cluster is released. For more information, see Manage Hive metadata in a centralized manner.
      • Independent ApsaraDB RDS for MySQL: Metadata is stored in an ApsaraDB for RDS database. For more information, see Configure an independent ApsaraDB for RDS instance.
      We recommend that you set this parameter to Independent ApsaraDB RDS for MySQL.
      Assign Public IP Address Specifies whether an EIP address is associated with the cluster. By default, this feature is disabled.
      Note If Type is set to Unified Metabases, Assign Public IP Address is disabled by default, and the feature can be enabled only for Kafka clusters. After a cluster is created, you can access it only over the internal network. To access the cluster over the Internet, apply for an elastic IP address (EIP) in the ECS console. For more information, see Elastic IP Addresses.
      Remote Logon Specifies whether to enable port 22 of a security group. Port 22 is disabled by default. A sample logon command is provided after this section.
      Key Pair For information about how to use a key pair, see SSH key pair overview.
      Password The password used to log on to a master node. The password must be 8 to 30 characters in length and contain uppercase letters, lowercase letters, digits, and special characters.

      The following special characters are supported: ! @ # $ % ^ & *
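      For example, if you enable Remote Logon and assign a public IP address, you can log on to the master node from your own machine. The following commands are a sketch: <master-public-ip> is a placeholder, and whether you authenticate with the password or a key pair depends on the options described above.

      # Log on with the password that you set for the cluster:
      ssh root@<master-public-ip>
      # Or log on with the private key of the key pair that you selected:
      ssh -i /path/to/your-key.pem root@<master-public-ip>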

      Advanced Settings Add User The user added to access the web UIs of open source big data software.
      Permission Settings The RAM roles that allow applications running in a cluster to access other Alibaba Cloud services. You can use the default RAM roles.
      • EMR Role: The value is fixed to AliyunEMRDefaultRole and cannot be changed. This RAM role authorizes a cluster to access other Alibaba Cloud services, such as ECS and OSS.
      • ECS Role: You can also assign an application role to a cluster. Then, EMR applies for a temporary AccessKey pair when applications running on the compute nodes of that cluster access other Alibaba Cloud services, such as OSS. This way, you do not need to manually enter an AccessKey pair. You can grant the access permissions of the application role on specific Alibaba Cloud services as required.
      Bootstrap Actions Optional. You can configure bootstrap actions to run custom scripts before the cluster starts Hadoop. For more information, see Bootstrap actions. A sample script is provided after this table.
      Tag Optional. You can bind a tag when you create a cluster or bind a tag on the cluster details page after a cluster is created. For more information, see Manage cluster tags.
      Resource Group Optional. For more information, see Use resource groups.
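      The following is a hypothetical bootstrap action script for the Bootstrap Actions row above. A bootstrap action runs your custom script before the cluster starts Hadoop; this sketch only installs an extra OS package, and the package is a placeholder for whatever your jobs need. Upload the script to OSS and reference it in the Bootstrap Actions setting.

      #!/bin/bash
      # Hypothetical bootstrap action: runs on cluster nodes before Hadoop starts.
      # Exit on the first error so that a failed action is easy to spot.
      set -e
      # Install an extra OS package; replace htop with whatever your jobs require.
      yum install -y htop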
      Note The cluster configurations appear on the right side of the page when you configure parameters. After you complete the configurations, click Next: Confirm. You are directed to the Confirm step, in which you can confirm the configurations and the fee for the creation of your cluster. The fee varies based on the billing method.
    4. Verify that the configuration is correct and click Create.
      Notice
      • Pay-as-you-go clusters: Creation starts immediately after you click Create.

        After the cluster is created, its status changes to Idle.

      • Subscription clusters: An order is generated after you click Create. The cluster is created after you pay the fee.
  3. Optional: Log on to a core node as required.
    1. On the master node, run the following command to switch to the hadoop user:
      su hadoop
    2. Log on to the destination core node by using password-free SSH authentication.
      ssh emr-worker-1
    3. Run the following sudo command to obtain root permissions:
      sudo su - root
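    4. Optional: Run the following command to check which service processes are running on the node. This is a quick sanity check; the exact process names depend on the services you selected when you created the cluster.
      jps
      # On a Hadoop core node, the output typically includes DataNode and NodeManager.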