This topic describes how to create an E-MapReduce (EMR) cluster on the EMR on ECS page of the EMR console.

Prerequisites

RAM authorization is complete. For more information, see Assign roles.

Procedure

  1. Go to the EMR on ECS page.
    1. Log on to the EMR console. In the left-side navigation pane, click EMR on ECS.
    2. In the top navigation bar, select the region where you want to create a cluster and select a resource group based on your business requirements.
      • You cannot change the region of a cluster after the cluster is created.
      • By default, all resource groups in your account are displayed.
  2. On the EMR on ECS page, click Create Cluster.
  3. Configure the cluster.
    When you create the cluster, you must configure the software, hardware, and basic information, and then confirm the order for the cluster.
    Important After a cluster is created, you cannot modify its parameters except for the cluster name. Make sure that all parameters are correctly configured when you create a cluster.
    1. Configure software parameters.
      Region: The geographic location where the Elastic Compute Service (ECS) instances of the cluster are located.
      Business Scenario:
      Note If you create an EMR cluster for the first time after 17:00 (UTC+8) on December 19, 2022, you cannot create a Hadoop, Data Science, Presto, or ZooKeeper cluster.
      • New Data Lake: provides a big data computing cluster that allows you to analyze data in a flexible, reliable, and efficient manner.
        • Supports the data lake architecture and accelerates data queries in data lakes based on JindoFS.
        • Supports the OSS-HDFS (fully managed HDFS) service for storage, which helps you reduce O&M costs. You are charged based on actual usage of the OSS-HDFS service.

        For more information, see DataLake cluster.

        In this business scenario, you can create a DataLake cluster.

      • Data Analytics: provides efficient, real-time, and flexible data analytics capabilities to meet requirements of various business scenarios, such as user profiling, recipient selection, BI reports, and business analytics. You can write data to OLAP engines such as ClickHouse and StarRocks for analysis by importing data or using external tables.

        In this business scenario, you can create an OLAP cluster.

      • Real-time Data Streaming: provides an end-to-end (E2E) real-time computing solution. Dataflow clusters incorporate Kafka, a distributed message system with high throughput and scalability, and the commercial Flink kernel provided by Ververica, which is powered by Apache Flink. The clusters are used to resolve various E2E real-time computing issues and are widely used in scenarios such as real-time data ETL and log collection and analysis. You can use either of the two components or both.

        In this business scenario, you can create a Dataflow cluster.

      • Data Service:
        • Provides a DataServing cluster that allows you to analyze data in a flexible, reliable, and efficient manner.
        • Provides semi-managed HBase clusters and can decouple computing clusters from data storage based on the OSS-HDFS (JindoFS) service.
        • Supports data caching by using JindoData to improve the read and write performance of DataServing clusters.

        In this business scenario, you can create a DataServing cluster.

      • More > Machine Learning: is used for big data and AI scenarios.
        • Provides a distributed deep learning framework.
        • Provides more than 200 typical machine learning algorithm packages.
        • Provides AutoML capabilities and more than 10 deep learning algorithms, covering scenarios such as recommendation and advertising.

        In this business scenario, you can create a Data Science cluster.

      • More > Data Lake: provides frameworks and pipelines for you to process and analyze large amounts of big data, and supports open source components such as Apache Hive, Spark, and Presto. The following types of clusters are supported:
        • Hadoop:
          • Provides a complete list of open source components that are fully compatible with the Hadoop ecosystem.
          • Supports various scenarios such as big data offline processing, real-time processing, and interactive query.
          • Supports the data lake architecture and accelerates data queries in data lakes based on JindoFS.
        • ZooKeeper: provides a distributed, consistent lock service that facilitates coordination among large-scale Hadoop, HBase, and Kafka clusters.
        • Presto: is an in-memory distributed SQL engine used for interactive queries. Presto clusters support various data sources and are suitable for complex analysis of petabytes of data and cross-data source queries.
      • More > Custom Cluster: allows you to select services based on your business requirements when you create a custom cluster.
        Note We recommend that you do not deploy multiple storage services on one node group in the production environment.
      Product Version: The version of EMR. The latest version is displayed by default.
      High Service Availability: This switch is turned off by default. If you turn on the switch, multiple master nodes are created in the cluster to ensure the availability of the ResourceManager and NameNode processes.
      Optional Services (Select One At Least): The services that you can select for the cluster. You can select services based on your business requirements. The processes related to the services that you select are automatically started.
      Note The more services you select, the higher the instance specifications the cluster requires. When you configure the hardware, select instance types that match the number of services you selected. Otherwise, the cluster may not have sufficient resources to run the services.
      Metadata:
      • DLF Unified Metadata: Metadata is stored in Data Lake Formation (DLF).

        If you select DLF Unified Metadata, the system selects a DLF catalog for you to store metadata. If you want the metadata of different clusters to be stored in different catalogs, go to the DLF console to create data catalogs.

      • Self-managed RDS: Metadata is stored in a self-managed ApsaraDB RDS database. For more information, see Configure a self-managed ApsaraDB RDS for MySQL database.
      • Built-in MySQL: Metadata is stored in the local MySQL database of a cluster.
        Note We recommend that you select Built-in MySQL only in the test environment and select DLF Unified Metadata or Self-managed RDS in the production environment.
      Advanced Settings:
      • Kerberos Authentication: specifies whether to enable Kerberos authentication for the cluster. This switch is turned off by default.
        Important The Knox and Kudu services do not support Kerberos authentication.
      • Custom Software Configuration: specifies whether to customize the configurations of software. You can use a JSON file to customize the configurations of the basic software required for a cluster, such as Hadoop, Spark, and Hive. For more information, see Customize software configurations. This switch is turned off by default.
        Note For more information about how to configure the parallelism of Hive jobs, see How do I estimate the maximum number of Hive jobs that can be concurrently run?.
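The custom software configuration mentioned above is supplied as a JSON snippet. The following fragment is illustrative only; the exact schema depends on your EMR version, and the service, file, and property names shown here are assumptions, not values from this document:

```json
[
  {
    "ServiceName": "HDFS",
    "FileName": "hdfs-site",
    "ConfigKey": "dfs.replication",
    "ConfigValue": "3"
  },
  {
    "ServiceName": "SPARK3",
    "FileName": "spark-defaults",
    "ConfigKey": "spark.executor.memory",
    "ConfigValue": "4g"
  }
]
```

Each entry overrides one property in one configuration file of the named service; see Customize software configurations for the authoritative format.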
    2. Configure hardware parameters.
      Billing Method: The billing method of the cluster. Subscription is selected by default. EMR supports the following billing methods:
      • Pay-as-you-go: a billing method in which you pay for an instance after you use it. You are charged for a cluster based on the hours the cluster is actually used. Bills are generated on an hourly basis at the top of every hour. We recommend that you use pay-as-you-go clusters for short-term test jobs or dynamically scheduled jobs.
      • Subscription: a billing method that allows you to use an instance only after you pay for the instance.
        Note
        • We recommend that you create a pay-as-you-go cluster for a test run. If the cluster passes the test, you can create a subscription cluster for production.
        • If you select Subscription for Billing Method, you must also specify Subscription Duration and Auto-renewal. By default, the subscription period is six months and the Auto-renewal switch is turned on. If you turn on the Auto-renewal switch, the system renews your subscription for one more month seven days before the expiration date. For more information, see Renewal policy.
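The renewal timing described in this note is simple date arithmetic; a minimal sketch (the function name is illustrative):

```python
from datetime import date, timedelta

def auto_renewal_date(expiration: date) -> date:
    """Return the day on which auto-renewal runs: seven days
    before the subscription expiration date."""
    return expiration - timedelta(days=7)

# A subscription that expires on July 15 is renewed on July 8.
print(auto_renewal_date(date(2024, 7, 15)))  # → 2024-07-08
```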
      Zone: The zone where you want to create a cluster. A zone in a region is a physical area with independent power supplies and network facilities. Clusters in zones within the same region can communicate with each other over an internal network. In most cases, you can use the zone that is selected by default.
      VPC: The virtual private cloud (VPC) where you want to deploy the cluster. An existing VPC is selected by default.

      If you want to use a new VPC, go to the VPC console to create one. For more information, see Create and manage a VPC.

      vSwitch: The vSwitch of the cluster. Select a vSwitch in the specific zone based on your business requirements. If no vSwitch is available in the zone, go to the VPC console to create one. For more information, see Create and manage a vSwitch.
      Default Security Group: The security group of the cluster. An existing security group is selected by default. For more information about security groups, see Overview.

      You can also click create a new security group to create a security group in the ECS console. For more information, see Create a security group.

      Important Do not use an advanced security group that is created in the ECS console.
      Node Group

      The node groups of the cluster. You can select instance types based on your business requirements. For more information, see Instance families.

      • Master node group: runs control processes, such as ResourceManager and NameNode.
        • Add to Deployment Set: If you turn on the High Service Availability switch, the master nodes are added to a deployment set by default.

          For more information about deployment sets, see Overview.

        • System Disk: You can select a standard SSD, enhanced SSD, or ultra disk based on your business requirements.

          You can adjust the size of the system disk based on your business requirements. Valid values: 80 to 5000. Unit: GiB.

        • Data Disk: You can select standard SSDs, enhanced SSDs, or ultra disks based on your business requirements.

          You can adjust the size of the data disks based on your business requirements. Valid values: 40 to 32768. Unit: GiB.

        • Instances: One master node is configured by default. If you turn on the High Service Availability switch, multiple master nodes can be configured.
        • Additional Security Group: An additional security group allows the cluster to interact with specific external resources and applications. You can associate a node group with up to two additional security groups.
        • Assign Public Network IP: specifies whether to associate an elastic IP address (EIP) with the cluster. This switch is turned off by default. You can assign public IP addresses only to the node groups of DataLake clusters.
          Note To access the cluster over the Internet, you must apply for a public IP address in the ECS console. For information about how to apply for an EIP, see Elastic IP addresses.
      • Core node group: stores all the data of a cluster. You can add core nodes based on your business requirements after a cluster is created.
        • Add to Deployment Set: This switch is turned off by default. For more information about deployment sets, see Overview.
        • System Disk: You can select a standard SSD, enhanced SSD, or ultra disk based on your business requirements.

          You can adjust the size of the system disk based on your business requirements. Valid values: 80 to 5000. Unit: GiB.

        • Data Disk: You can select standard SSDs, enhanced SSDs, or ultra disks based on your business requirements.

          You can adjust the size of the data disks based on your business requirements. Valid values: 40 to 32768. Unit: GiB.

        • Instances: Two core nodes are configured by default. You can change the number of core nodes based on your business requirements.
        • Additional Security Group: An additional security group allows the cluster to interact with specific external resources and applications. You can associate a node group with up to two additional security groups.
        • Assign Public Network IP: specifies whether to associate an elastic IP address (EIP) with the cluster. This switch is turned off by default. You can assign public IP addresses only to the node groups of DataLake clusters.
          Note To access the cluster over the Internet, you must apply for a public IP address in the ECS console. For information about how to apply for an EIP, see Elastic IP addresses.
      • Task node group: stores no data and is used to adjust the computing capabilities of clusters. No task node group is configured by default. You can configure a task node group based on your business requirements.
    3. Configure basic parameters.
      Cluster Name: The name of the cluster. The name must be 1 to 64 characters in length and can contain only letters, digits, hyphens (-), and underscores (_).
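The naming rule above maps directly to a regular expression; a quick sketch (the function name is illustrative):

```python
import re

# Cluster name rule: 1 to 64 characters; letters, digits,
# hyphens (-), and underscores (_) only.
_NAME_RE = re.compile(r"^[A-Za-z0-9_-]{1,64}$")

def is_valid_cluster_name(name: str) -> bool:
    return _NAME_RE.fullmatch(name) is not None

print(is_valid_cluster_name("emr-prod_01"))  # True
print(is_valid_cluster_name("bad name!"))    # False
```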
      Identity Credentials:
      • Key Pair (default): Use an SSH key pair to access the Linux instance.

      For information about how to use a key pair, see SSH key pair overview.

      • Password: Use the password that you set for the master node to access the Linux instance.

      The password must be 8 to 30 characters in length and must contain uppercase letters, lowercase letters, digits, and special characters.

      The following special characters are supported:

      ! @ # $ % ^ & *
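The password rules above can be checked programmatically; a minimal sketch (the function name is illustrative, and the rules are taken directly from this table):

```python
# Special characters listed as supported above.
SPECIALS = set("!@#$%^&*")

def is_valid_password(pw: str) -> bool:
    """Check the rules above: 8 to 30 characters; at least one uppercase
    letter, lowercase letter, digit, and special character; only
    alphanumeric characters and the listed specials allowed."""
    return (
        8 <= len(pw) <= 30
        and any(c.isupper() for c in pw)
        and any(c.islower() for c in pw)
        and any(c.isdigit() for c in pw)
        and any(c in SPECIALS for c in pw)
        and all(c.isalnum() or c in SPECIALS for c in pw)
    )

print(is_valid_password("Emr#2024pass"))  # True
print(is_valid_password("short1!"))       # False (too short, no uppercase)
```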

      Advanced Settings:
      • ECS Application Role: You can assign an ECS application role to a cluster. Then, EMR applies for a temporary AccessKey pair when applications running on the compute nodes of the cluster access other Alibaba Cloud services, such as OSS. This way, you do not need to manually enter an AccessKey pair. You can grant the access permissions of the application role on specific Alibaba Cloud services based on your business requirements.
      • Bootstrap Actions: Optional. You can configure bootstrap actions to run custom scripts before a cluster starts Hadoop. For more information, see Manage bootstrap actions.
      • Tags: Optional. You can add a tag when you create a cluster or add a tag on the Basic Information tab after a cluster is created. For more information, see Manage and use tags.
      • Resource Group: Optional. For more information, see Use resource groups.
      • Data Disk Encryption: Optional. You can turn on this switch only when you create a cluster. For more information, see Enable data disk encryption.
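A bootstrap action is an ordinary shell script that EMR runs on each node before Hadoop starts. The following sketch is illustrative only; the marker file path and the idea of recording a completion marker are assumptions for demonstration, not an EMR requirement:

```shell
#!/bin/bash
# Hypothetical bootstrap action: write a marker file so that you can
# later verify on each node that the script ran before Hadoop started.
set -e

MARKER=/tmp/emr-bootstrap.done   # illustrative path
echo "bootstrap finished at $(date -u +%FT%TZ)" > "$MARKER"
```

In practice, bootstrap scripts are typically used to install extra packages or tune OS settings on every node; see Manage bootstrap actions for how scripts are uploaded and attached to a cluster.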
      Note After you complete the configurations, click Next: Confirm. In the Confirm step, you can confirm the configurations and the fee for the creation of your cluster. The fee varies based on the billing method.
  4. After you verify that the configurations are correct, read the terms of service, select the check box, and then click Confirm.
    Important
    • Pay-as-you-go clusters: The cluster is created immediately.

      After the cluster is created, the cluster is in the Running state.

    • Subscription clusters: An order is generated. The cluster will be created after you complete the payment.