
E-MapReduce: Create a cluster

Last Updated: Aug 08, 2023

This topic describes how to create an E-MapReduce (EMR) cluster on the EMR on ECS page of the EMR console.

Note

If you create an EMR cluster for the first time after 17:00 (UTC+8) on December 19, 2022, you cannot create Hadoop, Data Science, Presto, or ZooKeeper clusters.

Prerequisites

Resource Access Management (RAM) authorization is complete. For more information, see Assign roles.

Precautions

When you create a DataLake cluster in the new data lake scenario, a Dataflow cluster, a DataServing cluster, or a custom cluster of EMR V5.12.1, EMR V3.46.1, or a later minor version, and the services that you select do not depend on the nodes in a newly added task node group, you can remove that node group. To do so, click Remove Node Group in the Actions column of the task node group in the Node Group section.

Procedure

  1. Go to the EMR on ECS page.

    1. Log on to the EMR console. In the left-side navigation pane, click EMR on ECS.

    2. In the top navigation bar, select the region where you want to create a cluster and select a resource group based on your business requirements.

      • You cannot change the region of a cluster after the cluster is created.

      • By default, all resource groups in your account are displayed.

  2. On the EMR on ECS page, click Create Cluster.

  3. Configure the cluster.

    When you create a cluster, you need to configure the software, hardware, and basic information, and confirm the order for the cluster.

    Note

    After a cluster is created, you cannot modify its parameters except for the cluster name. Make sure that all parameters are correctly configured when you create a cluster.

    1. Configure software parameters.


      Region

      The geographic location in which the Elastic Compute Service (ECS) instances of the cluster are deployed.

      Business Scenario

      • New Data Lake (DataLake): provides a big data compute engine that allows you to analyze data in a flexible, reliable, and efficient manner.

        • Supports the data lake architecture and accelerates data queries in data lakes based on JindoFS.

        • Supports the OSS-HDFS (fully managed HDFS) service for storage, which helps you reduce O&M costs. You are charged based on actual usage of the OSS-HDFS service.

        For more information, see DataLake cluster.

      • Data Analytics (OLAP): provides efficient, real-time, and flexible data analytics capabilities to meet the requirements of various business scenarios, such as user profiling, recipient selection, BI reports, and business analytics. You can write data to online analytical processing (OLAP) engines such as ClickHouse and StarRocks for analysis by importing data or using external tables.

      • Real-time Data Streaming (Dataflow): provides an end-to-end (E2E) real-time computing solution. Dataflow clusters incorporate Kafka, a distributed message system with high throughput and scalability, and the commercial Flink kernel provided by Ververica, the company founded by the creators of Apache Flink. The clusters are used to resolve various E2E real-time computing issues and are widely used in real-time data extract, transform, and load (ETL) and log collection and analysis scenarios. You can use either of the two components or both.

      • Data Service (DataServing):

        • Provides a DataServing cluster that allows you to analyze data in a flexible, reliable, and efficient manner.

        • Provides semi-managed HBase clusters and can decouple computing clusters from data storage based on the OSS-HDFS (JindoFS) service.

        • Supports data caching by using JindoData to improve the read and write performance of DataServing clusters.

        For more information, see DataServing cluster.

      • More > Custom Cluster (Custom): allows you to select services based on your business requirements when you create a custom cluster.

        Note

        We recommend that you do not deploy multiple storage services on one node group in the production environment.

      • More > Machine Learning (Data Science): used for big data and AI scenarios.

        • Provides a distributed deep learning framework.

        • Provides more than 200 typical machine learning algorithm packages.

        • Provides AutoML capabilities and more than 10 deep learning algorithms, covering scenarios such as recommendation and advertising.

      • More > Data Lake: provides frameworks and pipelines for you to process and analyze large amounts of data, and supports open source components such as Apache Hive, Spark, and Presto. The following types of clusters are supported:

        • Hadoop:

          • Provides a complete list of open source components that are fully compatible with the Hadoop ecosystem.

          • Supports various scenarios such as big data offline processing, real-time processing, and interactive query.

          • Supports the data lake architecture and accelerates data queries in data lakes based on JindoFS.

        • ZooKeeper: provides a distributed, consistent lock service that facilitates coordination among large-scale Hadoop, HBase, and Kafka clusters.

        • Presto: an in-memory distributed SQL engine for interactive queries. Presto clusters support various data sources and are suitable for complex analysis of petabytes of data and cross-data source queries.

      Product Version

      The version of EMR. The latest version is displayed by default.

      High Service Availability

      This switch is turned off by default. If you turn on the switch, multiple master nodes are created in the cluster to ensure the availability of the ResourceManager and NameNode processes.

      Optional Services (Select at Least One)

      The services that you can select for the cluster. You can select services based on your business requirements. The processes related to the services that you select are automatically started.

      Note

      The more services you select, the higher the instance specifications the cluster requires. When you configure the hardware, select an instance type that matches the number of services that you selected. Otherwise, the cluster may not have sufficient resources to run the services.

      Metadata

      • DLF Unified Metadata: Metadata is stored in Data Lake Formation (DLF).

        If you select DLF Unified Metadata, the system selects a DLF catalog for you to store metadata. If you want the metadata of different clusters to be stored in different catalogs, go to the DLF console to create data catalogs.

      • Self-managed RDS: Metadata is stored in a self-managed ApsaraDB RDS database.

        If you select Self-managed RDS, you must configure the parameters that are related to database connection. For more information, see Configure a self-managed ApsaraDB RDS for MySQL database.

      • Built-in MySQL: Metadata is stored in the local MySQL database of a cluster.

        Note

        We recommend that you select Built-in MySQL only in the test environment and select DLF Unified Metadata or Self-managed RDS in the production environment.
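
      If you select Self-managed RDS, the cluster's Hive metastore connects to your database through standard Hive metastore JDBC properties. The following is a minimal sketch of the kind of values involved; the host, database name, user, and password are hypothetical placeholders, and the exact procedure is described in Configure a self-managed ApsaraDB RDS for MySQL database.

      ```python
      # Hypothetical Hive metastore connection properties for the
      # Self-managed RDS option. Replace the host, database, user, and
      # password with the values of your ApsaraDB RDS for MySQL instance.
      hive_metastore_conf = {
          "javax.jdo.option.ConnectionURL": (
              "jdbc:mysql://rm-example.mysql.rds.aliyuncs.com:3306/hivemeta"
          ),
          "javax.jdo.option.ConnectionDriverName": "com.mysql.jdbc.Driver",
          "javax.jdo.option.ConnectionUserName": "hive",
          "javax.jdo.option.ConnectionPassword": "<your-password>",
      }
      ```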

      DLF Catalog

      If you select DLF Unified Metadata for Metadata, the system selects a DLF catalog for you to store metadata.

      Root Storage Directory of Cluster

      The root storage directory of cluster data. Select a bucket for which the OSS-HDFS service is enabled.

      Note
      • Before you use the OSS-HDFS service, make sure that the OSS-HDFS service is available in the region in which you want to create a cluster. If the OSS-HDFS service is unavailable in the region, you can change the region or use HDFS instead of OSS-HDFS. For more information about the regions in which OSS-HDFS is available, see Enable OSS-HDFS and grant access permissions.

      • You can select the OSS-HDFS service only when you create a DataLake cluster in the new data lake scenario, a Dataflow cluster, a DataServing cluster, or a custom cluster of EMR V5.12.1, EMR V3.46.1, or a later minor version. If you select HDFS instead of OSS-HDFS, you do not need to configure this parameter.
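
      After a cluster that uses OSS-HDFS is created, the root storage directory is addressed through an oss:// path on the bucket's OSS-HDFS endpoint. The following minimal sketch lists such a directory from a cluster node; the bucket name and region are placeholders, and the endpoint form is an assumption based on the OSS-HDFS (JindoFS) documentation.

      ```python
      import subprocess

      # List the root storage directory of an OSS-HDFS bucket from a
      # cluster node on which the hadoop client is configured.
      # "examplebucket" and "cn-hangzhou" are placeholders.
      path = "oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/"
      subprocess.run(["hadoop", "fs", "-ls", path], check=True)
      ```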

      HBase Log Storage

      This check box is selected by default, which indicates that HBase stores HLog files in HDFS.

      Note

      This parameter is available only if the OSS-HDFS and HBase services are selected when you create a cluster of EMR V5.12.1, EMR V3.46.1, or a minor version later than EMR V5.12.1 or EMR V3.46.1.

      Hive Storage Mode

      The storage mode of Hive data. By default, Data Lake Storage is selected. An OSS-HDFS or Object Storage Service (OSS) directory is used for storage. If you clear the check box, HDFS of the cluster is used for storage.

      Note

      This parameter is available only if the Hive service is selected when you create a cluster of EMR V5.12.0, EMR V3.46.0, or a minor version earlier than EMR V5.12.0 or EMR V3.46.0.

      Hive Data Warehouse Path

      The OSS or OSS-HDFS bucket that is used for storage. We recommend that you select an OSS-HDFS bucket.

      Note
      • This parameter is available only if the Hive service is selected and Data Lake Storage is selected when you create a cluster of EMR V5.12.0, EMR V3.46.0, or a minor version earlier than EMR V5.12.0 or EMR V3.46.0.

      • Make sure that you have the required permissions to access the selected OSS or OSS-HDFS bucket.

      HBase Storage Mode

      The storage mode of HBase data files. OSS-HDFS or HDFS is used for storage.

      Note

      This parameter is available only if the HBase service is selected when you create a cluster of EMR V5.12.0, EMR V3.46.0, or a minor version earlier than EMR V5.12.0 or EMR V3.46.0.

      HBase Storage Path

      The OSS-HDFS bucket that is used for storage.

      Note

      This parameter is available only if the HBase service is selected and the HBase Storage Mode parameter is set to OSS-HDFS when you create a cluster of EMR V5.12.0, EMR V3.46.0, or a minor version earlier than EMR V5.12.0 or EMR V3.46.0.

      Advanced Settings

      • Kerberos Authentication: specifies whether to enable Kerberos authentication for the cluster. This switch is turned off by default.

        Important

        You cannot turn on the switch for the Knox and Kudu services.

      • Custom Software Configuration: specifies whether to customize the configurations of software. You can use a JSON file to customize the configurations of the basic software required for a cluster, such as Hadoop, Spark, and Hive. For more information, see Customize software configurations. This switch is turned off by default.

        Note

        For more information about how to configure the parallelism of Hive jobs, see How do I estimate the maximum number of Hive jobs that can be concurrently run?
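
      The Custom Software Configuration switch above accepts a JSON description of configuration overrides. The following sketch shows the general shape of such a description, expressed in Python for illustration; the ServiceName/FileName/ConfigKey/ConfigValue field names are an assumption, so confirm the exact schema in Customize software configurations.

      ```python
      import json

      # Hypothetical custom software configuration that overrides one
      # YARN property. Field names are assumptions; verify them in
      # Customize software configurations.
      custom_config = [
          {
              "ServiceName": "YARN",
              "FileName": "yarn-site",
              "ConfigKey": "yarn.nodemanager.resource.cpu-vcores",
              "ConfigValue": "8",
          }
      ]

      print(json.dumps(custom_config, indent=2))
      ```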

    2. Configure hardware parameters.


      Billing Method

      The billing method of the cluster. Subscription is selected by default. EMR supports the following billing methods:

      • Pay-as-you-go: a billing method that allows you to pay for an instance after you use the instance. The system charges you for a cluster based on the hours the cluster is actually used. Bills are generated on an hourly basis at the top of every hour. We recommend that you use pay-as-you-go clusters for short-term test jobs or dynamically scheduled jobs.

      • Subscription: a billing method that allows you to use an instance only after you pay for the instance.

        Note
        • We recommend that you create a pay-as-you-go cluster for a test run. If the cluster passes the test, you can create a subscription cluster for production.

        • If you select Subscription for Billing Method, you must also specify Subscription Duration and Auto-renewal. By default, the subscription period is six months and the Auto-renewal switch is turned on. If you turn on Auto-renewal, the system renews your subscription for one more month seven days before the expiration date. For more information, see Renewal policy.
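
      As a concrete reading of the renewal rule above, the following sketch computes the date on which auto-renewal is triggered for a hypothetical expiration date.

      ```python
      from datetime import date, timedelta

      # Auto-renewal runs seven days before the expiration date and
      # extends the subscription by one month. The expiration date
      # below is a placeholder.
      expiration = date(2023, 12, 31)
      renewal_trigger = expiration - timedelta(days=7)
      print(f"Auto-renewal is triggered on {renewal_trigger}.")
      ```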

      Zone

      The zone where you want to create a cluster. A zone in a region is a physical area with independent power supplies and network facilities. Clusters in zones within the same region can communicate with each other over an internal network. In most cases, you can use the zone that is selected by default.

      VPC

      The virtual private cloud (VPC) where you want to deploy the cluster. An existing VPC is selected by default.

      If you want to use a new VPC, go to the VPC console to create one. For more information, see Create and manage a VPC.

      VSwitch

      The vSwitch of the cluster. Select a vSwitch in the specific zone based on your business requirements. If no vSwitch is available in the zone, go to the VPC console to create one. For more information, see Create and manage a vSwitch.

      Default Security Group

      The security group of the cluster. An existing security group is selected by default. For more information about security groups, see Overview.

      You can also click create a new security group to create a security group in the ECS console. For more information, see Create a security group.

      Important

      Do not use an advanced security group that is created in the ECS console.

      Node Group

      The node groups of the cluster. You can select instance types based on your business requirements. For more information, see Instance families.

      • Master node group: runs control processes, such as ResourceManager and NameNode.

        • Add to Deployment Set: If you turn on the High Service Availability switch, the master nodes are added to a deployment set by default.

          For more information about deployment sets, see Overview.

        • System Disk: You can select a standard SSD, enhanced SSD, or ultra disk based on your business requirements.

          You can adjust the size of the system disk based on your business requirements. Valid values: 80 to 5000. Unit: GiB.

        • Data Disk: You can select standard SSDs, enhanced SSDs, or ultra disks based on your business requirements.

          You can adjust the size of the data disks based on your business requirements. Valid values: 40 to 32768. Unit: GiB.

        • Instances: One master node is configured by default. If you turn on the High Service Availability switch, multiple master nodes can be configured.

        • Additional Security Group: An additional security group allows interactions between different external resources and applications. You can associate a node group with up to two additional security groups.

        • Assign Public Network IP: specifies whether to associate an elastic IP address (EIP) with the cluster. This switch is turned off by default. You can assign public IP addresses only to the node groups of DataLake clusters.

          Note

          For information about how to apply for an EIP, see Elastic IP addresses.

      • Core node group: stores all the data of a cluster. You can add core nodes based on your business requirements after a cluster is created.

        • Add to Deployment Set: This switch is turned off by default. For more information about deployment sets, see Overview.

        • System Disk: You can select a standard SSD, enhanced SSD, or ultra disk based on your business requirements.

          You can adjust the size of the system disk based on your business requirements. Valid values: 80 to 5000. Unit: GiB.

        • Data Disk: You can select standard SSDs, enhanced SSDs, or ultra disks based on your business requirements.

          You can adjust the size of the data disks based on your business requirements. Valid values: 40 to 32768. Unit: GiB.

        • Instances: Two core nodes are configured by default. You can change the number of core nodes based on your business requirements.

        • Additional Security Group: An additional security group allows interactions between different external resources and applications. You can associate a node group with up to two additional security groups.

        • Assign Public Network IP: specifies whether to associate an EIP with the cluster. This switch is turned off by default. You can assign public IP addresses only to the node groups of DataLake clusters.

          Note

          For information about how to apply for an EIP, see Elastic IP addresses.

      • Task node group: stores no data and is used to adjust the computing capabilities of clusters. No task node group is configured by default. You can configure a task node group based on your business requirements.
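
      The disk size ranges above are the same for each node group: 80 to 5000 GiB for the system disk and 40 to 32768 GiB for each data disk. A minimal sketch that checks a planned size against these ranges:

      ```python
      # Disk size ranges stated in this topic, in GiB.
      SYSTEM_DISK_GIB = (80, 5000)
      DATA_DISK_GIB = (40, 32768)

      def check_disk(size_gib: int, lo: int, hi: int, kind: str) -> None:
          if not lo <= size_gib <= hi:
              raise ValueError(f"{kind} must be {lo}-{hi} GiB, got {size_gib}")

      check_disk(120, *SYSTEM_DISK_GIB, kind="system disk")
      check_disk(500, *DATA_DISK_GIB, kind="data disk")
      ```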

    3. Configure basic parameters.


      Cluster Name

      The name of the cluster. The name must be 1 to 64 characters in length and can contain only letters, digits, hyphens (-), and underscores (_).

      Identity Credentials

      • Key Pair (default): Use an SSH key pair to access the Linux instance. For information about how to use a key pair, see SSH key pair overview.

      • Password: Use the password that you set for the master node to access the Linux instance. The password must be 8 to 30 characters in length and must contain uppercase letters, lowercase letters, digits, and special characters. The following special characters are supported: ! @ # $ % ^ & *
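
      The Cluster Name and Password rules above can be checked locally before you submit the form. A minimal sketch, assuming that a password may contain only letters, digits, and the listed special characters:

      ```python
      import re

      # Cluster name: 1-64 characters; letters, digits, hyphens (-),
      # and underscores (_).
      NAME_RE = re.compile(r"^[A-Za-z0-9_-]{1,64}$")

      # Password: 8-30 characters with at least one uppercase letter,
      # one lowercase letter, one digit, and one of ! @ # $ % ^ & *.
      # The allowed character set is an assumption based on the rules above.
      PASSWORD_RE = re.compile(
          r"^(?=.*[A-Z])(?=.*[a-z])(?=.*\d)(?=.*[!@#$%^&*])"
          r"[A-Za-z\d!@#$%^&*]{8,30}$"
      )

      assert NAME_RE.match("emr-test_cluster")
      assert PASSWORD_RE.match("Example#2023")
      ```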

      Advanced Settings

      • ECS Application Role: You can assign an ECS application role to a cluster. EMR then applies for a temporary AccessKey pair when applications that run on the compute nodes of the cluster access other Alibaba Cloud services, such as OSS, so you do not need to manually enter an AccessKey pair. You can grant the application role access permissions on specific Alibaba Cloud services based on your business requirements. A sketch of how these temporary credentials are obtained follows this list.

      • Bootstrap Actions: Optional. You can configure bootstrap actions to run custom scripts before a cluster starts Hadoop. For more information, see Manage bootstrap actions.

      • Tags: Optional. You can add a tag when you create a cluster or add a tag on the Basic Information tab after a cluster is created. For more information, see Manage and use tags.

      • Resource Group: Optional. For more information, see Use resource groups.

      • Data Disk Encryption: Optional. You can turn on this switch only when you create a cluster. For more information, see Enable data disk encryption.
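
      The following sketch shows what the ECS application role described above provides on a cluster node: the ECS instance metadata service issues temporary credentials for the role, so applications do not need a hard-coded AccessKey pair. The role name is discovered at run time; this works only on an ECS instance to which a RAM role is attached.

      ```python
      import json
      import urllib.request

      # 100.100.100.200 is the ECS instance metadata service. Listing the
      # security-credentials path returns the attached role name; fetching
      # that role returns temporary credentials with an expiration time.
      BASE = "http://100.100.100.200/latest/meta-data/ram/security-credentials/"
      role = urllib.request.urlopen(BASE, timeout=2).read().decode().strip()
      creds = json.loads(urllib.request.urlopen(BASE + role, timeout=2).read())
      print(creds["AccessKeyId"], creds["Expiration"])
      ```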

      After you complete the configurations, click Next: Confirm. In the Confirm step, you can confirm the configurations and the fee for the creation of your cluster. The fee varies based on the billing method.

  4. After you verify that the configurations are correct, read the terms of service, select the check box, and then click Confirm.

    Important
    • Pay-as-you-go clusters: The cluster is created immediately. After the cluster is created, it enters the Running state.

    • Subscription clusters: An order is generated. The cluster will be created after you complete the payment.
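
The preceding steps cover the console workflow. Clusters can also be created programmatically through the EMR OpenAPI. The following Python sketch follows the standard Alibaba Cloud SDK pattern for the CreateCluster operation; the package name, request fields, and enumerated values are assumptions for illustration, so confirm them against the EMR API reference before use.

```python
from alibabacloud_tea_openapi import models as open_api_models
from alibabacloud_emr20210320.client import Client
from alibabacloud_emr20210320 import models as emr_models

# Assumed package and field names that follow the standard Alibaba
# Cloud SDK pattern; verify against the EMR (2021-03-20) API reference.
config = open_api_models.Config(
    access_key_id="<your-access-key-id>",          # placeholder
    access_key_secret="<your-access-key-secret>",  # placeholder
    endpoint="emr.cn-hangzhou.aliyuncs.com",
)
client = Client(config)

request = emr_models.CreateClusterRequest(
    region_id="cn-hangzhou",
    cluster_name="emr-test_cluster",
    cluster_type="DATALAKE",       # the new data lake scenario
    release_version="EMR-5.12.1",
    payment_type="PayAsYouGo",
)
response = client.create_cluster(request)
print(response.body)
```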