E-MapReduce: Create a cluster

Last Updated: Apr 23, 2024

This topic describes how to use an Alibaba Cloud account to log on to the E-MapReduce (EMR) console and create a cluster on the EMR on ACK page.

Prerequisites

Procedure

  1. Log on to the EMR console. In the left-side navigation pane, click EMR on ACK.

  2. On the EMR on ACK page, click Create Cluster.

  3. On the E-MapReduce on ACK page, configure the parameters. The following table describes the parameters.

    Parameter

    Description

    Region

    The region in which you want to create a cluster. You cannot change the region after the cluster is created.

    Cluster Type

    The type of the cluster. Valid values:

    • Shuffle Service: an EMR extension that optimizes the shuffle operations of compute engines. Its remote shuffle service allows Spark jobs to run on nodes that do not have local disks and supports dynamic resource allocation, which makes it suitable for Spark clusters in the ACK environment. For more information, see Celeborn.

      Important

      When you create a Shuffle Service cluster, the nodes in the dedicated node pool or the nodes of the associated ACK cluster must use a big data instance family or an instance family with local SSDs. Otherwise, the remote shuffle service cannot be deployed.

      Note

      In EMR on ACK scenarios, the system provides a built-in automatic cleanup task named rss-pvc-clean for Shuffle Service clusters. The task cleans up persistent volume claim (PVC) resources that are no longer in use, either on a regular schedule or when specific conditions are met. This keeps invalid or redundant persistent data from occupying storage space.

    • Presto: an in-memory distributed SQL engine that is used for interactive queries.

      Presto clusters support various data sources and are suitable for complex analysis of petabytes of data and cross-data source queries.

    • Spark: a general-purpose distributed big data processing engine that provides various capabilities, such as extract, transform, and load (ETL), batch processing, and data modeling.

      Important

      If you want to associate a Spark cluster with a Shuffle Service cluster, both clusters must use the same EMR version. For example, a Spark cluster whose EMR version is EMR-5.x-ack can be associated only with a Shuffle Service cluster whose EMR version is also EMR-5.x-ack.

    • Flink: a distributed compute engine for stateful computing on bounded or unbounded data streams. Flink on ACK is built on EMR on ACK and Flink Kubernetes Operator 1.0.1. By default, Flink on ACK uses the kernel of Flink Enterprise Edition, so you can use Flink on ACK without additional configuration.

    Product Version

    The version of EMR. By default, the latest version is selected.

    Component Version

    Displays the type and version of the component that is deployed in the cluster of the specified type.

    ACK Cluster

    Select an existing ACK cluster or create an ACK cluster in the ACK console.

    You can click Configure Dedicated Nodes to configure an EMR-dedicated node or node pool. A node or node pool is dedicated to EMR by adding taints and labels to it, so that it can be used only for EMR.

    Note

    We recommend that you configure dedicated nodes in a node pool. If no node pool is available, create a node pool. For more information, see Create a node pool.

    OSS Bucket

    Select an existing bucket or create a bucket in the Object Storage Service (OSS) console.

    Cluster Name

    The name of the cluster. The name must be 1 to 64 characters in length and can contain only letters, digits, hyphens (-), and underscores (_).

  4. Click Create.

    When the status of the cluster changes to Running, the cluster has been created.
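
Two of the rules in the parameter table can be checked before you open the console: the Cluster Name format and the requirement that a Spark cluster and its Shuffle Service cluster use the same EMR version. The following is a minimal sketch of such a pre-flight check, not part of any EMR SDK. The function names are hypothetical, and the sketch assumes that "letters" means ASCII letters and that "the same EMR version" means identical version strings; the console may apply rules that differ from these assumptions.

```python
import re

# Cluster Name rule from the parameter table: 1 to 64 characters,
# only letters, digits, hyphens (-), and underscores (_).
# Assumption: "letters" is interpreted as ASCII letters A-Z and a-z.
_CLUSTER_NAME_RE = re.compile(r"[A-Za-z0-9_-]{1,64}")


def is_valid_cluster_name(name: str) -> bool:
    """Return True if the name satisfies the documented Cluster Name rule."""
    return _CLUSTER_NAME_RE.fullmatch(name) is not None


def can_associate(spark_emr_version: str, shuffle_emr_version: str) -> bool:
    """Return True if a Spark cluster and a Shuffle Service cluster
    report the same EMR version.

    Assumption: versions are compared as exact strings after trimming
    surrounding whitespace.
    """
    return spark_emr_version.strip() == shuffle_emr_version.strip()


if __name__ == "__main__":
    # Hypothetical values, used only to illustrate the checks.
    print(is_valid_cluster_name("spark-etl_01"))
    print(is_valid_cluster_name("spark etl"))
    print(can_associate("EMR-5.x-ack", "EMR-5.x-ack"))
```

A check like this only catches input mistakes early. The console remains the authority on whether a cluster name or a version pairing is accepted.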