This topic describes how to create and configure a Dataflow Kafka cluster, which refers to a Dataflow cluster that is deployed with the Kafka service.

Usage notes

When you create a Dataflow Kafka cluster, you must select the appropriate type of Elastic Compute Service (ECS) instance and determine the number of brokers based on the estimated load of your business. No general cluster plan can be provided due to the variety of business scenarios. You need to create a cluster based on your actual environment. In most cases, we recommend that you consider the following items when you select an instance type:
  • Deploy Kafka brokers on ECS instances whose CPU-to-memory ratio is 1:4.
  • Use cloud disks to store data.
  • Consider the relationship between the I/O throughput of cloud disks and the network interface controller (NIC) bandwidth.
Consider the following factors when you configure the deployment parameters:
  • The Kafka versions used in E-MapReduce (EMR) depend on the ZooKeeper service. The availability of ZooKeeper determines whether the Kafka service is highly available. Therefore, we recommend that you turn on High Service Availability when you create a cluster. If you turn on High Service Availability when you create the cluster, three nodes are deployed for the ZooKeeper service.
  • If the master node group is only used to deploy ZooKeeper, you need to configure only one data disk for the master node group.

For more information about evaluation-based suggestions, see Suggestions for estimating cluster resources.
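
The instance-type considerations above can be expressed as a small check. The following Python snippet is a minimal sketch, not an official sizing tool. The `BrokerSpec` class, its field names, and the sample figures are hypothetical and do not come from EMR or ECS specifications. It checks whether an instance has a 1:4 CPU-to-memory ratio and compares the combined cloud disk throughput with the NIC bandwidth to show which one limits a broker first.

```python
from dataclasses import dataclass


@dataclass
class BrokerSpec:
    """Hypothetical description of one broker node. All values are illustrative."""
    vcpus: int
    memory_gib: int
    data_disks: int
    disk_throughput_mb_s: float  # assumed per-disk throughput, in MB/s
    nic_bandwidth_gbit_s: float  # assumed NIC bandwidth, in Gbit/s


def has_one_to_four_ratio(spec: BrokerSpec) -> bool:
    """Return True if the instance has 4 GiB of memory per vCPU."""
    return spec.memory_gib == 4 * spec.vcpus


def io_bottleneck(spec: BrokerSpec) -> tuple[str, float]:
    """Return the limiting resource ("disk" or "nic") and its capacity in MB/s."""
    disk_total = spec.data_disks * spec.disk_throughput_mb_s
    # Convert Gbit/s to MB/s using decimal units: 1 Gbit/s = 125 MB/s.
    nic_total = spec.nic_bandwidth_gbit_s * 1000 / 8
    if disk_total < nic_total:
        return "disk", disk_total
    return "nic", nic_total


if __name__ == "__main__":
    # Sample figures are made up for illustration only.
    spec = BrokerSpec(vcpus=4, memory_gib=16, data_disks=4,
                      disk_throughput_mb_s=140.0, nic_bandwidth_gbit_s=1.5)
    print(has_one_to_four_ratio(spec))
    print(io_bottleneck(spec))
```

If the result is "nic", adding more data disks does not raise broker throughput. If the result is "disk", a larger NIC does not help. Replace the sample figures with the values published for the instance type and disk category that you plan to use.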

Procedure

  1. Go to the cluster creation page.
    1. Log on to the EMR console. In the left-side navigation pane, click EMR on ECS.
    2. Optional: In the top navigation bar, select the region in which you want to create a cluster and select a resource group based on your business requirements.
      • You cannot change the region of a cluster after the cluster is created.
      • By default, all resource groups in your account are displayed.
    3. On the EMR on ECS page, click Create Cluster.
  2. Configure the cluster.
    To create a cluster, you must configure software parameters, hardware parameters, and basic parameters as guided by the wizard.
    Important After a cluster is created, you cannot modify its parameters except for the cluster name. Make sure that all parameters are correctly configured when you create a cluster.
    1. Configure software parameters.
      Parameter | Example | Description
      Region | China (Hangzhou) | The region in which you want to create the cluster. You cannot change the region of a cluster after the cluster is created.
      Business Scenario | Real-time Data Streaming | The scenario in which you want to use the cluster. Select Real-time Data Streaming.
      Product Version | EMR-3.43.1 | The version of EMR. After you select an EMR version, you can view the version of each service.
        For example, in an EMR V3.43.1 cluster, the version of Kafka is 2.12_2.4.1. The value 2.12 indicates the Scala version, and the value 2.4.1 indicates the version of open source Kafka.
      High Service Availability | On | By default, the switch is turned off.
        Important If you turn on High Service Availability when you create the cluster, three nodes are deployed in the master node group for the ZooKeeper service. The Kafka versions used in EMR depend on the ZooKeeper service. Therefore, when you create a cluster, we recommend that you turn on High Service Availability.
      Optional Services (Select One At Least) | Kafka | The services that you want to deploy in the cluster. Select Kafka.
        You can select other services based on your business requirements. By default, the relevant components of the services that you selected are started.
      Advanced Settings | Off | Custom Software Configuration: specifies custom software settings. You can use a JSON file to specify custom parameters for basic services required for a cluster, such as Hadoop, Spark, and Hive. By default, the switch is turned off.
    2. Configure hardware parameters.
      Parameter | Example | Description
      Billing Method | Pay-as-you-go | The billing method of the cluster. By default, Subscription is selected. EMR supports the following billing methods:
        • Pay-as-you-go: a billing method that allows you to pay for an instance after you use the instance. The system charges you for a cluster based on the hours the cluster is actually used. Bills are generated on an hourly basis at the top of every hour. We recommend that you use pay-as-you-go clusters for short-term test jobs or dynamically scheduled jobs.
        • Subscription: a billing method that allows you to use an instance only after you pay for the instance.
          Note We recommend that you create a pay-as-you-go cluster for a test run. If the cluster passes the test, you can create a subscription cluster for production.
      Zone | Zone I | The zone in which you want to create a cluster. A zone in a region is a physical area with independent power supplies and network facilities. Clusters in zones within the same region can communicate with each other over an internal network. In most cases, you can use the zone that is selected by default.
      VPC | emr_test/vpc-bp1f4epmkvncimpgs**** | The virtual private cloud (VPC) where you want to deploy the cluster. An existing VPC is selected by default.
        If you want to use a new VPC, go to the VPC console to create one. For more information, see Create and manage a VPC.
      vSwitch | vsw_test/vsw-bp1e2f5fhaplp0g6p**** | The vSwitch of the cluster. Select a vSwitch in the specific zone based on your business requirements. If no vSwitch is available in the zone, go to the VPC console to create one. For more information, see Create and manage a vSwitch.
      Default Security Group | sg-bp1ddw7sm2risw****/sg-bp1ddw7sm2risw**** | The security group of the cluster. By default, an existing security group is selected. For more information about security groups, see Overview.
        You can also click create a new security group to create a security group in the ECS console. For more information, see Create a security group.
        Important Do not use an advanced security group that is created in the ECS console.
      Node Group | Configure settings based on your business requirements |
        • Instance Type: You can select instance types and specifications based on your business requirements or based on evaluation-based suggestions. For more information about evaluation-based suggestions, see Suggestions for estimating cluster resources.
        • Add to Deployment Set: If you turn on High Service Availability, the master nodes are added to a deployment set by default. For more information about deployment sets, see Add nodes to the deployment set.
        • System Disk: You can select a type of system disk based on your business requirements.
        • System disk size: You can specify the size of a disk based on your business requirements. The recommended minimum disk size is 120 GiB. Valid values: 80 to 500. Unit: GiB.
        • Data Disk: You can select a type of data disk based on your business requirements.
          Note We recommend that you select a cloud disk type.
        • Data disk size: You can specify the size of a disk based on your business requirements. The recommended minimum disk size is 80 GiB. Valid values: 40 to 32768. Unit: GiB.
        • Instances: By default, three master nodes and three core nodes are deployed.
        • Additional Security Group: You can associate the node group with a maximum of two additional security groups. An additional security group allows for interactions between different external resources and applications in a flexible manner.
        • Assign Public Network IP: specifies whether to associate an elastic IP address (EIP) with the cluster. By default, this switch is turned off.
          Note If you do not turn on this switch but want to access the cluster over the Internet after you create the cluster, you must apply for a public IP address on ECS. For information about how to apply for an EIP address, see Elastic IP addresses.
    3. Configure basic parameters.
      Configure parameters in the Basic Configuration step.
      Important The following table describes all parameters. However, the parameters in the Advanced Settings section are not supported. Do not configure the parameters in this section.
      Parameter | Example | Description
      Cluster Name | Emr-Kafka | The name of the cluster. The name must be 1 to 64 characters in length and can contain only letters, digits, hyphens (-), and underscores (_).
      Identity Credentials | Custom password | Key Pair (default): Use an SSH key pair to access the Linux instance.
        For information about how to use an SSH key pair, see SSH key pair overview.
        Password: Use the password that you specified for the master node to access the Linux instance.
        The password must be 8 to 30 characters in length and must contain uppercase letters, lowercase letters, digits, and special characters.
        The following special characters are supported:
        ! @ # $ % ^ & *
      Advanced Settings | Configure the parameters based on your business requirements |
        • ECS Application Role: You can assign an ECS application role to a cluster. Then, EMR applies for a temporary AccessKey pair when applications running on the compute nodes of the cluster access other Alibaba Cloud services, such as OSS. This way, you do not need to manually enter an AccessKey pair. You can grant the access permissions of the application role on specific Alibaba Cloud services based on your business requirements.
        • Bootstrap Actions: Optional. You can configure bootstrap actions to run custom scripts before a cluster starts Hadoop. For more information, see Manage bootstrap actions.
        • Tags: Optional. You can add a tag when you create a cluster or add a tag on the Basic Information tab after a cluster is created. For more information, see Manage and use tags.
        • Resource Group: Optional. For more information, see Use resource groups.
        • Data Disk Encryption: Optional. You can turn on this switch only when you create a cluster. For more information, see Enable data disk encryption.
  3. In the Confirm step, read the terms of service and select the check box.
  4. Click Confirm.
    Refresh the EMR on ECS page to view the creation progress. When Status becomes Running, the cluster is created.
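
Because most parameters cannot be changed after the cluster is created, it can help to check the planned values against the constraints listed in the procedure before you enter them in the console. The following Python snippet is a minimal sketch of such a check. The function names are hypothetical, and the snippet does not call any EMR API. It also makes two assumptions that the procedure does not state: "letters" means ASCII letters, and a password may contain only letters, digits, and the listed special characters.

```python
import re

# Special characters listed as supported for the password.
SPECIAL_CHARS = "!@#$%^&*"


def valid_cluster_name(name: str) -> bool:
    """1 to 64 characters: letters, digits, hyphens, underscores (ASCII assumed)."""
    return re.fullmatch(r"[A-Za-z0-9_-]{1,64}", name) is not None


def valid_password(password: str) -> bool:
    """8 to 30 characters with uppercase, lowercase, digit, and special character."""
    if not 8 <= len(password) <= 30:
        return False
    # Assumption: no characters outside the four listed classes are allowed.
    if not all(c.isascii() and (c.isalnum() or c in SPECIAL_CHARS) for c in password):
        return False
    return (
        any(c.isupper() for c in password)
        and any(c.islower() for c in password)
        and any(c.isdigit() for c in password)
        and any(c in SPECIAL_CHARS for c in password)
    )


def valid_disk_sizes(system_disk_gib: int, data_disk_gib: int) -> bool:
    """System disk: 80 to 500 GiB. Data disk: 40 to 32768 GiB."""
    return 80 <= system_disk_gib <= 500 and 40 <= data_disk_gib <= 32768


def split_kafka_version(version: str) -> tuple[str, str]:
    """Split a version such as "2.12_2.4.1" into (Scala version, Kafka version)."""
    scala_version, kafka_version = version.split("_", 1)
    return scala_version, kafka_version


if __name__ == "__main__":
    print(valid_cluster_name("Emr-Kafka"))
    print(valid_password("Abcdef1!"))
    print(valid_disk_sizes(120, 80))
    print(split_kafka_version("2.12_2.4.1"))
```

The console remains the authority on what is accepted. A value that passes this sketch can still be rejected there if the actual rules are stricter than the assumptions made here.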

What to do next

After the cluster is created, you can modify the values of the default parameters of the cluster to meet production requirements. Examples: