A Dataflow Kafka cluster is an E-MapReduce (EMR) cluster deployed with Kafka in the Real-time Data Streaming scenario. This guide covers instance selection, software and network configuration, and the steps to complete the cluster creation wizard.
Kafka is no longer supported in EMR V5.18.0, EMR V3.52.0, or any minor version later than those releases. To use Kafka with those versions, use ApsaraMQ for Kafka or install Kafka manually instead.
Prerequisites
Before you begin, ensure that you have:
An Alibaba Cloud account with the permissions to create EMR clusters
A virtual private cloud (VPC) and a vSwitch in your target region and zone
A security group (do not use an advanced security group created in the Elastic Compute Service (ECS) console)
Create a Dataflow Kafka cluster
After a cluster is created, you cannot modify any parameters except the cluster name. Verify all settings before clicking Confirm.
Step 1: Go to the cluster creation page
Log on to the EMR console. In the left-side navigation pane, click EMR on ECS.
(Optional) In the top navigation bar, select a region and a resource group.
The region cannot be changed after the cluster is created.
All resource groups in your account are displayed by default.
Click Create Cluster.
Step 2: Configure software settings
| Parameter | Description |
|---|---|
| Region | The region where the cluster is created. Cannot be changed after creation. Example: China (Hangzhou). |
| Business scenario | Select Real-time Data Streaming for a Kafka cluster. |
| Product version | The EMR version determines the version of each bundled service. For example, EMR-3.43.1 includes Kafka 2.12_2.4.1, where 2.12 is the Scala version and 2.4.1 is the open-source Kafka version. |
| High Service Availability | Off by default. Turn this on to deploy three ZooKeeper nodes in the master node group. Because Kafka availability depends on ZooKeeper availability, we recommend turning this on when you create a cluster. If the master node group is used only for ZooKeeper, configure a single data disk for that group. For sizing guidance, see Suggestions for evaluating cluster resources. |
| Optional Services (Select One At Least) | Select Kafka. Add other services based on your requirements. Selected services are started automatically. |
| Collect Service Operational Logs | On by default. Keep this on, because turning it off limits EMR cluster health checks and service-related technical support. After cluster creation, you can change this setting on the Basic Information tab. For details, see How do I stop collection of service operational logs? |
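
The Product version row describes the Kafka label as a Scala version and an open-source Kafka version joined by an underscore. The following is a minimal Python sketch of splitting such a label, for example when recording which Kafka version a cluster runs. The function and class names are hypothetical helpers for illustration, not part of any EMR SDK.

```python
from typing import NamedTuple


class KafkaBundleVersion(NamedTuple):
    """Hypothetical container for the two parts of the version label."""
    scala: str
    kafka: str


def parse_kafka_bundle_version(label: str) -> KafkaBundleVersion:
    """Split a label such as '2.12_2.4.1' into Scala and Kafka versions."""
    scala, sep, kafka = label.partition("_")
    # Reject labels that lack the underscore or have an empty half.
    if not sep or not scala or not kafka:
        raise ValueError(f"expected '<scala>_<kafka>', got {label!r}")
    return KafkaBundleVersion(scala=scala, kafka=kafka)


version = parse_kafka_bundle_version("2.12_2.4.1")
print(version.scala)  # 2.12
print(version.kafka)  # 2.4.1
```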
Step 3: Configure hardware settings
| Parameter | Description |
|---|---|
| Billing method | Subscription (default) or pay-as-you-go. Use pay-as-you-go for short-term tests or dynamically scheduled jobs; use Subscription for stable production workloads. |
| Zone | The zone where the cluster is deployed. Zones within the same region are connected over the internal network. |
| VPC | The VPC for the cluster. An existing VPC is selected by default. To create one, see Create and manage a VPC. |
| vSwitch | The vSwitch in the selected zone. If no vSwitch is available, see Create and manage a vSwitch. |
| Default Security Group | The security group for the cluster. An existing group is selected by default. To create one, click create a new security group. Do not use an advanced security group created in the ECS console. For an overview of security groups, see Overview. |
| Node Group | Configure each node group as follows: |
Node group settings:
For Kafka brokers, use ECS instances with a CPU-to-memory ratio of 1:4, for example 4 vCPUs with 16 GiB of memory. When sizing nodes, balance the I/O throughput of the cloud disks against the network interface controller (NIC) bandwidth so that neither becomes the bottleneck.
Instance type: Select instance types based on your workload or refer to Suggestions for evaluating cluster resources.
Add to Deployment Set: If High Service Availability is on, master nodes are added to a deployment set by default. See Add nodes to the deployment set.
System Disk: Select a disk type. Minimum recommended size: 120 GiB. Valid range: 80–500 GiB.
Data Disk: Use cloud disks. Minimum recommended size: 80 GiB. Valid range: 40–32768 GiB.
Instances: Three master nodes and three core nodes are deployed by default.
Additional Security Group: Associate up to two additional security groups per node group.
Assign Public Network IP: Off by default. Turn on to associate an elastic IP address (EIP) with the cluster. See What is an Elastic IP Address?
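The disk ranges and the 1:4 CPU-to-memory guidance above can be checked before you open the wizard. The following Python sketch does this under stated assumptions: `NodeGroupPlan` and `check_plan` are hypothetical names for illustration, the ratio is read as vCPUs to GiB of memory, and the numeric limits are copied from the node group settings in this section.

```python
from dataclasses import dataclass

# Limits quoted from the node group settings in this guide (sizes in GiB).
SYSTEM_DISK_RANGE = (80, 500)
SYSTEM_DISK_RECOMMENDED_MIN = 120
DATA_DISK_RANGE = (40, 32768)
DATA_DISK_RECOMMENDED_MIN = 80


@dataclass
class NodeGroupPlan:
    """Hypothetical planning record, not an EMR API object."""
    vcpus: int
    memory_gib: int
    system_disk_gib: int
    data_disk_gib: int


def check_plan(plan: NodeGroupPlan) -> tuple[list[str], list[str]]:
    """Return (errors, warnings) for a planned Kafka broker node group."""
    errors: list[str] = []
    warnings: list[str] = []
    checks = (
        ("system disk", plan.system_disk_gib, SYSTEM_DISK_RANGE, SYSTEM_DISK_RECOMMENDED_MIN),
        ("data disk", plan.data_disk_gib, DATA_DISK_RANGE, DATA_DISK_RECOMMENDED_MIN),
    )
    for name, size, (low, high), recommended in checks:
        if not low <= size <= high:
            # Outside the valid range: the wizard would not accept this size.
            errors.append(f"{name}: {size} GiB is outside the valid range {low}-{high} GiB")
        elif size < recommended:
            # Valid, but smaller than the recommended minimum.
            warnings.append(f"{name}: {size} GiB is below the recommended {recommended} GiB")
    # Assumption: the 1:4 ratio means 1 vCPU per 4 GiB of memory.
    if plan.memory_gib != 4 * plan.vcpus:
        warnings.append(
            f"{plan.vcpus} vCPUs with {plan.memory_gib} GiB is not a 1:4 CPU-to-memory ratio"
        )
    return errors, warnings


errors, warnings = check_plan(
    NodeGroupPlan(vcpus=4, memory_gib=16, system_disk_gib=100, data_disk_gib=20)
)
print(errors)
print(warnings)
```

In this example the 20 GiB data disk is reported as an error because it is below the valid range, and the 100 GiB system disk is reported as a warning because it is valid but below the recommended minimum.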
Step 4: Configure basic settings
Configure the parameters in the Basic Information step.
The parameters in the Advanced Settings section are not supported. Do not configure them.
| Parameter | Description |
|---|---|
| Cluster Name | 1–64 characters. Letters, digits, hyphens (-), and underscores (_) only. Example: Emr-Kafka. |
| Identity Credentials | Key Pair (default): Access the Linux instance using an SSH key pair. See SSH key pair overview. Password: Set a password for the master node. The password must be 8–30 characters long and include uppercase letters, lowercase letters, digits, and at least one of the following special characters: ! @ # $ % ^ & * |
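
The naming and password rules in the table can be checked locally before you enter them in the wizard. The following Python sketch makes two assumptions that the table does not state: "letters" is read as ASCII letters, and the password check only verifies the required character classes and length without rejecting other characters. The function names are hypothetical.

```python
import re
import string

# 1-64 characters: letters, digits, hyphens, and underscores only.
CLUSTER_NAME_RE = re.compile(r"[A-Za-z0-9_-]{1,64}")
PASSWORD_SPECIALS = "!@#$%^&*"


def is_valid_cluster_name(name: str) -> bool:
    """Check the Cluster Name rule (assumes ASCII letters)."""
    return CLUSTER_NAME_RE.fullmatch(name) is not None


def is_valid_password(password: str) -> bool:
    """Check length and the four required character classes."""
    return (
        8 <= len(password) <= 30
        and any(c in string.ascii_uppercase for c in password)
        and any(c in string.ascii_lowercase for c in password)
        and any(c in string.digits for c in password)
        and any(c in PASSWORD_SPECIALS for c in password)
    )


print(is_valid_cluster_name("Emr-Kafka"))   # True
print(is_valid_cluster_name("Emr Kafka"))   # False: contains a space
print(is_valid_password("Abcdef1!"))        # True
print(is_valid_password("abcdef1!"))        # False: no uppercase letter
```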
Step 5: Confirm and create
In the Confirm step, read the terms of service and select the checkbox.
Click Confirm.
Refresh the EMR on ECS page to monitor progress. When Status shows Running, the cluster is ready.
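If you track cluster creation from a script instead of refreshing the page, the same check can be expressed as a polling loop. The following Python sketch is generic: `get_status` is a hypothetical callable that you supply (for example, a wrapper around whatever tooling you use to read the cluster status), and the only status value taken from this guide is Running. Any other value is treated as still in progress.

```python
import time
from typing import Callable


def wait_until_running(
    get_status: Callable[[], str],
    interval_s: float = 30.0,
    timeout_s: float = 3600.0,
) -> bool:
    """Poll get_status until it returns 'Running' or the timeout expires.

    Returns True if the cluster reached Running, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if get_status() == "Running":
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Usage with a stand-in status source; the non-Running value is a placeholder.
fake_statuses = iter(["not-ready", "not-ready", "Running"])
print(wait_until_running(lambda: next(fake_statuses), interval_s=0.0, timeout_s=5.0))  # True
```

The default interval and timeout are arbitrary illustration values; adjust them to how long cluster creation takes in your environment.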
What's next
After the cluster is running, configure security features:
SSL encryption: Encrypt data in transit for Kafka. See Use SSL to encrypt Kafka data.
SASL authentication: Require clients to authenticate before accessing the cluster. See Log on to a Kafka cluster by using SASL.