This topic describes how to estimate the hardware resources required by an E-MapReduce (EMR) cluster that is deployed with the Kafka service based on simple rules in general business scenarios. In actual business scenarios, you can use these rules to estimate the resources that are required and then determine the final cluster specifications based on the results of a load test. After a cluster is created, you can use the cluster scale-out feature to change the resource configurations of the cluster based on the actual resource usage.

Many factors affect the hardware resources required by Kafka clusters. Common factors include the peak message traffic, the average message size, the number of partitions, the replication factor, and the number of clients. Factors outside the Kafka service itself include the business scenarios in which clusters are used and the performance of business applications. Therefore, when you estimate the hardware resources required by a cluster, first estimate the actual business scale and then use the results as business parameters to estimate the required resources. You can use tools such as kafka-producer-perf-test and kafka-consumer-perf-test to simulate the actual load and refine the estimate.
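As a starting point for such a load test, the following Python sketch assembles a kafka-producer-perf-test invocation that approximates an estimated business load. The topic name, record count, record size, and bootstrap address are illustrative placeholders; adjust them to your environment.

```python
# Sketch: build a kafka-producer-perf-test command line that approximates
# the estimated business load. All argument values below are placeholders.
def build_producer_perf_cmd(topic, num_records, record_size_bytes,
                            target_throughput, bootstrap_servers):
    return [
        "kafka-producer-perf-test.sh",
        "--topic", topic,
        "--num-records", str(num_records),
        "--record-size", str(record_size_bytes),
        # records per second; -1 means unthrottled
        "--throughput", str(target_throughput),
        "--producer-props", f"bootstrap.servers={bootstrap_servers}",
    ]

# Hypothetical example: 10 million 1 KB records at 100,000 records/s.
cmd = build_producer_perf_cmd("perf-test", 10_000_000, 1024, 100_000,
                              "core-1-1:9092")
print(" ".join(cmd))
```

Run the printed command on a client machine against the test cluster, and mirror it with kafka-consumer-perf-test to exercise the read path.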

Master node group (ZooKeeper)

The master node group is used to install the ZooKeeper component. In addition, Kafka ecosystem components such as Kafka Manager, Schema Registry, and REST Proxy are also installed on the master node group.

In most cases, we recommend that you configure the following specifications for the master node group:
  • Number of nodes: three.
  • Data disk: Select a cloud disk with a storage capacity of 120 GB.
  • System disk: 80 GB.
  • CPU: four CPU cores.
  • Memory: 8 GB.
    Important We recommend that you select an instance type with a CPU-to-memory ratio of 1:2.

Core node group (Kafka broker)

Estimated business requirements

You must estimate business requirements based on the following parameters:
  • Fan-out factor: the number of times that business data is consumed by downstream nodes, excluding the number of times that data is consumed by in-cluster replication of Kafka.
  • Peak inbound traffic: the peak traffic of business data. Unit: MB/s.
  • Average inbound traffic: the average traffic of business data. Unit: MB/s.
  • Data retention period: the number of days for which data is retained. By default, data is retained for seven days.
  • Partition replication factor: the number of replicas for a partition. By default, each partition has three replicas.
Note Fully account for peak traffic based on your actual business conditions. Peak traffic is usually an order of magnitude higher than average traffic.
When you estimate business requirements based on the preceding parameters, you must appropriately retain redundant resources. This way, the cluster can still provide services even if the cluster is heavily loaded in extreme business scenarios. Based on the preceding parameters, the following metrics can be calculated:
  • Total peak write traffic of the cluster = Peak inbound traffic × Partition replication factor
  • Total peak read traffic of the cluster = Peak inbound traffic × (Fan-out factor + Partition replication factor - 1)
  • Total storage capacity of the cluster = Average inbound traffic × Data retention period × Partition replication factor
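The three metrics above can be computed with a short Python helper. The seconds-per-day conversion and the GB output unit are our additions, because the formulas leave units implicit; the example figures are illustrative.

```python
def cluster_metrics(peak_in_mbps, avg_in_mbps, fan_out,
                    retention_days, replication_factor):
    """Apply the three sizing formulas from this topic."""
    # Total peak write traffic = peak inbound x replication factor
    peak_write = peak_in_mbps * replication_factor
    # Total peak read traffic = peak inbound x (fan-out + replication - 1)
    peak_read = peak_in_mbps * (fan_out + replication_factor - 1)
    # Total storage = average inbound x retention x replication.
    # Convert MB/s x days to GB: 86,400 seconds per day, 1,024 MB per GB.
    storage_gb = (avg_in_mbps * retention_days * 86_400
                  * replication_factor / 1_024)
    return peak_write, peak_read, storage_gb

# Hypothetical example: 100 MB/s peak, 10 MB/s average, fan-out of 3,
# 7-day retention, 3 replicas per partition.
write_mbps, read_mbps, storage_gb = cluster_metrics(100, 10, 3, 7, 3)
print(write_mbps, read_mbps, round(storage_gb))
```

With these inputs the cluster must sustain 300 MB/s of writes and 500 MB/s of reads, and store roughly 17.7 TB in total.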

Recommended node specifications

In most cases, we recommend that you configure the following specifications for the core node group:
  • Number of nodes: Estimate the number of nodes based on your business requirements. For more information, see the Number of brokers section of this topic.
  • CPU: 16 CPU cores.
  • Memory: 64 GB.
    Important We recommend that you select an instance type with a CPU-to-memory ratio of 1:4.
  • System disk: 80 GB.
  • Data disk: Select four cloud disks with a storage capacity that is estimated based on your business requirements.
  • Network interface controller (NIC) bandwidth: Calculate the NIC bandwidth based on the total I/O of disks on a node.
Note
  • We recommend that you use cloud disks as data disks to avoid the O&M workload caused by disk failures. This helps ensure higher service availability and reduces O&M labor costs.
  • After you select the data disk type and the number of disks, you can calculate the total I/O throughput of the disks on a node. We recommend that you select a NIC bandwidth that is higher than or equal to this total disk throughput.
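As a quick sanity check, the minimum NIC bandwidth implied by a node's disks can be computed as follows. The 4 × PL1 ESSD configuration and the MB/s-to-Gbit/s conversion are illustrative assumptions.

```python
def min_nic_bandwidth_gbps(disk_throughput_mbps, num_data_disks):
    """Total disk throughput of a node, expressed in Gbit/s, as a
    lower bound for the NIC bandwidth to select."""
    total_mbps = disk_throughput_mbps * num_data_disks
    return total_mbps * 8 / 1_000  # MB/s -> Gbit/s

# Hypothetical node: four PL1 ESSDs at 350 MB/s each.
bandwidth = min_nic_bandwidth_gbps(350, 4)
print(round(bandwidth, 2))
```

In this example the node's disks can sustain 1,400 MB/s, so a NIC of at least about 11.2 Gbit/s is appropriate.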

Number of brokers

In ideal cases, the maximum traffic that a Kafka broker can handle is bounded by the total I/O throughput of the disks or the NIC bandwidth of the node, whichever is lower. Therefore, the number of required brokers can be calculated from the peak data traffic and the I/O throughput or NIC bandwidth of each node.

  • Calculate the disk performance metric of a node
    Disk throughput of a node = Throughput of a disk × Number of data disks

    For more information about the theoretical I/O performance of each disk type, see EBS performance. For example, the maximum throughput of a PL1 enhanced SSD (ESSD) is 350 MB/s. For local disks, we recommend that you estimate the usable throughput at half of the theoretical value. For example, the throughput of a local disk is estimated at 50 MB/s in most cases.

  • Calculate the number of required brokers
    If you configure three replicas for a partition, we recommend that you select four or more brokers. If one broker is temporarily unavailable, you can still create a partition with three replicas. In most cases, we recommend that you retain 50% of redundant hardware resources. Based on the preceding premise, the following formula can be used to calculate the number of required brokers:
    Number of brokers = Max(4, (Total peak write traffic of the cluster + Total peak read traffic of the cluster)/Disk throughput of a single node/50%)
    In addition, considering the limits on partition replicas, we recommend that you configure no more than 2,000 partition replicas on each broker. A broker can have a maximum of 4,000 partition replicas. The entire cluster can have a maximum of 200,000 partition replicas. If the total number of partition replicas in the cluster is estimated to be large, we recommend that you estimate the number of brokers based on the total number of partitions. In this case, the following formula can be used to calculate the number of required brokers:
    Number of brokers = Max(4, Estimated total number of partitions × Partition replication factor/2,000)
  • Estimate the capacity of each data disk
    Capacity of each data disk = Total storage capacity of the cluster/Number of brokers/Number of data disks per node/50%

    Dividing by 50% keeps disk utilization at or below half of the available capacity, in line with the 50% resource redundancy recommended above.
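The broker-count and disk-capacity formulas above can be sketched in Python as follows. The function names and the example figures (300 MB/s of writes, 500 MB/s of reads, four PL1 ESSDs at 350 MB/s each, 3,000 partitions, about 17.7 TB of storage) are illustrative assumptions that follow the earlier examples in this topic.

```python
import math

def broker_count_by_traffic(peak_write_mbps, peak_read_mbps,
                            node_disk_throughput_mbps, redundancy=0.5):
    """Traffic-based formula: keep 50% of hardware in reserve, minimum 4."""
    needed = ((peak_write_mbps + peak_read_mbps)
              / node_disk_throughput_mbps / redundancy)
    return max(4, math.ceil(needed))

def broker_count_by_partitions(total_partitions, replication_factor,
                               replicas_per_broker=2_000):
    """Partition-based formula: at most 2,000 replicas per broker."""
    return max(4, math.ceil(total_partitions * replication_factor
                            / replicas_per_broker))

def data_disk_size_gb(total_storage_gb, brokers, disks_per_node,
                      utilization=0.5):
    """Capacity of each data disk, keeping disk utilization at 50%."""
    return total_storage_gb / brokers / disks_per_node / utilization

# Hypothetical node: four PL1 ESSDs at 350 MB/s each = 1,400 MB/s.
brokers = broker_count_by_traffic(300, 500, 4 * 350)
brokers_by_partitions = broker_count_by_partitions(3_000, 3)
disk_gb = data_disk_size_gb(17_719, brokers, 4)
print(brokers, brokers_by_partitions, round(disk_gb))
```

Take the larger of the two broker counts. Here traffic alone needs only the 4-broker minimum, but 3,000 partitions with 3 replicas push the count to 5; at 4 brokers, each of the four data disks per node would need roughly 2.2 TB.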

(Optional) Task node group (Kafka Connect)

This node group is optional. After a cluster is created, you can resize the cluster at any time based on the resource usage.

In most cases, we recommend that you configure the following specifications for the task node group:
  • Number of nodes: We recommend that you select at least two nodes so that the Kafka Connect cluster is highly available.
  • Data disk: Select a cloud disk with a storage capacity of 80 GB or more.
  • CPU: We recommend that you select eight or more CPU cores for each node and scale out at any time based on the CPU utilization of the connectors.
  • Memory: Select the memory capacity based on the connector type and memory usage.
    Important We recommend that you select an instance type with a CPU-to-memory ratio of 1:2 or 1:4.