Accurate resource sizing before cluster creation prevents both under-provisioning, which degrades performance under load, and over-provisioning, which wastes money. This topic provides sizing formulas and recommended specifications for each node group in an E-MapReduce (EMR) cluster running Kafka. After the initial estimate, validate with kafka-producer-perf-test and kafka-consumer-perf-test, then use the scale-out feature to adjust configurations as your workload changes.
Kafka cluster sizing depends on peak message traffic, average message size, partition count, replication factor, and client count. Collect actual business metrics before applying the formulas below.
## Node group overview
The following table summarizes recommended specifications for each node group. Detailed sizing guidance and formulas follow in the sections below.
| Node group | Role | Nodes | CPU | Memory | System disk | Data disk |
|---|---|---|---|---|---|---|
| Master | ZooKeeper + ecosystem components | 3 | 4 cores | 8 GiB | 80 GiB | 120 GiB cloud disk |
| Core | Kafka broker | See Number of brokers | 16 cores | 64 GiB | 80 GiB | 4 x cloud disks (size varies) |
| Task (optional) | Kafka Connect | At least 2 | At least 8 cores | Based on connector | — | At least 80 GiB cloud disk |
## Master node group (ZooKeeper)
The master node group runs ZooKeeper and the Kafka ecosystem components: Kafka Manager, Schema Registry, and REST Proxy.
Configure three master nodes with the following specifications:
| Resource | Recommended value |
|---|---|
| Nodes | 3 |
| CPU | 4 cores |
| Memory | 8 GiB |
| CPU-to-memory ratio | 1:2 |
| System disk | 80 GiB |
| Data disk | 120 GiB cloud disk |
## Core node group (Kafka broker)
### Business parameters
Collect the following business parameters before calculating broker count and disk size:
| Parameter | Description | Default |
|---|---|---|
| Fan-out factor | Number of times downstream consumers read the business data, excluding in-cluster replication | — |
| Peak inbound traffic | Peak business data throughput (MB/s) | — |
| Average inbound traffic | Average business data throughput (MB/s) | — |
| Data retention period | How long data is retained (days) | 7 days |
| Partition replication factor | Number of replicas per partition | 3 |
Peak traffic is typically one order of magnitude higher than average traffic. Set the peak inbound traffic value accordingly, and retain adequate redundant capacity so the cluster can sustain service under extreme load.
Use these parameters to derive the following cluster-level metrics:
| Metric | Formula |
|---|---|
| Total peak write traffic | Peak inbound traffic x Partition replication factor |
| Total peak read traffic | Peak inbound traffic x (Fan-out factor + Partition replication factor - 1) |
| Total storage capacity | Average inbound traffic x Data retention period x Partition replication factor |
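As a worked illustration of these formulas, the following sketch assumes a peak inbound traffic of 100 MB/s, an average of 10 MB/s, a fan-out factor of 2, a replication factor of 3, and 7-day retention. These are example values, not recommendations:

```python
# Derive the cluster-level metrics from assumed business parameters.
fan_out = 2              # fan-out factor (example value)
peak_in_mb_s = 100.0     # peak inbound traffic, MB/s (example value)
avg_in_mb_s = 10.0       # average inbound traffic, MB/s (example value)
retention_days = 7       # data retention period
replication = 3          # partition replication factor

total_peak_write = peak_in_mb_s * replication                 # 300.0 MB/s
total_peak_read = peak_in_mb_s * (fan_out + replication - 1)  # 400.0 MB/s
total_storage_mb = avg_in_mb_s * 86_400 * retention_days * replication

print(f"write={total_peak_write} MB/s, read={total_peak_read} MB/s, "
      f"storage={total_storage_mb / 1e6:.2f} TB")
# write=300.0 MB/s, read=400.0 MB/s, storage=18.14 TB
```

Note that storage is driven by the average traffic, while broker throughput is driven by the peak traffic.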
### Recommended node specifications
Configure core nodes with the following specifications:
| Resource | Recommended value |
|---|---|
| CPU | 16 cores |
| Memory | 64 GiB |
| CPU-to-memory ratio | 1:4 |
| System disk | 80 GiB |
| Data disks | 4 x cloud disks (size calculated below) |
**Disk type**: Use cloud disks as data disks to avoid the operations and maintenance (O&M) burden of physical disk failures. This improves service availability and lowers O&M labor costs.
After selecting disk type and count, calculate the total disk I/O throughput of the node. Select a network interface card (NIC) with bandwidth greater than or equal to the total disk I/O throughput.
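For example, with four PL1 ESSDs at 350 MB/s each (an assumed disk choice), the NIC check works out as follows:

```python
per_disk_mb_s = 350                            # PL1 ESSD max throughput, MB/s
data_disks = 4
total_disk_mb_s = per_disk_mb_s * data_disks   # 1400 MB/s of disk I/O per node
min_nic_gbit_s = total_disk_mb_s * 8 / 1000    # convert MB/s to Gbit/s
print(min_nic_gbit_s)  # 11.2 -> choose a NIC with at least this bandwidth
```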
### Number of brokers
In ideal conditions, a Kafka broker's throughput ceiling is either its disk I/O throughput or NIC bandwidth. Use the following steps to calculate the required broker count.
**Step 1: Calculate disk throughput per node.**
Disk throughput per node = Throughput per disk x Number of data disks
For reference, the maximum throughput of a PL1 Enterprise SSD (ESSD) is 350 MB/s. For local disks, use half the theoretical value as the effective throughput (typically 50 MB/s).
For detailed disk performance values, see Block storage performance.
**Step 2: Calculate the number of brokers based on traffic.**
With a replication factor of 3, use at least 4 brokers so that a partition with 3 replicas can still be created even if one broker is temporarily unavailable. Maintain 50% redundant capacity and round the result up to a whole number:
Number of brokers = Max(4, (Total peak read traffic + Total peak write traffic) / Disk throughput per node / 50%)
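Continuing with assumed example figures of 400 MB/s total peak read traffic, 300 MB/s total peak write traffic, and four PL1 ESSDs per node (350 MB/s each), steps 1 and 2 work out to:

```python
import math

disk_throughput_per_node = 350 * 4    # Step 1: 1400 MB/s per node
total_peak_traffic = 400.0 + 300.0    # read + write, MB/s (example values)

# Step 2: keep 50% headroom and never go below the 4-broker floor.
brokers = max(4, math.ceil(total_peak_traffic / disk_throughput_per_node / 0.5))
print(brokers)  # 4 -- the minimum floor applies at this traffic level
```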
**Step 3: Check against partition replica limits.**
If the total number of partition replicas is large, cross-check with the partition-based formula and use the larger of the two broker counts:
Number of brokers = Max(4, Total number of partitions x Partition replication factor / 2,000)
Partition replica limits:
| Limit | Value |
|---|---|
| Recommended max replicas per broker | 2,000 |
| Hard max replicas per broker | 4,000 |
| Hard max replicas per cluster | 200,000 |
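For instance, a topic layout with 3,000 partitions at replication factor 3 (assumed example values) exceeds what 4 brokers can hold under the 2,000-replica recommendation:

```python
import math

total_partitions = 3_000   # assumed example
replication = 3
total_replicas = total_partitions * replication   # 9000 replicas

# Step 3: each broker should host at most 2,000 partition replicas.
brokers = max(4, math.ceil(total_replicas / 2_000))
print(brokers)  # 5 -- here the partition limit, not traffic, sets the broker count
```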
**Step 4: Calculate the size of each data disk.**
Size of each data disk = Total storage capacity / Number of brokers / Number of data disks per node / 50%
Dividing by 50% keeps half of each disk free as headroom.
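Using the earlier example figures (18,144,000 MB total storage capacity, 4 brokers, 4 data disks per node; all assumed values), the formula gives:

```python
total_storage_mb = 18_144_000   # example: 10 MB/s average x 7 days x replication 3
brokers = 4
disks_per_node = 4

# Reserve 50% of each disk as free space.
disk_size_mb = total_storage_mb / brokers / disks_per_node / 0.5
print(f"{disk_size_mb / 1000:.0f} GB per data disk")  # 2268 GB per data disk
```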
## Scaling after cluster creation
The 50% redundancy reserve built into the sizing formulas keeps the cluster below the load threshold where throttling begins. After cluster creation, monitor resource utilization and use the scale-out feature to adjust configurations based on actual resource usage.
## Task node group (Kafka Connect) (optional)
The task node group is optional and runs Kafka Connect. It can be resized at any time after cluster creation based on actual resource usage.
Configure task nodes with the following specifications:
| Resource | Recommended value |
|---|---|
| Nodes | At least 2 (for high availability) |
| CPU | At least 8 cores per node; increase based on connector CPU utilization |
| Memory | Based on connector type and memory usage |
| CPU-to-memory ratio | 1:2 or 1:4 |
| Data disk | At least 80 GiB cloud disk |