A Kafka cluster can be deployed on Elastic Compute Service (ECS) instances on which Elastic Remote Direct Memory Access (eRDMA) is enabled. The Kafka cluster can fully utilize the low latency, high throughput, and low CPU utilization provided by eRDMA to improve the efficiency of data transmission between nodes in the Kafka cluster. The Kafka cluster is suitable for scenarios that require high message throughput and low latency. This topic describes how to deploy a Kafka cluster on eRDMA-capable ECS instances. This topic also describes how to test Kafka performance that is improved by eRDMA.
Kafka is a distributed stream processing platform that efficiently processes and stores a large number of data streams and supports real-time message publishing and subscription. Kafka is widely used in scenarios such as log aggregation, event sourcing, and real-time analytics. For more information, see Kafka documentation.
eRDMA is a Remote Direct Memory Access (RDMA) service developed by Alibaba Cloud to ensure high network performance with low latency, high throughput, and high elasticity. For more information, see Overview.
Step 1: Prepare ECS instances
Before you deploy a Kafka cluster, prepare multiple ECS instances and deploy the Broker and ZooKeeper services and build a stress testing environment on the instances.
A Broker-enabled instance serves as a core data node in a cluster to store, transmit, and manage messages.
A ZooKeeper-enabled instance implements distributed service coordination and management in the Kafka cluster.
A stress testing instance is used to test the performance of the deployed Kafka cluster.
In this example, five ECS instances are created, which include one ZooKeeper-enabled instance, three Broker-enabled instances, and one stress testing instance. The following table describes the requirements for the instance configurations.
The selected instance types must support eRDMA. For information about the instance types that support eRDMA, see the Limits section in the "Configure eRDMA on an enterprise-level instance" topic.
Purpose | Instance requirement | Disk requirement | Network requirement | Image requirement |
Broker-enabled instance | Create three instances. In this example, the ecs.g8a.2xlarge instance type is used. | Select Enterprise SSDs (ESSDs) at performance level 3 (PL3). Select the disk capacity based on your business requirements. |
| Alibaba Cloud Linux 3.2104 LTS 64-bit. |
Zookeeper-enabled instance | Create one instance. In this example, the ecs.g8a.xlarge instance type is used. | None. | ||
Stress testing instance | Create one instance. In this example, the ecs.g8a.16xlarge instance type is used. | None. |
Step 2: Install required tools and Kafka
Log on to each of the five ECS instances prepared in Step 1 and run the relevant commands to install Shared Memory Communications over Remote Direct Memory Access (SMC-R), the Java tool, and Kafka.
Before you can use the eRDMA feature, you must deploy SMC-R. SMC-R works in the kernel space, and the SMC-R protocol stack helps instances use, manage, and maintain eRDMA resources. For information about SMC-R, see Use SMC.
Log on to all ECS instances in sequence.
For more information, see Connect to a Linux instance by using a password or key.
(Conditionally required) Run the
uname -rcommand to view the kernel version. Make sure that the kernel version of all instances is5.10.134-16.3or later. If the kernel version of an instance is earlier than5.10.134-16.3, run the following commands to upgrade the kernel to the latest version:sudo yum update kernel sudo rebootRun the following command to install the smc-tools toolkit on each instance:
sudo yum install smc-tools -yRun the following command to check whether eRDMA is enabled on each instance:
smcr devSample output:
Net-Dev IB-Dev IB-P IB-State Type Crit #Links PNET-ID eth0 erdma_0 1 ACTIVE 0x107f No 0Run the following command to disable IPv6 on each instance.
NoteAlibaba Cloud eRDMA devices and SMC devices do not support IPv6 addresses. After you disable IPv6, traffic can pass through RDMA channels by using IPv4.
sudo sysctl net.ipv6.conf.all.disable_ipv6=1Run the following command to install Java and Git:
sudo yum install java-11-openjdk-1:11.0.21.0.9-2.0.3.al8 java-11-openjdk-devel-1:11.0.21.0.9-2.0.3.al8 git -yRun the following commands to download and decompress the Kafka package:
wget https://archive.apache.org/dist/kafka/3.5.0/kafka_2.13-3.5.0.tgz tar -xf kafka_2.13-3.5.0.tgz
Step 3: Start ZooKeeper and Broker for Kafka
Log on to all ECS instances in sequence.
Add the mapping between the private IP address and hostname of an instance to the
/etc/hostsfile on each instance.
Run the following command on the ZooKeeper-enabled instance to start ZooKeeper:
bash $HOME/kafka_2.13-3.5.0/bin/zookeeper-server-start.sh -daemon $HOME/kafka_2.13-3.5.0/config/zookeeper.propertiesStart Broker on each of the three Broker-enabled instances.
NoteIf you perform a test without using the eRDMA feature, remove the
smc_runparameter from the command.On the first Broker-enabled instance, set the Broker ID to
0and start Broker. Replace<zookeeper ip>with the private IP address of the Zookeeper-enabled instance.KAFKA_HEAP_OPTS="-Xmx4G -Xms4G" smc_run bash $HOME/kafka_2.13-3.5.0/bin/kafka-server-start.sh -daemon $HOME/kafka_2.13-3.5.0/config/server.properties --override broker.id=0 --override log.dirs=$HOME/kafka-logs --override zookeeper.connect=<zookeeper ip>:2181On the second Broker-enabled instance, set the Broker ID to
1and start Broker. Replace<zookeeper ip>with the private IP address of the Zookeeper-enabled instance.KAFKA_HEAP_OPTS="-Xmx4G -Xms4G" smc_run bash $HOME/kafka_2.13-3.5.0/bin/kafka-server-start.sh -daemon $HOME/kafka_2.13-3.5.0/config/server.properties --override broker.id=1 --override log.dirs=$HOME/kafka-logs --override zookeeper.connect=<zookeeper ip>:2181On the third Broker-enabled instance, set the Broker ID to
2and start Broker. Replace<zookeeper ip>with the private IP address of the Zookeeper-enabled instance.KAFKA_HEAP_OPTS="-Xmx4G -Xms4G" smc_run bash $HOME/kafka_2.13-3.5.0/bin/kafka-server-start.sh -daemon $HOME/kafka_2.13-3.5.0/config/server.properties --override broker.id=2 --override log.dirs=$HOME/kafka-logs --override zookeeper.connect=<zookeeper ip>:2181
Step 4: Perform performance tests
Download the Benchmark tool and simulate an environment for which the highest available network bandwidth is configured. Test Kafka performance when eRDMA is enabled and when eRDMA is disabled. Compare the test results to evaluate the performance enhancement that eRDMA contributes to the Kafka cluster.
Log on to the stress testing instance. Download and compile Open Messaging Benchmark.
Download and install Maven that is an Open Messaging Benchmark compiler.
wget https://dlcdn.apache.org/maven/maven-3/3.8.8/binaries/apache-maven-3.8.8-bin.tar.gz tar -xf apache-maven-3.8.8-bin.tar.gz export PATH=$PATH:$HOME/apache-maven-3.8.8/bin/Modify the configuration file of the Maven repository to accelerate downloads.
vi $HOME/apache-maven-3.8.8/conf/settings.xmlAdd the following content to the
settings.xml mirrorstag. Then, save and close the configuration file.<mirror> <id>nexus-aliyun</id> <mirrorOf>central</mirrorOf> <name>Nexus aliyun</name> <url>http://maven.aliyun.com/nexus/content/groups/public</url> </mirror>Download and compile the Open Messaging Benchmark source code.
git clone https://github.com/openmessaging/benchmark.git cd benchmark && mvn clean verify -DskipTests
Specify the private IP addresses of Broker-enabled instances in the kafka-throughput.yaml file.
vi $HOME/benchmark/driver-kafka/kafka-throughput.yamlIn the kafka-throughput.yaml file, set the
bootstrap.serversparameter to the private IP addresses of Broker-enabled instances in the following format:<Private IP address of Broker 0>:9092,<Private IP address of Broker 1>:9092,<Private IP address of Broker 2>:9092.commonConfig: | bootstrap.servers=<172.17.XX.XX>:9092,<172.17.XX.XX>:9092,<172.17.XX.XX>:9092 default.api.timeout.ms=1200000 request.timeout.ms=1200000Specify the rate at which Kafka messages are sent to simulate the highest available network bandwidth for the environment in which the performance of the Kafka cluster is tested.
vi $HOME/benchmark/workloads/1-topic-100-partitions-1kb-4p-4c-200k.yamlChange the
producerRate: <Message sending rate>parameter in the file. The message sending rate is calculated by using the following formula:Available bandwidth of a Broker-enabled instance/Size of a single message.In this example, the instance type of the Broker-enabled instances is ecs.g8a.2xlarge. The maximum network bandwidth of this instance type is 4 Gbit/s, and the total bandwidth of the three Broker-enabled instances is 12 Gbit/s. Due to the Kafka triplicate mechanism, the actual available bandwidth is about one third of the total bandwidth. Therefore, the
available bandwidth of a Broker-enabled instanceis calculated as 12 Gbit/s divided by 3, which equals 4 Gbit/s (or 512 MB/s). Given that the test is conducted in theworkloadsdirectory, where the size of each message is 1 KB. Themessage sending rateis calculated as 512 MB/s divided by 1 KB, which equals 524,288. In this case, change theproducerRate: <Message sending rate>parameter toproducerRate: 524288in the file. In the actual test, set the message sending rate based on your business requirements.Use one of the following methods to test the performance of the Kafka cluster.
Perform a performance test when eRDMA is enabled
smc_run $HOME/benchmark/bin/benchmark --drivers $HOME/benchmark/driver-kafka/kafka-throughput.yaml $HOME/benchmark/workloads/1-topic-100-partitions-1kb-4p-4c-2000k.yamlDuring the Kafka performance test, you can perform the following operations at the same time:
Run the
smcss -acommand in another window on the stress testing instance to check whether SMC-R is used to transmit messages.Run the
sarcommand on each of the three Broker-enabled instances to check CPU utilization. For example, thesar 1 20command indicates that data is sampled once every 1 second for 20 times. Then, the CPU utilization values of the three Broker-enabled instances are summed up to obtain the total CPU utilization.
Perform a performance test when eRDMA is disabled
$HOME/benchmark/bin/benchmark --drivers $HOME/benchmark/driver-kafka/kafka-throughput.yaml $HOME/benchmark/workloads/1-topic-100-partitions-1kb-4p-4c-2000k.yamlImportantTo prevent remaining test data from adversely affecting the performance test in which eRDMA is disabled, delete the Broker and ZooKeeper records and restart Broker and Zookeeper on the instances.
Obtain the latency information from the results of the two tests to evaluate the performance enhancement that eRDMA contributes to the Kafka cluster.
Find and view the last
Aggregated Pub Latency (ms)entry.avgrepresents the average latency, 99% represents the P99 latency, and 999% represents the P999 latency.