By Chen Xingyu (Yumu), Technical Expert at Alibaba Cloud on Basic Technology Mid-Ends
etcd stores critical metadata for container cloud platforms. Alibaba has been using etcd for three years, and it played a critical role during the 2019 Double 11 Global Shopping Festival. This article introduces our best practices for optimizing etcd server performance and using the etcd client. We hope to help you run etcd clusters stably and efficiently.
etcd is a distributed key-value storage engine developed by CoreOS in Go. It can be used as a database to store the metadata of a distributed system, and it is widely used by major companies.
The following figure shows the basic architecture of etcd.
A cluster has three nodes: one leader and two followers. Each node synchronizes data by using the Raft algorithm and stores data in BoltDB. When one node fails, other nodes automatically elect a new leader to maintain the high availability of the cluster. The etcd client can complete a request by connecting to any node.
The preceding figure shows a standard etcd cluster architecture. An etcd cluster can be divided into the Raft layer (blue) and the storage layer (red). The storage layer is further divided into the treeIndex layer and the BoltDB layer, which stores key-value data persistently. Each of these layers can become a performance bottleneck for etcd.
The Raft layer synchronizes data over the network, so etcd performance is affected by the round-trip time (RTT) and bandwidth between nodes. Write-ahead logging (WAL) is affected by the disk write speed.
At the storage layer, etcd performance is affected by the disk fdatasync latency and by lock contention in the treeIndex layer. It can also be greatly affected by the BoltDB Tx lock and by the performance of BoltDB itself.
In addition, etcd performance is affected by the kernel parameters of the etcd host and the latency of the gRPC API layer.
The following shows how to optimize the etcd server performance.
The etcd server requires sufficient CPU and memory resources to keep etcd running. etcd is a disk I/O-dependent database program that requires solid state disks (SSDs) with low I/O latency and high throughput. etcd is also a distributed key-value storage system that requires good network conditions to run properly. Therefore, deploy etcd independently from other programs running on the host to prevent their impact on etcd performance.
For more information about the officially recommended configuration, see the official etcd documentation.
The etcd software is divided into several layers. The following shows how to optimize the etcd performance at these layers. The related code is available in the corresponding GitHub pull requests.
The following introduces a performance optimization made by Alibaba: a new algorithm, based on a segregated hashmap, for allocating and reclaiming pages in the freelist of etcd's internal storage. This optimization significantly improves the internal storage performance of etcd.
The preceding figure shows a single-node etcd architecture, in which BoltDB persistently stores all key-value data. The BoltDB performance is essential for the overall performance of etcd. A large amount of Alibaba metadata is stored in etcd. This exposes some of etcd's performance problems.
The preceding figure shows a core algorithm for allocating and reclaiming the internal storage of etcd. By default, etcd uses 4 KB pages to store data. As shown in the figure, the numbers indicate the page IDs. Pages in red are being used, whereas pages in white are not in use.
When data is deleted, etcd does not immediately return the storage space to the system, but keeps it in a page pool so that the space can be reused efficiently. This page pool is called the freelist. As shown in the figure, pages 43, 45, 46, 50, and 53 are in use, and the freelist holds the unused pages 42, 44, 47, 48, 49, 51, and 52.
When new data needs three consecutive pages, the old algorithm scans from the head of the freelist and returns the start page ID 47. This linear scan performs poorly when the freelist is large or heavily fragmented.
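The linear scan described above can be sketched roughly as follows. This is a minimal illustration and not the actual BoltDB code: the function name allocateLinear, the sorted slice of free page IDs, and the use of 0 as the "no span found" result are all assumptions made for the example.

```go
package main

import "fmt"

// allocateLinear (hypothetical name) scans a sorted slice of free page IDs
// and returns the start of the first run of n consecutive IDs together with
// the remaining free pages. It returns 0 and the unchanged slice when no
// such run exists. The scan is O(len(free)).
func allocateLinear(free []uint64, n int) (uint64, []uint64) {
	if n <= 0 {
		return 0, free
	}
	runStart := 0 // index where the current consecutive run begins
	for i := range free {
		if i > 0 && free[i] != free[i-1]+1 {
			runStart = i // gap in the page IDs: start a new run here
		}
		if i-runStart+1 == n {
			start := free[runStart]
			// Copy everything except the allocated run into a new slice.
			rest := append(append([]uint64{}, free[:runStart]...), free[i+1:]...)
			return start, rest
		}
	}
	return 0, free
}

func main() {
	// Free pages from the example in the text.
	free := []uint64{42, 44, 47, 48, 49, 51, 52}
	start, rest := allocateLinear(free, 3)
	fmt.Println(start, rest) // 47 [42 44 51 52]
}
```

Every allocation walks the list from the head, so the cost grows with both the size of the freelist and the number of short fragments that have to be skipped.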
To solve this problem, we designed and implemented a new freelist allocation and reclamation algorithm based on a segregated hashmap. The algorithm uses the size of a run of consecutive free pages as the hashmap key, and the value is the set of start page IDs of all runs of that size. When data needs to be stored on new pages, a single hashmap lookup with O(1) time complexity returns a start page ID.
For example, when data needs three consecutive pages, a lookup with key 3 immediately returns the start page ID 47.
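A minimal sketch of the size-keyed lookup might look like the following. The type and method names (spanIndex, add, remove, allocate) are hypothetical, and the fallback that splits a larger span when no exact match exists is an assumption added to make the example usable. That fallback iterates over the sizes, so only the exact-match path is O(1).

```go
package main

import "fmt"

type pgid uint64

// spanIndex (hypothetical) maps a span size, in pages, to the set of start
// page IDs of all free spans of exactly that size.
type spanIndex map[int]map[pgid]struct{}

func (idx spanIndex) add(start pgid, size int) {
	if idx[size] == nil {
		idx[size] = make(map[pgid]struct{})
	}
	idx[size][start] = struct{}{}
}

func (idx spanIndex) remove(start pgid, size int) {
	delete(idx[size], start)
	if len(idx[size]) == 0 {
		delete(idx, size)
	}
}

// allocate returns the start page of a free span that can hold n pages,
// or 0 if none is available.
func (idx spanIndex) allocate(n int) pgid {
	// Fast path: any span of exactly n pages, found with one map lookup.
	for start := range idx[n] {
		idx.remove(start, n)
		return start
	}
	// Assumed fallback: split a larger span and keep the remainder free.
	for size, starts := range idx {
		if size <= n {
			continue
		}
		for start := range starts {
			idx.remove(start, size)
			idx.add(start+pgid(n), size-n)
			return start
		}
	}
	return 0
}

func main() {
	// Free spans from the example: 42, 44, 47-49, 51-52.
	idx := spanIndex{}
	idx.add(42, 1)
	idx.add(44, 1)
	idx.add(47, 3)
	idx.add(51, 2)
	fmt.Println(idx.allocate(3)) // 47
}
```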
We also optimized the page release process by using the hashmap. For example, when pages 45 and 46 are released, they are merged with the adjacent free pages before and after them to form one contiguous free span that starts at page 44 and has a size of 6.
The new algorithm reduces the time complexity of allocation from O(n) to O(1) and that of reclamation from O(n log n) to O(1). Internal storage is no longer the limiting factor for etcd read and write performance, and etcd performance improves by dozens of times. The recommended storage size for a single cluster increases from 2 GB to 100 GB. This optimization is currently used within Alibaba and is available to the open-source community.
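One way to merge neighbors in constant time is to index free spans by both their first and their last page. The sketch below assumes that design. The names freeSpans, forward, backward, and release are hypothetical, and a complete implementation would also keep a size-keyed index up to date.

```go
package main

import "fmt"

type pgid uint64

// freeSpans (hypothetical) indexes every free span by both ends so that the
// neighbors of a released span can be found with two map lookups.
type freeSpans struct {
	forward  map[pgid]int // start page ID -> span size
	backward map[pgid]int // end page ID   -> span size
}

func newFreeSpans() *freeSpans {
	return &freeSpans{forward: map[pgid]int{}, backward: map[pgid]int{}}
}

func (f *freeSpans) addSpan(start pgid, size int) {
	f.forward[start] = size
	f.backward[start+pgid(size)-1] = size
}

func (f *freeSpans) delSpan(start pgid, size int) {
	delete(f.forward, start)
	delete(f.backward, start+pgid(size)-1)
}

// release frees the pages [start, start+size) and merges them with the free
// spans directly before and after. It returns the merged span.
func (f *freeSpans) release(start pgid, size int) (pgid, int) {
	// Merge with the free span that ends right before start, if any.
	if prev, ok := f.backward[start-1]; ok {
		prevStart := start - pgid(prev)
		f.delSpan(prevStart, prev)
		start, size = prevStart, size+prev
	}
	// Merge with the free span that begins right after the released pages.
	next := start + pgid(size)
	if nextSize, ok := f.forward[next]; ok {
		f.delSpan(next, nextSize)
		size += nextSize
	}
	f.addSpan(start, size)
	return start, size
}

func main() {
	// Free spans from the example: 42, 44, 47-49, 51-52.
	f := newFreeSpans()
	f.addSpan(42, 1)
	f.addSpan(44, 1)
	f.addSpan(47, 3)
	f.addSpan(51, 2)
	fmt.Println(f.release(45, 2)) // 44 6
}
```

Releasing pages 45 and 46 finds page 44 through the end-page index and the span starting at page 47 through the start-page index, which yields the six-page span from the text without scanning or sorting the freelist.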
These software optimizations are all available in the new etcd version.
The following introduces the best practices for ensuring optimal etcd client performance.
The etcd server provides the following APIs to the etcd clients: Put, Get, Watch, Transactions, and Leases.
We use the following best practices when calling these APIs on the etcd client:
Observe the preceding best practices when using the etcd client to ensure that your etcd cluster runs stably and efficiently.
Let's summarize what we have learned in this article.
I hope that this article can help you run your etcd cluster stably and efficiently.