This topic lists the drawbacks of the scaling solutions available for open source Redis clusters and ApsaraDB for Redis cluster instances and describes the imperceptible scaling solution available for Tair cluster instances.
Open source Redis clusters often require elastic scaling of data nodes and data migration between shards. However, the common scaling solutions have issues such as slow key migration, unavailability of commands that involve multiple keys, inability to migrate Lua scripts, freezing or even high availability switchover triggered by large key migration, and a complex rollback process.
In response to these issues, Tair developed a new architecture based on slot replication to ensure imperceptible data migration. This architecture optimizes the thread-scheduling algorithm within instances to allow them to be managed in an efficient and accurate manner. This is the principle of imperceptible scaling.
Common scaling solutions for open source Redis clusters and their drawbacks
Elastic scaling of open source Redis clusters
Open source Redis clusters use a gossip protocol to propagate cluster state and migrate data between nodes key by key. The slot is the smallest unit of migration: a slot is moved by traversing the keys in the slot and migrating them one after another. This scaling solution has the following issues:
- Low stability
  - Commands that involve multiple keys in the same slot may fail to run because data is being migrated by key.
  - Lua scripts cannot be replicated while data is being migrated. As such, Lua scripts may be lost after the data migration is complete.
  - While data is being replicated, migration of large keys may cause freezing or even errors that may trigger high availability switchover.
- High O&M difficulty
  - If an error is returned during data migration, you must manually restore the database data. This process is difficult, time-consuming, and error-prone.
  - It takes a long time to scale clusters because migrating data by key is time-consuming. As such, scaling is often performed during off-peak hours, and business may be affected.
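The multi-key failure described above can be illustrated with a minimal sketch. The `key_slot` function follows the hash slot rule of open source Redis clusters (CRC16 of the key, or of its hash tag, modulo 16384). The `MigratingSlot` class is a hypothetical toy model written for this illustration, not Redis code: it only mimics how a source node responds while a slot is split between two nodes during key-by-key migration.

```python
import binascii

SLOT_COUNT = 16384  # number of hash slots in an open source Redis cluster


def key_slot(key: str) -> int:
    """Map a key to its hash slot: CRC16 (XMODEM variant) modulo 16384.

    If the key contains a non-empty {hash tag}, only the tag is hashed,
    which lets related keys share one slot.
    """
    data = key.encode()
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        if end > start + 1:  # skip a missing or empty tag such as "{}"
            data = data[start + 1:end]
    return binascii.crc_hqx(data, 0) % SLOT_COUNT


class MigratingSlot:
    """Toy model (assumption) of one slot that is being moved key by key."""

    def __init__(self, keys):
        self.source = dict(keys)  # keys that are still on the source node
        self.target = {}          # keys that were already moved

    def migrate_one(self, key):
        self.target[key] = self.source.pop(key)

    def mget_on_source(self, *keys):
        present = [k in self.source for k in keys]
        if all(present):
            return [self.source[k] for k in keys]
        # The keys are split between nodes (TRYAGAIN) or all gone (ASK),
        # so the source node cannot serve the command.
        raise RuntimeError("TRYAGAIN" if any(present) else "ASK")


slot = MigratingSlot({"{user:1}:name": "Ann", "{user:1}:email": "ann@example.com"})
slot.migrate_one("{user:1}:name")  # migration is now halfway through the slot
try:
    slot.mget_on_source("{user:1}:name", "{user:1}:email")
except RuntimeError as err:
    print(err)  # TRYAGAIN
```

Both keys map to the same slot because they share the hash tag `user:1`, yet the command still fails while the slot is only partially moved. Migrating the slot as a single unit avoids this intermediate state.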
Scaling based on data synchronization and migration components
This solution relies on middleware components rather than open source Redis clusters to migrate data. For example, to perform scaling operations, you can create a cluster, use a middleware component to migrate data to the cluster, and then use a load balancer component to switch access paths. This scaling solution has the following issues:
- It takes a long time to synchronize full data.
- Costs are high because you must create two sets of resources to perform scaling operations.
- Clients are disconnected when a load balancer component performs a switchover. It takes a long time for the switchover to take effect, and the service may be unavailable for up to 10 seconds.
Imperceptible scaling solution for ApsaraDB for Redis cluster instances and its drawbacks
- Data migration performed during scaling operations affects instances, and instances may become read-only for a few seconds.
Note: Scaling operations may require data migration. During data migration, when the proportion of synchronized incremental data to the total amount of incremental data to be synchronized is less than a threshold, the instance becomes read-only. After the remaining incremental data is synchronized, clients can reconnect to the instance by using a new endpoint. If an update request is sent to the instance while the instance is in the read-only state, the request is rejected and a read-only error is returned. Clients often cannot handle read-only errors gracefully, which may affect your business.
- Data migration performed during scaling operations competes for resources with regular operations. As such, scaling operations are often performed during off-peak hours and become less flexible.
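To show what handling a read-only window means for a client, the following minimal sketch retries a rejected write with exponential backoff. Everything in it is an assumption made for illustration: `ReadOnlyError`, `FakeInstance`, and `write_with_retry` are hypothetical names, and the fake instance only simulates a server that rejects a fixed number of writes. It is not the API of any Redis client library.

```python
import time


class ReadOnlyError(Exception):
    """Stands in for the read-only error returned for a rejected write."""


class FakeInstance:
    """Hypothetical instance that rejects the first `readonly_calls` writes."""

    def __init__(self, readonly_calls):
        self.readonly_calls = readonly_calls
        self.data = {}

    def set(self, key, value):
        if self.readonly_calls > 0:
            self.readonly_calls -= 1
            raise ReadOnlyError("instance is read-only during data migration")
        self.data[key] = value
        return "OK"


def write_with_retry(instance, key, value, attempts=5, base_delay=0.01):
    """Retry a rejected write, doubling the delay after each failure.

    The last failure is re-raised so that the caller can decide what to do.
    """
    for attempt in range(attempts):
        try:
            return instance.set(key, value)
        except ReadOnlyError:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)


instance = FakeInstance(readonly_calls=3)
print(write_with_retry(instance, "order:42", "paid"))  # OK
```

Even with such a wrapper, every write issued during the window is delayed, and a window longer than the retry budget still surfaces as an error. This is why a read-only period of several seconds is hard to hide from the business.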
Imperceptible scaling solution for Tair cluster instances
Tair cluster instances provide an imperceptible scaling solution built on a new architecture that efficiently handles operations on clusters with centralized control components.
- Imperceptible scaling
While a Tair cluster instance is being scaled, your clients are not affected, your business is not interrupted, and the instance does not enter a read-only state that clients can perceive. You can scale a Tair cluster instance at any time.
The key to imperceptible scaling is reducing the read-only time period on instances during data migration. Tair cluster instances dynamically estimate the amount of time required to migrate the remaining incremental data and keep the read-only time period within milliseconds. Because this period is far shorter than the TCP retransmission time, which is measured in hundreds of milliseconds, the read-only state is theoretically invisible to clients. While an instance is in the read-only state, write requests for the keys to be migrated are cached on the instance instead of being applied. After the data migration is complete, clients receive redirection messages. At the same time, the management system and the database engine work together to update instance information as soon as the data migration is complete. This process ensures that scaling operations are imperceptible to clients.
- Smooth scaling
Tair optimizes the thread-scheduling algorithm within cluster instances to implement fine-grained management of data migration tasks. This improves thread execution efficiency from 10% to a maximum of 80%. You can specify a custom efficiency value within this range. This way, the data migration speed is maximized without impacting your business. Tair cluster instances also support fine-grained scaling without increasing the response time (RT), which prevents high availability switchover caused by network jitter. This ensures high data reliability.
- Efficient and easy O&M
Tair cluster instances can address the scaling issues of open source Redis clusters by using the following methods:
  - Pre-backup in the background: Tair instances can replicate the full data of an instance in the background before the migration starts. This method does not affect online services and prevents freezing caused by large key migration.
  - Rollback with a few clicks: You can roll back instances with a few clicks if exceptions occur during scaling.
  - Data migration by slot: Data can be migrated by slot. This ensures that commands that involve multiple keys in the same slot can run as expected.
  - Lua script replication: During data migration, Lua scripts can be replicated to prevent Lua script loss.
  - Horizontal scaling: Up to 256 shards can be added to or removed from a single instance.
- Low costs
Compared with solutions that require a middleware component, this solution reduces costs because you do not need to create two sets of resources.
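The cut-over decision described under imperceptible scaling, in which the instance estimates how long the remaining incremental data takes to migrate and keeps the read-only period within milliseconds, can be sketched as follows. This is a simplified model under stated assumptions, not the Tair implementation: `MigrationProgress`, `estimated_drain_ms`, `can_cut_over`, and the 5 ms budget are hypothetical, and the estimate assumes that the backlog stops growing once writes are held.

```python
from dataclasses import dataclass


@dataclass
class MigrationProgress:
    backlog_bytes: int       # incremental data not yet applied on the target
    sync_bytes_per_s: float  # observed replication throughput


def estimated_drain_ms(progress: MigrationProgress) -> float:
    """Estimate how long the backlog takes to drain once writes are held."""
    if progress.sync_bytes_per_s <= 0:
        return float("inf")  # no throughput observed: never cut over
    return progress.backlog_bytes / progress.sync_bytes_per_s * 1000.0


def can_cut_over(progress: MigrationProgress, budget_ms: float = 5.0) -> bool:
    """Allow the brief read-only pause only if it fits the budget.

    The budget (an assumed value) must stay far below the TCP
    retransmission time, which is measured in hundreds of milliseconds.
    """
    return estimated_drain_ms(progress) <= budget_ms


small = MigrationProgress(backlog_bytes=64 * 1024, sync_bytes_per_s=100e6)
large = MigrationProgress(backlog_bytes=10 * 1024 * 1024, sync_bytes_per_s=100e6)
print(can_cut_over(small))  # True: the estimate is below 1 ms
print(can_cut_over(large))  # False: the estimate is above 100 ms
```

In this model, the migration keeps synchronizing incremental data in the background until the estimate fits the budget, and only then pauses writes. A pause that starts with a large backlog would instead last long enough for clients to notice.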