New-Gen Cluster Non-Inductive Data Migration of Alibaba Cloud In-Memory Database Tair

By Yuxun

Redis is a popular in-memory database used in a wide variety of business scenarios. Open-source Redis provides a distributed cluster solution, Redis Cluster, that increases memory capacity while maintaining high performance. A cluster architecture inevitably involves elastic scaling of data shards and data migration between shards. However, data migration in Redis Community Edition clusters has long been a pain point for developers and O&M personnel.

To overcome the shortcomings of data migration in Redis Community Edition, Alibaba Cloud Tair has launched a new generation of non-inductive data migration architecture, that is, migration that the business does not perceive. It is based on the principle of slot replication and has provided long-term, stable service on Alibaba Cloud.

Tair is a cloud-native in-memory database developed by Alibaba Cloud. It is fully compatible with Redis and provides rich data models and enterprise-level capabilities to help customers build real-time online services. Tair also supports persistent memory, a new type of storage medium that reduces cost by more than 30% compared with memory while persisting data and delivering performance close to that of memory. Tair is widely used by customers in industries such as finance, manufacturing, healthcare, and internet for high-speed query and computing scenarios.

This article describes data migration in the open-source Redis Community Edition, the early enhancements that Alibaba Cloud Tair for Redis clusters made to community-style data migration, and the principles behind the evolution to a new generation of data migration based on slot replication.

1. Data Migration Technology for the Open-source Redis Cluster

The open-source Redis Cluster (7.0) uses a distributed architecture with no central control node. The cluster topology and other metadata are propagated between nodes through the gossip protocol. The slot is the smallest unit of data in the cluster topology, and each node owns a subset of the slots. Data migration means moving slots between nodes.
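To make the slot concept concrete, here is a minimal Python sketch of how a key maps to a slot. It follows the rule in the Redis Cluster specification (CRC16 of the key modulo 16384, with optional hash tags in braces), which this article does not spell out. The function name `key_slot` is chosen only for illustration.

```python
import binascii

SLOT_COUNT = 16384  # fixed number of hash slots in Redis Cluster


def key_slot(key: str) -> int:
    """Return the hash slot for a key, honoring {hash tags}."""
    data = key.encode("utf-8")
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        # Only a non-empty tag between the first '{' and the next '}' is hashed.
        if end != -1 and end > start + 1:
            data = data[start + 1:end]
    # crc_hqx with an initial value of 0 is the CRC16-CCITT (XMODEM) variant.
    return binascii.crc_hqx(data, 0) % SLOT_COUNT


print(key_slot("foo"))
# Keys that share a hash tag land in the same slot.
print(key_slot("{user1000}.following") == key_slot("{user1000}.followers"))
```

Because every key in a slot moves together, keys that share a hash tag stay on the same node before and after a migration.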

The open-source Redis cluster migrates slots per key: a single slot is migrated by traversing its keys and moving them in batches.

Process of Per Key Migration

Specific Steps:

Set the importing state on the target shard.
Set the migrating state on the source shard.
Obtain the list of keys in the migrated slots from the source shard.
Send the MIGRATE command to the source shard with a batch of keys from that list. MIGRATE is a synchronous, blocking command: the source shard sends the keys to the target shard, which restores them, and then the source shard deletes its own copies. Repeat this step until the source shard has no keys left in the migrated slots.
Run the SETSLOT command to assign the migrated slots to the target node.
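The steps above can be sketched as a command plan. The Python function below only builds the sequence of Redis commands and does not connect to a server. The function name, the node labels, and the `keys` argument (a stand-in for what `CLUSTER GETKEYSINSLOT` would return) are assumptions made for illustration.

```python
from typing import List, Tuple

Command = Tuple[str, List[str]]  # (node that receives the command, argv)


def plan_per_key_migration(slot: int, source_id: str, target_id: str,
                           target_host: str, target_port: int,
                           keys: List[str], batch_size: int = 2,
                           timeout_ms: int = 5000) -> List[Command]:
    """Build the command sequence for migrating one slot key by key."""
    s = str(slot)
    plan: List[Command] = [
        ("target", ["CLUSTER", "SETSLOT", s, "IMPORTING", source_id]),
        ("source", ["CLUSTER", "SETSLOT", s, "MIGRATING", target_id]),
    ]
    for i in range(0, len(keys), batch_size):
        batch = keys[i:i + batch_size]
        # Ask the source for the next batch of keys in the slot ...
        plan.append(("source", ["CLUSTER", "GETKEYSINSLOT", s, str(batch_size)]))
        # ... then move that batch; the source blocks until the target replies.
        plan.append(("source", ["MIGRATE", target_host, str(target_port), "",
                                "0", str(timeout_ms), "KEYS", *batch]))
    # Hand the slot over: notify the target first, then the source.
    plan.append(("target", ["CLUSTER", "SETSLOT", s, "NODE", target_id]))
    plan.append(("source", ["CLUSTER", "SETSLOT", s, "NODE", target_id]))
    return plan


for node, argv in plan_per_key_migration(12182, "src-id", "dst-id",
                                         "10.0.0.2", 6379, ["k1", "k2", "k3"]):
    print(node, " ".join(a if a else '""' for a in argv))
```

Each `MIGRATE` entry in the plan is one blocking round trip on the source shard, which is where the pain points described next come from.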

Pain Points of Per Key Migration

This process also introduces obvious problems:

While the MIGRATE command is executing, the source shard cannot process other business requests because the command blocks synchronously. Migrating big keys causes long blocking and further amplifies the impact on service availability.
If a requested key no longer exists on the source shard, the request is redirected to the target shard, which adds one more routing hop.
After a multi-key command is redirected to the target shard, the target shard returns a TRYAGAIN error to the client if some of the keys do not exist there yet, and the multi-key request fails.
During the migration, the data of a single slot is scattered across two shards. If an exception occurs, the migration cannot be rolled back cleanly.
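The redirection behavior behind the second and third pain points can be illustrated with a small model. This is a simplified sketch of how the two shards answer requests for a slot that is mid-migration, not the Redis source code. The function `route` and its parameters are hypothetical.

```python
def route(role, keys, local_keys, asking=False):
    """Return how a node answers a request for a slot that is mid-migration.

    role is "source" (slot in migrating state) or "target" (slot in importing
    state). local_keys is the set of keys the node currently holds.
    """
    missing = [k for k in keys if k not in local_keys]
    if role == "source":
        # Keys still present are served locally; otherwise the client is
        # sent to the target with an ASK redirection.
        return "ASK" if missing else "OK"
    if not asking:
        # The target does not own the slot yet, so requests that did not
        # follow an ASK redirection are sent back to the source.
        return "MOVED"
    if len(keys) > 1 and missing:
        # Multi-key request while only part of the keys has arrived.
        return "TRYAGAIN"
    return "OK"


source_keys = {"a"}   # "b" has already been migrated
target_keys = {"b"}
print(route("source", ["a"], source_keys))
print(route("source", ["a", "b"], source_keys))
print(route("target", ["a", "b"], target_keys, asking=True))
```

In this model a multi-key request that spans migrated and unmigrated keys cannot succeed on either shard until the remaining keys have been moved.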

Early versions of Alibaba Cloud Tair for Redis clusters also used per-key data migration, with a number of enhancements.

2. Enhanced Technology for Early Per Key Data Migration in Tair for Redis Cluster

The enhancements that Tair for Redis made to per-key data migration focus on removing synchronous blocking, mitigating the blocking caused by big keys, and shortening the interaction time of each migration step, so as to increase migration speed and reduce the impact that the business can perceive. The specific improvements are described below.

Change the Synchronous Process to Asynchronous Process

The source node performs the following three operations when executing the native migrate command:

Dump the keys and serialize them into a packaged payload.
Send the packaged payload to the target node through the RESTORE command.
Delete the keys after receiving the reply from the target node.

All three operations are executed inside one synchronous blocking command, so the node cannot serve other business requests for a long time while the command runs.

Tair for Redis uses its own in-kernel migration state machine to split these three operations into three asynchronous steps, which shortens the time the node spends blocked.
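The idea can be sketched as a small state machine that performs one short step per event-loop iteration, so that client requests are served in between. This is an illustrative Python model, not Tair's implementation. The class, the phase names, and the dictionary-based "nodes" are all assumptions, and the sketch ignores writes to the key that arrive between phases.

```python
from enum import Enum, auto


class Phase(Enum):
    DUMP = auto()
    SEND = auto()
    DELETE = auto()
    DONE = auto()


class KeyMigration:
    """Migrates one key in three separate, short steps."""

    def __init__(self, key, source, target):
        self.key, self.source, self.target = key, source, target
        self.phase = Phase.DUMP
        self.payload = None

    def step(self):
        if self.phase is Phase.DUMP:
            self.payload = self.source[self.key]   # stands in for serialization
            self.phase = Phase.SEND
        elif self.phase is Phase.SEND:
            self.target[self.key] = self.payload   # stands in for RESTORE and its reply
            self.phase = Phase.DELETE
        elif self.phase is Phase.DELETE:
            del self.source[self.key]
            self.phase = Phase.DONE


def event_loop(migration, client_requests):
    """Interleave migration steps with client requests and record the order."""
    log = []
    pending = list(client_requests)
    while migration.phase is not Phase.DONE or pending:
        if migration.phase is not Phase.DONE:
            log.append(f"migrate:{migration.phase.name}")
            migration.step()
        if pending:
            log.append(f"client:{pending.pop(0)}")
    return log


source, target = {"k": "v"}, {}
print(event_loop(KeyMigration("k", source, target), ["GET a", "SET b 1"]))
```

With a single synchronous command, both client requests in this example would have waited until all three phases had finished.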

Split a Big Key into Chunks for Migration

When per-key migration encounters a big key, the time spent in each of the three phases (dumping the key, transferring the payload, and restoring the key) grows accordingly. Making the phases asynchronous does not shorten the time consumed by any single operation.

Before migrating a key, Tair for Redis determines whether it is a big key. A big key is split into chunks, and the three stages of dumping, transferring, and restoring are carried out chunk by chunk. Each stage takes less time, which effectively reduces how long access to other keys is blocked, although the total time needed to migrate the big key increases.
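A minimal sketch of chunked migration follows, using a hash-like value modeled as a Python dictionary. The threshold, the chunk size, and the function names are illustrative assumptions. The article does not state how Tair detects big keys or sizes its chunks.

```python
def iter_chunks(fields: dict, chunk_size: int):
    """Yield the fields of a big key in chunks of at most chunk_size."""
    items = list(fields.items())
    for i in range(0, len(items), chunk_size):
        yield dict(items[i:i + chunk_size])


def migrate_key(key, source, target, big_key_threshold=3, chunk_size=2):
    """Move one key and return the number of dump/transfer/restore rounds."""
    value = source[key]
    if len(value) <= big_key_threshold:
        target[key] = dict(value)          # small key: one round
        rounds = 1
    else:
        target[key] = {}
        rounds = 0
        for chunk in iter_chunks(value, chunk_size):
            target[key].update(chunk)      # each round handles one short chunk
            rounds += 1
    del source[key]                        # delete only after the last chunk
    return rounds


source = {"h": {f"f{i}": i for i in range(5)}}
target = {}
print(migrate_key("h", source, target), target)
```

Each round is short, so other requests can be served between rounds, but the big key needs more rounds in total, which matches the trade-off described above.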

Unresolved Pain Points

These optimizations do not solve every pain point. The following problems remain:

MOVED and ASK redirection semantics are still present throughout the migration process.
A multi-key request in which some of the keys are missing on the shard still returns a TRYAGAIN error to the client.
A migration still cannot be rolled back reliably with one click.

After continuous exploration and evolution, Alibaba Cloud Tair for Redis cluster has introduced a new generation of data migration technology based on slot replication.

3. Non-Inductive Data Migration Technology for Tair for Redis Cluster based on Slot Replication

Primary and secondary Redis instances synchronize data through replication. Building on primary/secondary replication, Tair for Redis implements slot replication (Slot Mig) in its kernel. In short, it replicates a subset of slots from one node to another to synchronize their data, and adds a central controller (CS) that precisely and dynamically controls Slot Wait, a millisecond-level write prohibition. When the cluster topology is switched, client requests are redirected to the new data node through MOVED semantics. As a result, business reads and writes are not lost, and the migration process does not affect the user's business.

Process of Slot Mig Migration

Specific Steps:

CS sends the data migration command slotRepl to the target shard.
The target node sends the slotPsync command to the source node.
The source node generates a snapshot of the existing data in the migrated slots.
The target node receives the snapshot of the existing data in the migrated slots.
Incremental data for the migrated slots is synchronized from the source node to the target node.
CS dynamically evaluates the lag of the incremental synchronization. Once the lag falls within an acceptable range, CS sends slotReplWait to the source node, which prohibits writes to the slots being migrated. The lag is measured in milliseconds. If the write prohibition lasts longer than a threshold, it is automatically rolled back so that the business can write again.
CS waits for synchronization to complete and then switches the cluster topology. Requests that clients still send to the old node are redirected to the node in the new cluster topology through MOVED semantics.
The data of the migrated slots is deleted from the source node.
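The controller logic in the write-prohibition and topology-switch steps can be sketched as a small decision function. This is an assumption-laden model of the behavior described above, not Tair's implementation. The state names, the parameters `pause_window_ms` and `pause_timeout_ms`, and their default values are hypothetical.

```python
def controller_step(state, lag_ms, paused_ms,
                    pause_window_ms=5, pause_timeout_ms=50):
    """Decide the next migration state from the current replication lag.

    state:     "syncing" (incremental sync), "paused" (writes prohibited),
               or "switched" (topology switched to the target)
    lag_ms:    estimated time the target needs to catch up with the source
    paused_ms: how long writes to the migrating slots have been prohibited
    """
    if state == "syncing":
        # Prohibit writes only when the remaining lag is small enough.
        return "paused" if lag_ms <= pause_window_ms else "syncing"
    if state == "paused":
        if lag_ms == 0:
            return "switched"          # target caught up: switch topology
        if paused_ms > pause_timeout_ms:
            return "syncing"           # roll back the write prohibition
        return "paused"
    return state


print(controller_step("syncing", lag_ms=20, paused_ms=0))
print(controller_step("syncing", lag_ms=3, paused_ms=0))
print(controller_step("paused", lag_ms=0, paused_ms=2))
print(controller_step("paused", lag_ms=1, paused_ms=60))
```

Because the source keeps the full slot until the topology switch, returning to the syncing state leaves the source fully writable and the migration can simply be retried.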

Benefits of Slot Mig Migration

The migration process described above has the following benefits:

No read-only errors occur, and the client does not experience transient connection errors.
MOVED is returned to existing connections only after the cluster topology has been switched.
Big-key and multi-key requests are not affected.
The data of a single slot is no longer scattered across two nodes, so the process can be rolled back and restored with one click.

4. Summary

Let's summarize the features of the new generation of non-inductive data migration of Alibaba Cloud Tair for Redis cluster and the data migration of the open-source Redis Community Edition.

The non-inductive data migration technology of Alibaba Cloud Tair for Redis cluster is already used for specification changes (scaling nodes out or in) on Alibaba Cloud. You are welcome to purchase and try it.
