The storage-operator component provides cross-zone migration and multi-zone spreading for stateful applications (StatefulSets) that use disk volumes. If an exception occurs during migration, the component can restore the application in the original zone through its precheck and rollback mechanisms, ensuring business availability. This topic describes how to migrate stateful applications that use disk volumes across zones.
Scenarios
You might need to migrate a deployed stateful application to another zone because of planning changes, a scale-out that requires multi-zone deployment, or insufficient resources in the current zone.
NAS and OSS volumes can be used across zones and mounted by multiple nodes. Disks, however, are bound to a single zone, and their storage claims and volumes cannot be reused in another zone. In this case, you must migrate stateful applications that use disk volumes to the new zone.
This feature is intended for applications that can tolerate business interruptions. For stateful applications with multiple replicas, the application is scaled to zero replicas before migration to ensure data consistency, and then restored to the original number of replicas all at once after the disk migration is complete. Rolling migration is not used.
Important: Business interruptions may occur during cross-zone migration of a stateful application. The interruption duration depends on factors such as the number of replicas, container startup speed, and the capacity of the disks that are used.
How it works and the migration procedure
Cross-zone migration of applications that use disks relies on the disk snapshot feature, and you can set a retention period for the snapshots created during migration. For more information about disk snapshots, see Introduction to snapshots. For information about snapshot billing, see Snapshot billing.
The storage-operator component migrates a stateful application that uses disk volumes as follows:
1. Performs prechecks, such as checking whether the application to be migrated is running properly and whether it has disks that need to be migrated. If a check fails, the migration does not proceed.
2. Scales the stateful application to zero replicas. At this point, the application is paused.
3. Creates snapshots for the disks mounted by the stateful application to be migrated. Snapshots can be used across zones.
4. After confirming that the snapshots are available, uses them to create new disks in the target zone. The new disks contain the same data as the original disks.
5. Rebuilds storage claims with the same names and their corresponding new volumes, and binds them to the new disks.
6. Restores the stateful application to its original number of replicas. The replicas are automatically associated with the rebuilt storage claims and mount the new disks.
Important: After the prechecks are completed and migration begins, each step has its own failure rollback strategy. To ensure that the application can still mount the original disks after a rollback and to avoid data loss, confirm that the stateful application is running properly after migration before you delete the original disks.
7. (Optional) After confirming that the stateful application is running properly, delete the original volumes and the corresponding disks. For information about disk billing, see Block storage billing.
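A minimal sketch of this final verification and cleanup step, using placeholder names and assuming retainSourcePV is set to "true" so that the original volumes remain in the cluster in the Released state:
kubectl get statefulset <sts-name> -n <namespace>   # confirm the migrated application is running properly (replicas ready)
kubectl get pv | grep Released                      # the original, pre-migration volumes remain in the Released state
kubectl delete pv <pv-name>                         # remove a released volume object; delete the underlying disk in the ECS console afterwards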
Usage notes
All storage used by the stateful application to be migrated must be ESSD disks. A quick way to check the disk category used by your application is sketched after this list.
To reduce the time required to create snapshots, this feature uses instant access snapshots during migration. For more information, see Snapshot instant access. Instant access snapshots currently support only ESSD disks. If your application uses non-ESSD disks, you can handle this in one of the following ways:
Change the disk type to ESSD before migration. For more information, see Change the disk category.
Manually create a snapshot for a single disk volume to rebuild the disk across zones.
The target zone must support ESSD disks, and the cluster must have nodes in the target zone that support ESSD disks available for scheduling.
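As a quick check of the disk category that an application uses, you can inspect the StorageClass of its storage claims; the sketch below assumes ACK disk StorageClasses that declare the category in their type parameter (namespace and names are placeholders):
kubectl get pvc -n <namespace>                                        # the STORAGECLASS column shows the class used by each claim
kubectl get sc <storageclass-name> -o jsonpath='{.parameters.type}'  # ESSD-backed classes typically report cloud_essd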
Prerequisites
Your cluster runs Kubernetes 1.20 or later, and the Container Storage Interface (CSI) plug-in is installed in the cluster.
The storage-operator component v1.26.2-1de13b6-aliyun or later is installed in the cluster; a version-check sketch follows this list. For more information, see Manage the storage-operator component.
The csi-plugin and csi-provisioner components are installed in the cluster, and the installed csi-provisioner is the non-managed version.
Note: If the managed version of csi-provisioner is currently installed, you can uninstall it and install the non-managed version instead. After switching the CSI components, run kubectl delete pod -n kube-system <storage-controller-pod-name> to restart the storage controller.
If your cluster is an ACK dedicated cluster, you must make sure that the worker Resource Access Management (RAM) role and master RAM role of your cluster have the permissions to call the ModifyDiskSpec operation of the Elastic Compute Service (ECS) API. For more information, see Create a custom policy.
Note: If your cluster is an ACK managed cluster, you do not need to grant the permissions to call the ModifyDiskSpec operation to the cluster.
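The following is a minimal sketch for checking the installed components, assuming they are deployed in the kube-system namespace under these names (typical for ACK, but verify in your cluster):
kubectl get deployment -n kube-system storage-operator -o jsonpath='{.spec.template.spec.containers[0].image}'  # the image tag shows the storage-operator version
kubectl get deployment -n kube-system csi-provisioner   # present only when the non-managed csi-provisioner is installed
kubectl get pods -n kube-system | grep csi-plugin        # csi-plugin pods run on each node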
Usage
Modify the storage-operator ConfigMap in the cluster to enable the storage-controller.
kubectl patch configmap/storage-operator \
  -n kube-system \
  --type merge \
  -p '{"data":{"storage-controller":"{\"imageRep\":\"acs/storage-controller\",\"imageTag\":\"\",\"install\":\"true\",\"template\":\"/acs/templates/storage-controller/install.yaml\",\"type\":\"deployment\"}"}}'
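A quick sketch for confirming that the controller was deployed, assuming the resulting Deployment is named storage-controller in the kube-system namespace (as suggested by the template path above):
kubectl get deployment -n kube-system storage-controller   # assumed Deployment name; check that it is Available
kubectl get pods -n kube-system | grep storage-controller  # the controller pod should be Running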
Create a stateful application migration task in the cluster. The parameters are described after the example; a sketch for checking the status of the task you created follows the parameter descriptions.
cat <<EOF | kubectl apply -f -
apiVersion: storage.alibabacloud.com/v1beta1
kind: ContainerStorageOperator
metadata:
  name: default
spec:
  operationType: APPMIGRATE
  operationParams:
    stsName: web
    stsNamespace: default
    stsType: kube
    targetZone: cn-beijing-h,cn-beijing-j
    checkWaitingMinutes: "1"
    healthDurationMinutes: "1"
    snapshotRetentionDays: "2"
    retainSourcePV: "true"
EOF
operationType (Required)
Set the value to APPMIGRATE, which indicates that the current operation is a stateful application migration.
stsName (Required)
The name of the stateful application. Only a single stateful application can be specified.
Note: When migration tasks are deployed for multiple stateful applications, the component migrates them sequentially in the order in which the tasks were deployed.
stsNamespace (Required)
The namespace in which the stateful application is located.
targetZone (Required)
The list of target zones for the migration. Separate multiple zones with commas (,). Example: cn-beijing-h,cn-beijing-j.
If a disk mounted by the application is already located in one of the listed zones, it is not migrated.
When more than one target zone is specified, the remaining disks are spread across the target zones in the order in which the zones appear in the list.
stsType (Optional)
The type of the stateful application. Default value: kube. Valid values:
kube: a native StatefulSet.
kruise: an advanced StatefulSet provided by the OpenKruise component.
checkWaitingMinutes (Optional)
The polling interval, in minutes, at which the status of the stateful application is checked after it starts in the target zone. Default value: "1", which means the status is checked once per minute until the number of available replicas matches the expected number, or the application is rolled back to the original zone after multiple failed checks.
Important: For applications with many replicas, long image pull times, or long business startup times, increase the polling interval appropriately to avoid rollbacks caused by too many failed retries.
healthDurationMinutes (Optional)
The interval, in minutes, before a secondary check. A secondary check is performed after the migration is complete and the number of available replicas matches the expected number; the system waits for the specified time before performing it, which improves migration reliability for data-sensitive workloads. Default value: "0", which means no secondary check is performed.
snapshotRetentionDays (Optional)
The retention period, in days, of the instant access snapshots created during migration. Valid values:
"1": default value. Snapshots are retained for one day.
"-1": snapshots are retained permanently.
retainSourcePV (Optional)
Specifies whether to retain the original disks and the corresponding volume resources in the cluster. Valid values:
"false": default value. Do not retain them.
"true": retain them. You can log on to the ECS console to find the original disk instances, and the corresponding volume resources in the cluster are not deleted; the volumes remain in the Released state.
Examples
The test cluster is an ACK Pro cluster that contains multiple nodes from different zones, as shown below:
Zone B: cn-shanghai.192.168.5.245
Zone G: cn-shanghai.192.168.2.214
Zone M: cn-shanghai.192.168.3.236, cn-shanghai.192.168.3.237
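To confirm which zone each node belongs to, you can list the nodes with the standard zone label, for example:
kubectl get nodes -L topology.kubernetes.io/zone   # the ZONE column shows each node's zone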

Step 1: Create a stateful application that uses ESSD disks
Create a StatefulSet that uses ESSD disks in the cluster for subsequent testing. If you already have relevant test resources, you can skip this step.
Create a StatefulSet and mount ESSD disks.
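The following is a minimal sketch of such a StatefulSet. It matches the names used later in this example (a StatefulSet named web with the label app=nginx and two replicas) and assumes that the alicloud-disk-essd StorageClass provided by the CSI plug-in is available in the cluster:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: nginx
  namespace: default
spec:
  clusterIP: None          # headless Service required by the StatefulSet
  selector:
    app: nginx
  ports:
  - port: 80
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
  namespace: default
spec:
  serviceName: nginx
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        volumeMounts:
        - name: disk-essd
          mountPath: /data
  volumeClaimTemplates:     # each replica gets its own ESSD disk volume
  - metadata:
      name: disk-essd
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: alicloud-disk-essd   # assumption: default ESSD StorageClass of the CSI plug-in
      resources:
        requests:
          storage: 20Gi     # ESSD disks require at least 20 GiB
EOF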
Confirm the deployment status of pods in the StatefulSet.
kubectl get pod -o wide -l app=nginx
The example response below shows, in the NODE column, that both pods are scheduled to zone M.
Note: The actual zone placement is determined by the scheduler.
NAME    READY   STATUS    RESTARTS   AGE   IP              NODE                        NOMINATED NODE   READINESS GATES
web-0   1/1     Running   0          2m    192.168.3.243   cn-shanghai.192.168.3.237   <none>           <none>
web-1   1/1     Running   0          2m    192.168.3.246   cn-shanghai.192.168.3.236   <none>           <none>
Step 2: Create a stateful application migration task
Example 1: Cross-zone migration
Create a stateful application migration task.
In the following migration task example, both pods of the StatefulSet are migrated to the cn-shanghai-b zone.
Important: Before migration, make sure that the nodes in the target zone have sufficient resources and that both the zone and the node specifications support ESSD disks.
cat <<EOF | kubectl apply -f -
apiVersion: storage.alibabacloud.com/v1beta1
kind: ContainerStorageOperator
metadata:
  name: migrate-to-b
spec:
  operationType: APPMIGRATE
  operationParams:
    stsName: web
    stsNamespace: default
    stsType: kube
    targetZone: cn-shanghai-b      # Target zone for the migration.
    healthDurationMinutes: "1"     # Wait 1 minute after migration to confirm that the application is running properly.
    snapshotRetentionDays: "-1"    # Retain the newly created snapshots until you delete them in the console.
    retainSourcePV: "true"         # Retain the disks in the original zone and the corresponding PVs.
EOF
Query the status of the migration task.
kubectl describe cso migrate-to-b | grep Status
The expected response is as follows. If SUCCESS is returned, the migration task status is normal.
Status:
  Status:  SUCCESS
Note: If FAILED is returned, the migration task failed. For troubleshooting, see the FAQ section.
Query the deployment status of the two pods in the StatefulSet after the migration.
kubectl get pod -o wide -l app=nginx
The example response below shows that both pods have been migrated to the cn-shanghai.192.168.5.245 node, which is in zone B.
NAME    READY   STATUS    RESTARTS   AGE     IP              NODE                        NOMINATED NODE   READINESS GATES
web-0   1/1     Running   0          2m36s   192.168.5.250   cn-shanghai.192.168.5.245   <none>           <none>
web-1   1/1     Running   0          2m14s   192.168.5.2     cn-shanghai.192.168.5.245   <none>           <none>
Confirm that the migration task meets expectations in the ECS console.
On the Snapshots page, confirm that 2 new snapshots have been created and are permanently retained.
On the Block Storage page, confirm that 2 new disks have been created in zone B after the migration, and that the 2 disks in the original zone M have not been deleted (because retainSourcePV is set to "true" in the migration task).
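You can also verify from within the cluster that the storage claims are now bound to new volumes in the target zone; a sketch, with a placeholder volume name:
kubectl get pvc -n default                                      # the VOLUME column shows the newly created PV names
kubectl get pv <new-pv-name> -o yaml | grep -A 5 nodeAffinity   # the required zone of the new disk appears in the node affinity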
Example 2: Multi-zone spreading
To improve application availability, you need to spread pods and disks across different zones.
Create a stateful application migration task.
In the following migration task example, the 2 pods of the StatefulSet will be spread across zones B and G.
cat <<EOF | kubectl apply -f -
apiVersion: storage.alibabacloud.com/v1beta1
kind: ContainerStorageOperator
metadata:
  name: migrate
spec:
  operationType: APPMIGRATE
  operationParams:
    stsName: web
    stsNamespace: default
    stsType: kube
    targetZone: cn-shanghai-b,cn-shanghai-g   # Target zones for the migration. When multiple zones are configured, the pods are automatically spread across them.
    healthDurationMinutes: "1"                # Wait 1 minute after migration to confirm that the application is running properly.
    snapshotRetentionDays: "-1"               # Retain the newly created snapshots until you delete them in the console.
    retainSourcePV: "true"                    # Retain the disks in the original zone and the corresponding PVs.
EOF
Query the status of the migration task.
kubectl describe cso migrate | grep Status
The expected response is as follows. If SUCCESS is returned, the migration task status is normal.
Status:
  Status:  SUCCESS
Note: If FAILED is returned, the migration task failed. For troubleshooting, see the FAQ section.
Query the deployment status of the two pods in the StatefulSet after the migration.
kubectl get pod -o wide -l app=nginx
The example response below shows that the two pods have been spread across the cn-shanghai.192.168.5.245 node (zone B) and the cn-shanghai.192.168.2.214 node (zone G).
NAME    READY   STATUS    RESTARTS   AGE     IP              NODE                        NOMINATED NODE   READINESS GATES
web-0   1/1     Running   0          4m59s   192.168.2.215   cn-shanghai.192.168.2.214   <none>           <none>
web-1   1/1     Running   0          4m38s   192.168.5.250   cn-shanghai.192.168.5.245   <none>           <none>
Confirm that the migration task meets expectations in the ECS console.
On the Snapshots page, confirm that 2 new snapshots have been created and are permanently retained.
On the Block Storage page, confirm that 2 new disks have been created in zones B and G after the migration, and that the 2 disks in the original zone M have not been deleted (because retainSourcePV is set to "true" in the migration task).
FAQ
If the migration task status is FAILED, you can use the following command to query the failure reason, adjust accordingly, and retry.
kubectl describe cso <ContainerStorageOperator-name> | grep Message -A 1
The example response below indicates that the failure occurred because the storage claim to be migrated could not be found. Possible causes include the application not having any storage mounted, the application already being mounted in the target zone, or the storage claim information being unavailable. Adjust the configuration as needed and retry.
Message:
  Consume: failed to get target pvc, err: no pvc mounted in statefulset or no pvc need to migrated web
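After you fix the cause, one way to retry (assuming a failed task is re-run by deleting and recreating the ContainerStorageOperator resource; the names below are placeholders) is sketched here:
kubectl delete cso <ContainerStorageOperator-name>   # remove the failed task object
kubectl apply -f <corrected-task>.yaml               # hypothetical file containing the corrected task definition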