Container Service for Kubernetes:FAQ about the backup center

Last Updated: Mar 26, 2026

This topic answers common questions about the backup center.

Understand job status

Before troubleshooting, understand what each status means. A Completed status does not guarantee that all resources were processed successfully — always check the Errors and Warnings fields.

  • Completed: The job finished. If resources are missing from the restore cluster, check the Warnings field; resources may have been excluded by configuration or recycled by business logic.

  • PartiallyFailed: The job completed but failed to process some resources. Check the Errors and Warnings fields in the job details.

  • Failed: The job did not complete. Retrieve error details using the methods below.

  • InProgress: The job is still running. If a job stays in this state for an extended period, see Why is my job stuck in InProgress?

Get error details

When a backup job, StorageClass conversion task, or restore job shows Failed or PartiallyFailed, use the following methods to retrieve error details.

Quick view: Hover over Failed or PartiallyFailed in the Status column for a brief error message, such as RestoreError: snapshot cross region request failed.

Full error details: Run the command for your task type to view all events, including detailed error messages.

  • Backup job:

    kubectl -n csdr describe applicationbackup <backup-name>
  • StorageClass conversion task:

    kubectl -n csdr describe converttosnapshot <backup-name>
  • Restore job:

    kubectl -n csdr describe applicationrestore <restore-name>
If you use the backup center with kubectl, upgrade the migrate-controller component to the latest version before troubleshooting. This does not affect existing backups. For more information, see Manage components.

Console issues

Why does the console show "The working component is abnormal" or "Failed to fetch current data"?

The backup center component was not installed correctly. Check whether the component pods are running, as shown below. If they are missing or failing, reinstall the migrate-controller component. For more information, see Manage components.
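
A quick check, assuming the component runs as the migrate-controller Deployment in kube-system and its working pods in the csdr namespace (the same labels and namespaces used elsewhere in this topic):

kubectl -n kube-system get pod -l app=migrate-controller
kubectl -n csdr get pod

If any pod is not Running, describe it to find the cause:

kubectl -n csdr describe pod <pod-name>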

Why does the console show "The name has been used. Change the name and try again"?

When you delete a task, the system creates a deleterequest resource in the cluster and runs a series of deletion operations — not just deleting the backup resource itself. If the deletion fails or is interrupted, some resources may remain in the cluster with the same name, causing this error.

Run the following command to delete the conflicting resource. For example, if the error is deleterequests.csdr.alibabacloud.com "xxxxx-dbr" already exists:

kubectl -n csdr delete deleterequests xxxxx-dbr

Then create the task with a new name.

Why can't I select an existing backup when restoring across clusters?

The most common cause is that the backup vault has not been initialized in the target cluster. On the Restore page, find Backup Vault and click Initialize Backup Vault. After initialization completes, select the backup and restore.

If initialization fails, the backuplocation resource in the target cluster shows an Unavailable status. Run the following command to check:

kubectl get -n csdr backuplocation <backuplocation-name>

Expected output:

NAME                    PHASE       LAST VALIDATED   AGE
<backuplocation-name>   Available   3m36s            38m

If the status is Unavailable, see Why does my job fail with "VaultError: xxx"?.

If the backup vault status is fine, confirm in the source cluster console that the backup job shows Completed. Failed or in-progress backup jobs cannot be selected for cross-cluster restoration.

Why does the console show "The service role required by the current component has not been authorized" (AddonRoleNotAuthorized)?

Starting from migrate-controller 1.8.0, the cloud resource authentication logic for ACK managed clusters was updated. The first time you install or upgrade to this version, the Alibaba Cloud account must complete authorization.

  • If you are logged in with an Alibaba Cloud account, click Authorize.

  • If you are logged in as a RAM user, click Copy Authorization Link and send it to the Alibaba Cloud account holder for authorization.

Why does the console show "The current account has not been granted the cluster RBAC permissions required for this operation" (APISERVER.403)?

The console interacts with the API server to submit and monitor backup and restore jobs. The default permission sets for O&M engineers and developers are missing some permissions required by the backup center.

Grant the following ClusterRole permissions to backup center operators. For instructions, see Use custom RBAC roles to restrict resource operations in a cluster.

kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: csdr-console
rules:
  - apiGroups: ["csdr.alibabacloud.com","velero.io"]
    resources: ['*']
    verbs: ["get","create","delete","update","patch","watch","list","deletecollection"]
  - apiGroups: [""]
    resources: ["namespaces"]
    verbs: ["get","list"]
  - apiGroups: ["storage.k8s.io"]
    resources: ["storageclasses"]
    verbs: ["get","list"]

Why does the backup center component fail to upgrade or uninstall, and the csdr namespace stays in Terminating?

The backup center exited abnormally, leaving jobs in the InProgress state. The finalizers field on these jobs is blocking resource deletion.

Run the following command to identify what is blocking the namespace:

kubectl describe ns csdr

Confirm the stuck jobs are no longer needed and remove their finalizers (see the sketch after this list). After the csdr namespace is deleted:

  • For upgrades, reinstall the migrate-controller component.

  • For uninstalls, the component is now removed.
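
A minimal sketch of clearing the finalizers on a stuck job, assuming the blocking resource reported by kubectl describe ns csdr is an applicationbackup or applicationrestore object (replace the resource type and name with the ones reported):

kubectl -n csdr patch applicationbackup <backup-name> --type=merge -p '{"metadata":{"finalizers":[]}}'
kubectl -n csdr patch applicationrestore <restore-name> --type=merge -p '{"metadata":{"finalizers":[]}}'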

General job failures

Why does my job fail with "internal error"?

The component or an underlying cloud service encountered an unexpected exception. For example, the cloud service may not be available in the current region.

If the error is HBR backup/restore internal error, check the Cloud Backup console to verify that the container backup feature is available in your region.

Why does my job fail with "create cluster resources timeout"?

During a StorageClass conversion or restoration, the backup center creates temporary pods, persistent volume claims (PVCs), and persistent volumes (PVs). If these resources stay in an unavailable state for too long, this timeout error occurs.

  1. Identify the stuck resource:

    kubectl -n csdr describe <applicationbackup/converttosnapshot/applicationrestore> <task-name>

    For example, an output like wait for created tmp pvc default/demo-pvc-for-convert202311151045 for convertion bound time out means that the PVC demo-pvc-for-convert202311151045 in the default namespace failed to bind.

  2. Check the PVC status to find the root cause:

    kubectl -n default describe pvc demo-pvc-for-convert202311151045

Common causes include:

  • Insufficient cluster or node resources.

  • The restore cluster is missing the required storage class. Use the StorageClass conversion feature to select an available storage class before restoring.

  • The underlying storage of the storage class is unavailable — for example, the specified disk type is not supported in the current zone.

  • The Container Network File System (CNFS) associated with alibabacloud-cnfs-nas is abnormal. See Use CNFS to manage NAS file systems (recommended).

  • A storage class with volumeBindingMode: Immediate was selected in a multi-zone cluster.

For storage troubleshooting guidance, see Troubleshoot storage issues.

Why does my job fail with "addon status is abnormal"?

The components in the csdr namespace are abnormal. Check their status:

kubectl get pod -n csdr
kubectl describe pod <pod-name> -n csdr

For resolution steps, see Why is my job stuck in InProgress?.

Why does my job fail with "VaultError: xxx"?

This error means the backup vault cannot reach the Object Storage Service (OSS) bucket. Work through the following checks in order.

1. Verify the OSS bucket exists.

Log on to the OSS console and confirm the bucket associated with the backup vault exists. If it does not, create a new bucket and re-associate it. See Create buckets.

Important

You cannot create a backup vault with the same name as a deleted one, nor associate a vault with an OSS bucket whose name does not follow the cnfs-oss-* format. If you have an existing vault associated with an incorrectly named bucket, create a new vault with a different name and associate it with a cnfs-oss-* bucket.

2. Verify OSS access permissions.

The required steps depend on your cluster type:

  • ACK Pro clusters: No OSS permission configuration is needed if the OSS bucket name starts with cnfs-oss-.

  • ACK dedicated clusters and registered clusters: Configure OSS permissions as described in Install migrate-controller and grant permissions.

  • ACK managed clusters where the component was installed or upgraded to v1.8.0 or later outside the console: run the following command to check whether OSS permissions are configured:

    kubectl get secret -n kube-system | grep addon.aliyuncsmanagedbackuprestorerole.token

    Expected output:

    addon.aliyuncsmanagedbackuprestorerole.token          Opaque                      1      62d

    If the output matches, permissions are in place. If not, grant the required OSS permissions and run the check again.

3. Check the network configuration.

Run the following command to view the configured network mode:

kubectl get backuplocation <backuplocation-name> -n csdr -o yaml | grep network

  • network: internal — the vault accesses OSS over the internal network.

  • network: public — the vault accesses OSS over the Internet. If this causes a timeout, verify that the cluster can access the Internet. See Enable an existing ACK cluster to access the Internet.

The backup vault must use public network access in these scenarios:

  • The cluster and the OSS bucket are in different regions.

  • The cluster is an ACK Edge cluster.

  • The cluster is a registered cluster that is not connected to a VPC via Cloud Enterprise Network (CEN), Express Connect, or VPN — or a route to the internal OSS CIDR block of the region is not configured.

To switch to public network access, run:

kubectl patch -n csdr backuplocation/<backuplocation-name> --type='json' -p \
  '[{"op":"add","path":"/spec/config","value":{"network":"public","region":"<region-id>"}}]'
kubectl patch -n csdr backupstoragelocation/<backuplocation-name> --type='json' -p \
  '[{"op":"add","path":"/spec/config","value":{"network":"public","region":"<region-id>"}}]'

Replace <region-id> with the region of the OSS bucket, such as cn-hangzhou.

Why does my job fail with "HBRError: check HBR vault error"?

Cloud Backup is not activated or lacks the required permissions.

  1. Activate the Cloud Backup service. See Enable Cloud Backup.

  2. For clusters in China (Ulanqab), China (Heyuan), or China (Guangzhou), also authorize Cloud Backup to access API Gateway after activation. See (Optional) Step 3: Authorize the Cloud Backup service to access API Gateway.

  3. For ACK dedicated clusters and registered clusters, verify that Cloud Backup RAM (Resource Access Management) permissions are granted. See Install the migrate-controller backup service component and configure permissions.

Why does my job fail with "HBRError: ... code: 400, Illegal request. Please modify the parameters"?

The ack-backup-data Cloud Backup repository in the region was deleted.

When you first create a backup in a region, the backup center automatically creates an ack-backup-data repository to store backups. If this repository is deleted, subsequent jobs fail with this error.

Important

After the repository is deleted, backups created before the deletion cannot be restored. The following steps only create a new repository for future backups.

  1. In all clusters using the backup center in the affected region, clear the backup vault records:

    kubectl -ncsdr delete backuplocation --all
    kubectl -ncsdr delete backupstoragelocation --all
  2. Return to the cluster and create a new backup. The component automatically creates a new ack-backup-data repository and associates it with the backup vault.

Why does my job fail with "hbr task finished with unexpected status: FAILED, errMsg ClientNotExist"?

The Cloud Backup client (the hbr-client DaemonSet) on a node in the csdr namespace is not running correctly.

  1. Check for abnormal hbr-client pods:

    kubectl -n csdr get pod -lapp=hbr-client
  2. If any pods are in an abnormal state, check whether the cause is insufficient pod IP addresses, memory, or CPU. For pods in CrashLoopBackOff state, view the logs:

    kubectl -n csdr logs -p <hbr-client-pod-name>

    If the logs contain SDKError: StatusCode: 403, Code: MagpieBridgeSlrNotExist, follow (Optional) Step 3: Authorize Cloud Backup to access API Gateway to grant the required permissions.

  3. For other SDK errors, use the EC error code to troubleshoot. See Troubleshoot issues using EC error codes.

Why is my job stuck in InProgress?

Cause 1: Components in the csdr namespace are abnormal

Check whether the components are restarting or failing to start:

kubectl get pod -n csdr
kubectl describe pod <pod-name> -n csdr

If the cause is out-of-memory (OOM):

  • If the affected pod is csdr-velero-* and your restore cluster runs many production namespaces, Velero's Informer Cache may be consuming too much memory. To disable it, add --disable-informer-cache=true to the migrate-controller args:

    Disabling the Informer Cache reduces memory usage but may slow down jobs. Monitor job duration after making this change.

    kubectl -nkube-system edit deploy migrate-controller

    Add the parameter to the container's args:

    containers:
      - name: migrate-controller
        args:
          - --disable-informer-cache=true
  • To increase the memory limit without disabling the cache, run:

    kubectl -n csdr patch deploy <deploy-name> -p '{"spec":{"template":{"spec":{"containers":[{"name":"<container-name>","resources":{"limits":{"memory":"<new-limit-memory>"}}}]}}}}'

    Use csdr-controller for csdr-controller-* pods and csdr-velero for csdr-velero-* pods; check the Deployment spec for the container name.

If the cause is missing Cloud Backup permissions:

  1. Confirm Cloud Backup is activated. If not, activate it at Cloud Backup.

  2. For ACK dedicated clusters and registered clusters, confirm that Cloud Backup permissions are configured. See Install migrate-controller and grant permissions.

  3. Check whether the token required by the Cloud Backup client exists:

    1. Find the node where hbr-client-* is running:

      kubectl get pod <hbr-client-***> -n csdr -owide
    2. Change the node label from true to false:

      kubectl label node <node-name> csdr.alibabacloud.com/agent-enable=false --overwrite
    3. Describe the hbr-client pod:

      kubectl -n csdr describe pod <hbr-client-***>

      If the events show couldn't find key HBR_TOKEN, the token is missing.

    Important

    The token is automatically recreated the next time you run a backup or restore. If you copy a token from another cluster, the hbr-client that starts with it will not work. Delete the copied token and the associated hbr-client-* pod, then repeat the steps above.

Cause 2: Disk snapshot permissions are not configured

If an application has disk volumes mounted and the backup job stays in InProgress, check the VolumeSnapshot resources:

kubectl get volumesnapshot -n <backup-namespace>

If the READYTOUSE field stays false for all VolumeSnapshots, check the following:

  1. In the ECS console, verify that the disk snapshot feature is enabled in the region. If not, enable it. See Enable Snapshots.

  2. Check that the CSI provisioner pod is running:

    kubectl -nkube-system get pod -l app=csi-provisioner
  3. Verify that disk snapshot permissions are configured. The steps depend on your cluster type:

    • ACK managed clusters: In the RAM console, check that the AliyunCSManagedBackupRestoreRole policy includes permissions for hbr:*, ecs:CreateSnapshot, and oss:* actions on acs:oss:*:*:cnfs-oss*. If the role is missing, go to the RAM Quick Authorization page to grant it. The required policy is:

      {
        "Statement": [
          {
            "Effect": "Allow",
            "Action": [
              "hbr:CreateVault",
              "hbr:CreateBackupJob",
              "hbr:DescribeVaults",
              "hbr:DescribeBackupJobs2",
              "hbr:DescribeRestoreJobs",
              "hbr:SearchHistoricalSnapshots",
              "hbr:CreateRestoreJob",
              "hbr:AddContainerCluster",
              "hbr:DescribeContainerCluster",
              "hbr:CancelBackupJob",
              "hbr:CancelRestoreJob",
              "hbr:DescribeRestoreJobs2"
            ],
            "Resource": "*"
          },
          {
            "Effect": "Allow",
            "Action": [
              "ecs:CreateSnapshot",
              "ecs:DeleteSnapshot",
              "ecs:DescribeSnapshotGroups",
              "ecs:CreateAutoSnapshotPolicy",
              "ecs:ApplyAutoSnapshotPolicy",
              "ecs:CancelAutoSnapshotPolicy",
              "ecs:DeleteAutoSnapshotPolicy",
              "ecs:DescribeAutoSnapshotPolicyEX",
              "ecs:ModifyAutoSnapshotPolicyEx",
              "ecs:DescribeSnapshots",
              "ecs:DescribeInstances",
              "ecs:CopySnapshot",
              "ecs:CreateSnapshotGroup",
              "ecs:DeleteSnapshotGroup"
            ],
            "Resource": "*"
          },
          {
            "Effect": "Allow",
            "Action": [
              "oss:PutObject",
              "oss:GetObject",
              "oss:DeleteObject",
              "oss:GetBucket",
              "oss:ListObjects",
              "oss:ListBuckets",
              "oss:GetBucketStat"
            ],
            "Resource": "acs:oss:*:*:cnfs-oss*"
          }
        ],
        "Version": "1"
      }
    • ACK dedicated clusters: In the ACK console, go to the cluster's Cluster Information page, find Master RAM Role, and check the Permission Management tab. If the k8sMasterRolePolicy-Csi-* policy is missing or incomplete, grant the same policy shown above to the Master RAM role.

    • Registered clusters: Only registered clusters where all nodes are Alibaba Cloud Elastic Compute Service (ECS) instances can use the disk snapshot feature. Check whether the required permissions were configured when you installed the CSI storage plug-in. See Configure RAM permissions for the CSI component.

Cause 3: Non-disk volume type

Cross-region restore is only supported for disk volumes (migrate-controller 1.7.7 and later). For other volume types, if you are using a storage service that supports Internet access — such as OSS — create a statically provisioned PV and PVC and restore the application without StorageClass conversion. See Use an ossfs 1.0 statically provisioned volume.
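
For reference, a minimal sketch of a statically provisioned OSS volume pair, assuming the ossfs 1.0 CSI driver (ossplugin.csi.alibabacloud.com) and an existing Secret named oss-secret that stores the AccessKey pair; verify the field names against Use an ossfs 1.0 statically provisioned volume before applying:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: oss-pv
spec:
  capacity:
    storage: 20Gi
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    # volumeHandle must match the PV name
    volumeHandle: oss-pv
    nodePublishSecretRef:
      name: oss-secret
      namespace: default
    volumeAttributes:
      bucket: "<bucket-name>"
      url: "oss-<region-id>.aliyuncs.com"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: oss-pvc
  namespace: default
spec:
  # An empty storageClassName prevents dynamic provisioning and binds to the PV above
  storageClassName: ""
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 20Gi
  volumeName: oss-pv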

Backup failures

Why does my backup fail with "backup already exists in OSS bucket"?

A backup with the same name already exists in the OSS bucket associated with the backup vault. Create the backup with a new name.

A backup may be invisible in the current cluster for several reasons: it belongs to an in-progress or failed job (which are not synchronized across clusters), it was deleted in a different cluster (the file is marked but not removed from OSS), or the current cluster is not associated with the vault that stored it.

Why does my backup fail with "get target namespace failed"?

This usually occurs in scheduled backup jobs. The namespace selection is invalid:

  • If you selected Include, all the selected namespaces have been deleted.

  • If you selected Exclude, the excluded namespaces no longer exist in the cluster.

Update the backup plan to fix the namespace selection.

Why does my backup fail with "velero backup process timeout"?

Two common causes:

OSS bucket storage class: If the bucket's storage class is Archive, Cold Archive, or Deep Cold Archive, the backup center cannot update metadata files (archived files must be restored first). Change the bucket's storage class to Standard. To keep archived data at lower cost, configure a lifecycle rule to automatically convert the storage class. See Convert storage classes.

Subtask timeout: The default timeout for backup subtasks is 60 minutes (migrate-controller 1.7.7 and later). If your cluster has many resources or the API server has high latency, increase this value in the csdr-config ConfigMap:

kubectl edit -n csdr cm csdr-config

Add velero_timeout_minutes to the applicationBackup section. For example, to set a 100-minute timeout:

apiVersion: v1
data:
  applicationBackup: |
    ...
    velero_timeout_minutes: 100

Restart the controller for the change to take effect:

kubectl -n csdr delete pod -l control-plane=csdr-controller

Why does my backup fail with "HBR backup request failed"?

Three possible causes:

Incompatible storage plug-in: If your cluster uses a non-Alibaba Cloud CSI storage plug-in, or the PV is not a standard Kubernetes volume type (such as NFS or LocalVolume), submit a ticket for assistance.

Block mode volume: Cloud Backup does not support volumes whose VolumeMode is Block. If your cluster uses CSI, disk snapshots are used for data backup by default, and disk snapshots support Block mode volumes. If the storage plug-in type is incorrect, switch to CSI, reinstall the backup component, and run the backup again.

Cloud Backup client issue: For file system volumes (OSS, NAS, CPFS, or local), the Cloud Backup client may have timed out or failed. To investigate:

  1. Log on to the Cloud Backup console.

  2. Go to Backup > Container Backup and click the Backup Jobs tab.

  3. Select the region in the top navigation bar.

  4. Search for <backup-name>-hbr to view the job status and failure reason. See Back up ACK clusters.

To query a StorageClass conversion or backup job, search for the backup name.

Why does my backup fail with "hbr task finished with unexpected status: FAILED, errMsg SOURCE_NOT_EXIST"?

For third-party CSI or self-managed storage types (NFS, Ceph):

The backup center uses the standard Kubernetes volume mount path as the data backup path. For standard CSI, the default path is /var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~csi/<pv-name>/mount. If the kubelet root path in your cluster has been changed, Cloud Backup may not find the data.

Log on to the node where the volume is mounted and troubleshoot:

  1. Find the kubelet root path:

    ps -elf | grep kubelet

    • If the startup command includes --root-dir, that value is the kubelet root path.

    • If it includes --config, check the config file for the root-dir field.

    • If neither is present, check /etc/systemd/system/kubelet.service for an EnvironmentFile reference, then check that file for ROOT_DIR.

    • If nothing is found, the kubelet root path is the default /var/lib/kubelet.
  2. Check whether the root path is a symbolic link:

    ls -al <root-dir>

    If the output shows something like kubelet -> /var/lib/container/kubelet, the actual root path is /var/lib/container/kubelet.

  3. Confirm that the target volume's data exists under the root path at <root-dir>/pods/<pod-uid>/volumes.

  4. Set the KUBELET_ROOT_PATH environment variable in the csdr/csdr-controller deployment to the actual kubelet root path.
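
    For example, if the actual root path found above is /var/lib/container/kubelet, set the variable (kubectl set env updates the Deployment and triggers a rolling restart):

    kubectl -n csdr set env deploy/csdr-controller KUBELET_ROOT_PATH=/var/lib/container/kubelet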

For HostPath storage:

HostPath does not create a mount path under the kubelet root path. The backup component cannot read data from the node path by default. Submit a ticket for assistance.

Why does my backup fail with "check backup files in OSS bucket failed", "upload backup files to OSS bucket failed", or "download backup files from OSS bucket failed"?

The OSS server returned an error during a file operation on the backup vault bucket. Three possible causes:

KMS permissions missing: If you enabled server-side encryption with a Key Management Service (KMS) customer master key (CMK) on the OSS bucket, the backup center needs additional permissions. See Does the backup center support KMS encryption for the associated OSS bucket?.

Incomplete OSS permissions: For ACK dedicated clusters and registered clusters, check the permission policy of the RAM user used during component installation. See Step 1: Configure permissions.

Revoked authentication credentials: For ACK dedicated clusters and registered clusters, confirm that the RAM user's authentication credentials are still valid. If they were revoked, get new credentials, update the alibaba-addon-secret Secret in the csdr namespace as sketched below, and then restart the component with the command that follows.
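
A minimal sketch of refreshing the Secret, assuming it stores the AccessKey pair under the access-key-id and access-key-secret keys (check the keys in the existing Secret before overwriting it):

kubectl -n csdr create secret generic alibaba-addon-secret \
  --from-literal=access-key-id=<new-AccessKey-ID> \
  --from-literal=access-key-secret=<new-AccessKey-Secret> \
  --dry-run=client -o yaml | kubectl apply -f -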

kubectl -nkube-system delete pod -lapp=migrate-controller

Why does my backup show PartiallyFailed with "PROCESS velero partially completed"?

Some cluster resources failed to back up. Run the following command to see which resources failed and why:

kubectl -n csdr exec -it $(kubectl -n csdr get pod -l component=csdr | tail -n 1 | cut -d ' ' -f1) -- ./velero describe backup <backup-name>

Check the Errors and Warnings fields in the output and fix the issues. For additional logs:

kubectl -n csdr exec -it $(kubectl -n csdr get pod -l component=csdr | tail -n 1 | cut -d ' ' -f1) -- ./velero backup logs <backup-name>

Why does my backup show PartiallyFailed with "PROCESS hbr partially completed"?

Cloud Backup failed to back up some file system volumes (OSS, NAS, CPFS, or local volumes). This can occur because:

  • The storage plug-in used by some volumes is not supported.

  • Files were deleted during the backup, causing a consistency failure.

To investigate, search for <backup-name>-hbr on the Backup Jobs tab of the Cloud Backup console. Select the correct region and review the job status and failure reason. See Back up ACK clusters.

StorageClass conversion failures

Why does my StorageClass conversion fail with "storageclass xxx not exists"?

The target storage class selected for conversion does not exist in the current cluster.

  1. Reset the StorageClass conversion task:

    cat << EOF | kubectl apply -f -
    apiVersion: csdr.alibabacloud.com/v1beta1
    kind: DeleteRequest
    metadata:
      name: reset-convert
      namespace: csdr
    spec:
      deleteObjectName: "<backup-name>"
      deleteObjectType: "Convert"
    EOF
  2. Create the required storage class in the cluster.

  3. Run the restore job again with StorageClass conversion configured.

Why does my StorageClass conversion fail with "only support convert to storageclass with CSI diskplugin or nasplugin provisioner"?

StorageClass conversion only supports Alibaba Cloud CSI disk and NAS volume types as targets. For other requirements, submit a ticket.

If you are using a storage service that supports public network access (such as OSS), create a statically provisioned PV and PVC and restore the application without StorageClass conversion. See Use an ossfs 1.0 statically provisioned volume.

Why does my StorageClass conversion fail with "current cluster is multi-zoned"?

In a multi-zone cluster, if you convert to a disk-type StorageClass whose volumeBindingMode is Immediate, CSI creates the PV in a fixed zone. Pods cannot be scheduled to a different zone, leaving them in Pending.

  1. Reset the StorageClass conversion task (use the same command as above).

  2. Select the correct storage class:

    • Console: Select alicloud-disk-topology-alltype, which is the default option.

    • Command line: Use alicloud-disk-topology-alltype. Alternatively, set volumeBindingMode to WaitForFirstConsumer to ensure the PV is created in the same zone as the pod (see the sketch after these steps).

  3. Run the restore job again.
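
A minimal sketch of a disk StorageClass with deferred binding for multi-zone clusters, assuming the ESSD disk type (the provisioner name matches the CSI disk plug-in used elsewhere in this topic; adjust the parameters to your environment):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: alicloud-disk-wait
provisioner: diskplugin.csi.alibabacloud.com
parameters:
  type: cloud_essd
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer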

Restore failures

Why does my restore fail with "multi-node writing is only supported for block volume"?

The application to be restored has a volume whose AccessMode is ReadWriteMany or ReadOnlyMany. When restoring to Alibaba Cloud disk storage (which does not support multiple mounts by default), CSI blocks the mount.

This occurs in three scenarios:

  • Older CSI version or FlexVolume in the backup cluster: Earlier CSI versions did not check AccessModes during mounting. The volume fails when restored to a cluster with a newer CSI version. Starting from v1.8.4, the backup component automatically converts disk volume AccessModes to ReadWriteOnce. Upgrade the component and restore again.

  • Missing storage class in the restore cluster: The volume is matched to an Alibaba Cloud disk volume by default. Create a storage class with the same name in the restore cluster before restoring, or use StorageClass conversion to specify the target.

  • Manual StorageClass conversion to disk: Add the convertToAccessModes parameter to convert AccessModes to ReadWriteOnce. See convertToAccessModes.

Why does my restore fail with "only disk type PVs support cross-region restore in current version"?

Cross-region restore is only supported for disk volumes (migrate-controller 1.7.7 and later). For other storage types that support Internet access (such as OSS), create a statically provisioned PV and PVC and restore the application. See Use an ossfs 1.0 statically provisioned volume.

Why does my restore fail with "ECS snapshot cross region request failed"?

Cross-region restore for disk volumes requires ECS disk snapshot permissions, which are not granted by default for all cluster types.

For ACK dedicated clusters and registered clusters connected to self-managed Kubernetes deployed on ECS instances, grant ECS disk snapshot permissions. See Registered cluster.

Why does my restore fail with "accessMode of PVC xxx is xxx"?

The disk volume being restored has an AccessMode of ReadOnlyMany or ReadWriteMany. CSI enforces the following rules:

  • Only volumes with multiAttach enabled can be mounted to multiple instances.

  • Volumes with VolumeMode: Filesystem (ext4 or xfs) can only be mounted to multiple instances in read-only mode.

Two recommended approaches:

  • If you are converting a multi-mount volume (such as OSS or NAS) to disk storage, create a new restore job and select alibabacloud-cnfs-nas as the target for StorageClass conversion. This uses a CNFS-managed NAS volume, which supports multiple mounts. See Use CNFS to manage NAS file systems (recommended).

  • If the backed-up disk PV does not meet current CSI requirements (backed up when CSI version was lower and AccessMode detection was not enforced), migrate your workloads to use dynamically provisioned disk volumes to avoid forced disk detachment during scheduling.

My restore status is Completed, but some resources are missing. Why?

A Completed status does not guarantee all resources were restored. Check each possible cause:

The resource was not backed up. Run the following command to inspect the backup:

kubectl -n csdr exec -it $(kubectl -n csdr get pod -l component=csdr | tail -n 1 | cut -d ' ' -f1) -- ./velero describe backup <backup-name> --details

Cluster-level resources of pods running in namespaces not selected for backup are not backed up by default. For cluster-level backup configuration, see Cluster-level backup.

The resource was excluded during restoration. Check whether the namespace, resource type, or other filters in the restore job excluded the resource, and re-run the restore.

The restore subtask partially failed. Run the following command to identify failures:

kubectl -n csdr exec -it $(kubectl -n csdr get pod -l component=csdr | tail -n 1 | cut -d ' ' -f1) -- ./velero describe restore <restore-name>

Fix the issues in the Errors and Warnings fields.

The resource was recycled after creation. Check the audit logs for the resource to determine whether it was deleted after being created, due to ownerReferences or business logic.

Other questions

The migrate-controller component in a FlexVolume cluster cannot start

The migrate-controller component does not support FlexVolume clusters. Migrate from FlexVolume to CSI before using the backup center.

To back up applications in a FlexVolume cluster and restore them to a CSI cluster during migration, see Use the backup center to migrate applications in a Kubernetes cluster that runs an older version.

Can I modify a backup vault?

No. To make changes, delete the current backup vault and create a new one with a different name.

Because a backup vault is a shared resource that may be in active use at any time, modifying its parameters risks data inaccessibility during ongoing backups or restores. You also cannot create a new vault with the same name as a deleted one.

Can I use an OSS bucket whose name is not in the "cnfs-oss-\*" format?

For clusters other than ACK dedicated clusters and registered clusters, the backup center has read/write access to OSS buckets named in the cnfs-oss-* format by default. Using a differently named bucket requires additional configuration.

  1. Configure OSS permissions for the component. See ACK dedicated cluster.

  2. Restart the backup service component:

    kubectl -n csdr delete pod -l control-plane=csdr-controller
    kubectl -n csdr delete pod -l component=csdr

After creating a vault with a non-standard bucket name, wait for the connectivity check to complete (approximately five minutes) before starting backup or restore operations. Check the vault status:

kubectl -n csdr get backuplocation

Expected output:

NAME                    PHASE       LAST VALIDATED   AGE
a-test-backuplocation   Available   7s               6d1h

How do I specify the backup schedule when creating a backup plan?

The backup schedule supports two formats:

  • Cron expression: For example, 1 4 * * * runs a backup at 4:01 AM every day.

  • Interval: For example, 6h30m runs a backup every 6 hours and 30 minutes.

Cron format reference:

 *  *  *  *  *
 |  |  |  |  |
 |  |  |  |  ·----- day of week (0 - 6, Sun to Sat)
 |  |  |  ·-------- month (1 - 12)
 |  |  ·----------- day of month (1 - 31)
 |  ·-------------- hour (0 - 23)
 ·----------------- minute (0 - 59)

Example: 0 2 15 * * runs a backup at 2:00 AM on the 15th of each month.

What changes does a restore job make to backed-up YAML resources?

Restore jobs make the following automatic adjustments:

Disk volume size: If a disk volume is smaller than 20 GiB, the size is increased to 20 GiB.

Services: Restored based on the Service type:

  • NodePort Services: Service ports are retained by default during cross-cluster restoration.

  • LoadBalancer Services:

    • When ExternalTrafficPolicy is Local, HealthCheckNodePort uses a random port. To retain the original port, set spec.preserveNodePorts: true in the restore job.

    • If the Service uses an existing Server Load Balancer (SLB) instance from the backup cluster, the restored Service uses the same SLB instance but disables its listeners. Configure the listeners in the SLB console.

    • If the SLB instance is managed by Cloud Controller Manager (CCM) in the backup cluster, CCM creates a new SLB instance. See Considerations for configuring a LoadBalancer Service.

How do I view the resources in a backup?

Cluster application backups:

Run the following commands in a cluster that has the backup files synchronized. The first command returns the Velero pod name; substitute it for csdr-velero-xxx in the second command:

kubectl -n csdr get pod -l component=csdr | tail -n 1 | cut -d ' ' -f1
kubectl -n csdr exec -it csdr-velero-xxx -c velero -- ./velero describe backup <backup-name> --details

Or use the ACK console: go to your cluster > Operations > Application Backup > Backup Records, then click a backup record.

Disk volume backups:

In the ECS console, go to Storage & Snapshots > Snapshots and query snapshots by disk ID.

Non-disk volume backups:

In the Cloud Backup console, go to Backup > Container Backup and select the region. The Clusters tab lists backed-up clusters and their PVCs. The Backup Jobs tab shows job status.

If Client Status is abnormal, Cloud Backup is not running correctly in the ACK cluster. Go to the DaemonSets page in the ACK console to troubleshoot.

Can I back up from an older Kubernetes version and restore to a newer version?

Yes. By default, all API versions supported by a resource are backed up. For example, a Deployment in Kubernetes 1.16 supports extensions/v1beta1, apps/v1beta1, apps/v1beta2, and apps/v1. All four versions are stored in the backup vault regardless of which version was used to create it. The KubernetesConvert feature handles API version conversion during restore.

When restoring, the API version recommended by the restore cluster is used. For example, restoring to Kubernetes 1.28 uses apps/v1 for Deployments.

Important

If no API version is shared between the source and target clusters, deploy the resource manually. For example, Ingresses in Kubernetes 1.16 clusters use extensions/v1beta1 and networking.k8s.io/v1beta1, which are not supported in Kubernetes 1.22 and later (only networking.k8s.io/v1 is supported). For API version migration details, see the Kubernetes deprecation guide. Avoid migrating from newer to older Kubernetes versions, and avoid migrating from versions earlier than 1.16 to newer versions.

Is traffic automatically switched to SLB instances during restoration?

No. After restoration, SLB listeners are either disabled or a new SLB instance is created (depending on the original configuration). Traffic is not automatically switched.

If you use other service discovery mechanisms and want to control when traffic switches, exclude Service resources during backup and deploy them manually when you are ready to switch.

Why aren't resources in csdr, ack-csi-fuse, kube-system, kube-public, and kube-node-lease backed up by default?

  • csdr: This is the backup center's own namespace. Backing it up directly would cause components to fail in the restore cluster. Backup synchronization is handled automatically — you do not need to migrate backups manually.

  • ack-csi-fuse: This namespace runs FUSE client pods maintained by CSI. The CSI in the new cluster automatically synchronizes these clients during storage restoration.

  • kube-system, kube-public, kube-node-lease: These are Kubernetes system namespaces. Due to differences in cluster parameters and configurations, restoring these namespaces across clusters is not supported. Before running a restore job, install and configure system components in the restore cluster manually — for example, the Container Registry password-free image pulling component (acr-configuration) and the ALB Ingress component (ALBConfig).

Does the backup center use ECS disk snapshots for disk volumes? What is the default snapshot type?

The backup center uses ECS disk snapshots by default in the following scenarios:

  • The cluster is an ACK managed or dedicated cluster.

  • The cluster runs Kubernetes 1.18 or later and uses CSI 1.18 or later.

In other scenarios, Cloud Backup is used for disk data backup.

ECS disk snapshots created by the backup center have the instant access feature enabled by default. The snapshot validity period matches the validity period specified in the backup configuration. Starting from October 12, 2023, 11:00, Alibaba Cloud no longer charges for snapshot instant access storage or operations in any region. See Use the instant access feature.

Why is the ECS disk snapshot validity period different from what I specified in the backup configuration?

The validity period configuration depends on the csi-provisioner component. If csi-provisioner is older than version 1.20.6, VolumeSnapshots are created without the validity period or instant access settings — so the backup configuration does not affect disk snapshots.

Upgrade csi-provisioner to 1.20.6 or later to ensure the validity period is applied correctly.

If upgrading is not possible, configure a default snapshot validity period instead:

  1. Update migrate-controller to v1.7.10 or later.

  2. Check whether a VolumeSnapshotClass with the 30-day retention setting exists:

    kubectl get volumesnapshotclass csdr-disk-snapshot-with-default-ttl
  3. If it does not exist, or if it exists but retentionDays is not set to 30, apply the following:

    apiVersion: snapshot.storage.k8s.io/v1
    deletionPolicy: Retain
    driver: diskplugin.csi.alibabacloud.com
    kind: VolumeSnapshotClass
    metadata:
      name: csdr-disk-snapshot-with-default-ttl
    parameters:
      retentionDays: "30"

All disk volume backups in the cluster will then create snapshots with the retention period set in retentionDays.

What is volume data backup, and when do I need it?

What it does: Volume data backup copies volume data to cloud storage using ECS disk snapshots or Cloud Backup. When you restore the application, a new disk or NAS file system is created from this copy. The restored application and the original application have independent data — changes in one do not affect the other.

If you do not need data isolation or if your storage already provides cross-zone or cross-region access (such as OSS), skip volume data backup. PVC and PV YAML files are still included in the application backup by default.

When to use it:

  • Disaster recovery or versioned data records.

  • The application uses disk volumes (basic disks can only be mounted to a single node).

  • Cross-region backup and restore is required (most storage types other than OSS do not support cross-region access).

  • Data isolation between the original and restored applications is required.

  • The storage plug-ins or versions differ significantly between the backup and restore clusters, making direct YAML restoration impractical.

Risks of not backing up volumes for stateful applications:

  • Volumes with `Delete` reclaim policy: CSI creates a new, empty PV during restore. Statically provisioned volumes without a matching storage class stay in Pending until you manually create a PV or storage class.

  • Volumes with `Retain` reclaim policy: Resources are restored in PV-first order. For multi-mount storage (NAS, OSS), the original file system or bucket is reused. For disks, there is a risk of forced disk detachment.

To check the reclaim policy of your volumes:

kubectl get pv -o=custom-columns=CLAIM:.spec.claimRef.name,NAMESPACE:.spec.claimRef.namespace,NAME:.metadata.name,RECLAIMPOLICY:.spec.persistentVolumeReclaimPolicy

How do I select nodes for file system backups in data protection?

By default, Cloud Backup jobs can run on any node except virtual nodes. Only one volume backup job runs on a node at a time.

Three node scheduling policies are available:

  • exclude (default): All nodes are eligible. Add csdr.alibabacloud.com/agent-excluded="true" to exclude specific nodes.

  • include: Only labeled nodes are eligible. Add csdr.alibabacloud.com/agent-included="true" to enable specific nodes.

  • prefer: All nodes are eligible. Nodes with csdr.alibabacloud.com/agent-included="true" are preferred; nodes with csdr.alibabacloud.com/agent-excluded="true" are used last.

To label a node:

# Exclude a node
kubectl label node <node-name> csdr.alibabacloud.com/agent-excluded="true"

# Include a node (for the include policy)
kubectl label node <node-name> csdr.alibabacloud.com/agent-included="true"

To change the policy, edit the csdr-config ConfigMap:

kubectl -n csdr edit cm csdr-config

Add node_schedule_policy to the applicationBackup section:

apiVersion: v1
data:
  applicationBackup: |
    backup_max_worker_num: 15
    restore_max_worker_num: 5
    delete_max_worker_num: 30
    schedule_max_worker_num: 20
    convert_max_worker_num: 15
    node_schedule_policy: include  # Valid values: include, exclude, prefer
  pvBackup: |
    batch_snapshot_max_num: 20
    enable_ecs_snapshot: "true"
kind: ConfigMap

Restart the controller for the change to take effect:

kubectl -n csdr delete pod -lapp=csdr-controller

What are the differences between application backup and data protection?

Application backup backs up Kubernetes workloads, including applications, Services, and configuration files in namespaces. You can optionally include the data of mounted volumes. Use application backup to migrate applications between clusters or restore applications for disaster recovery.

Application backup does not back up volumes that are not mounted to pods. To back up all volumes, create data protection backup jobs.

Data protection backs up storage volumes — PVCs and PVs — independently of application workloads. Use data protection to restore a deleted PVC as a standalone operation, or to implement data replication and disaster recovery at the storage layer.

How do I exclude specific persistent volumes from backup and recovery?

Some volumes, such as log storage or high-availability storage (OSS), do not need to be backed up. Here is a workflow for backing up a namespace that contains Volume A (no backup needed) and Volume B (backup needed):

Backup flow:

  1. Use data protection to back up Volume B — this backs up both the YAML and the data.

  2. Use application backup to select the namespace with Backup Volume set to Disable. This backs up the YAML files for both Volume A and Volume B without copying data.

    If you do not want Volume A restored in the target cluster at all, add pvc, pv to Excluded Resources in the advanced configuration.

Restore flow:

  1. Restore the data protection backup in the target cluster. This restores the YAML and data of Volume B.

  2. Restore the application backup in the target cluster. This restores the YAML of Volume A and all other application resources. CSI then creates a new storage source or reuses the existing one based on Volume A's reclaim policy.

After both restores complete, the application, Volume A, and Volume B (with its data) are all running in the target cluster.

Does the backup center support KMS encryption for the associated OSS bucket?

The backup center supports server-side encryption for OSS buckets. Enable it in the OSS console. See Server-side encryption.

If you use a KMS-managed CMK (bring your own key, or BYOK) with a specified CMK ID, grant the backup center permissions to access KMS:

  1. Create a custom permission policy:

    {
      "Version": "1",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": [
            "kms:List*",
            "kms:DescribeKey",
            "kms:GenerateDataKey",
            "kms:Decrypt"
          ],
          "Resource": [
            "acs:kms:*:141661496593****:*"
          ]
        }
      ]
    }

    This grants access to all KMS keys under the Alibaba Cloud account. For more granular resource control, see Authorization information.

  2. Attach the policy to the RAM role or RAM user that the backup center uses: for ACK managed clusters, the AliyunCSManagedBackupRestoreRole role; for ACK dedicated clusters and registered clusters, the RAM identity configured during component installation.

If you use a KMS key managed by OSS or a key fully managed by OSS, no additional permissions are needed.

How do I change the container images used during restoration?

Change the image registry address:

For hybrid cloud deployments or migrations from on-premises to cloud, use the imageRegistryMapping field to remap image registry addresses. For example, to change docker.io/library/ to registry.cn-beijing.aliyuncs.com/my-registry/:

docker.io/library/: registry.cn-beijing.aliyuncs.com/my-registry/

Change the image repository or version:

Create a ConfigMap in the csdr namespace before running the restore:

apiVersion: v1
kind: ConfigMap
metadata:
  name: <configuration-name>
  namespace: csdr
  labels:
    velero.io/plugin-config: ""
    velero.io/change-image-name: RestoreItemAction
data:
  "case1": "app1:v1,app2:v2"
  # Change only the repository: "case1": "app1,app2"
  # Change only the version: "case1": "v1,v2"
  # Change a specific registry image: "case1": "docker.io/library/app1:v1,registry.cn-beijing.aliyuncs.com/my-registry/app2:v2"

For multiple changes, add case2, case3, and so on to the data field. After creating the ConfigMap, run the restore job with the imageRegistryMapping field left blank.

These changes apply to all restore jobs in the cluster. Use specific patterns (such as limiting to a particular registry) to avoid unintended changes. Delete the ConfigMap when it is no longer needed.