
Container Service for Kubernetes:FAQ about the backup center

Last Updated: May 27, 2025

This topic provides answers to some frequently asked questions about the backup center.

Table of contents

The issues in this topic are grouped into the following categories:

  • Obtaining error messages (common operations)

  • Console

  • General

  • Backup

  • StorageClass conversion (optional step during restoration)

  • Restore

  • Others

Common operations

Note

If you use the backup center with kubectl, upgrade the migrate-controller component to the latest version before you troubleshoot issues. The upgrade does not affect existing backups. For more information about how to upgrade the component, see Manage components.

When the status of a backup task, StorageClass conversion task, or restore task is Failed or PartiallyFailed, you can obtain error messages by using the following methods:

  • Move the pointer over Failed or PartiallyFailed in the Status column to view a brief error message, such as RestoreError: snapshot cross region request failed.

  • To obtain more detailed error messages, run the following commands to query the events of the task. The events contain detailed errors such as RestoreError: process advancedvolumesnapshot failed avs: snapshot-hz, err: transition canceled with error: the ECS-snapshot related ram policy is missing.

    • Backup task

      kubectl -n csdr describe applicationbackup <backup-name> 
    • StorageClass conversion task

      kubectl -n csdr describe converttosnapshot <backup-name>
    • Restore task

      kubectl -n csdr describe applicationrestore <restore-name>

The console displays "The working component is abnormal" or "Failed to fetch current data"

Issue

The console displays The Working Component Is Abnormal or Failed To Fetch Current Data.

Cause

The installation of the backup center component is abnormal.

Solution

  • Check whether nodes that belong to the cluster exist. If nodes that belong to the cluster do not exist, the backup center cannot be deployed.

  • Check whether the cluster uses FlexVolume. If the cluster uses FlexVolume, switch to CSI. For more information, see The migrate-controller component in a cluster that uses FlexVolume cannot be launched.

  • If you use the backup center with kubectl, check whether the YAML configurations are correct. For more information, see Use kubectl to back up and restore applications.

  • If your cluster is an ACK dedicated cluster or a registered cluster, check whether the required permissions are granted. For more information, see ACK dedicated cluster and Registered cluster.

  • Check whether the csdr-controller and csdr-velero Deployments in the csdr namespace fail to be deployed due to resource or scheduling limits. If yes, fix the issue.
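A quick check of these Deployments and their pods could look like the following commands (a minimal sketch based on the Deployment names listed above):

kubectl -n csdr get deploy csdr-controller csdr-velero
kubectl -n csdr get pod
kubectl -n csdr describe pod <abnormal-pod-name>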

The console displays the following error: The name is already used. Change the name and try again

Issue

When you create or delete a backup task, StorageClass conversion task, or restore task, the console displays The Name Is Already Used. Change The Name And Try Again.

Cause

When you delete a task in the console, a deleterequest resource is created in the cluster. The working component performs a series of deletion operations, not just deleting the corresponding backup resource. The same applies to command line operations. For more information, see Use kubectl to back up and restore applications.

If the deletion operation is not performed correctly or an error occurs during the processing of the deleterequest resource, some resources in the cluster cannot be deleted. In this case, an error message indicating that a resource with the same name already exists is returned.
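To check which deleterequest resources remain in the cluster, a command like the following could be used:

kubectl -n csdr get deleterequests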

Solution

  • Delete the resources with the same name as prompted. For example, if the error message deleterequests.csdr.alibabacloud.com "xxxxx-dbr" already exists is returned, run the following command to delete the resource:

    kubectl -n csdr delete deleterequests xxxxx-dbr
  • Create a task with a new name.

I cannot select an existing backup when I restore an application across clusters

Issue

I cannot select a backup task when I restore an application across clusters.

Cause

  • Cause 1: The backup vault is not associated with the current cluster, which means that the backup vault is not initialized.

    The system initializes the backup vault and synchronizes the basic information about the backup vault, including the Object Storage Service (OSS) bucket information, to the cluster. Then, the system initializes the backup files from the backup vault in the cluster. You can select a backup file from the backup vault for restoration only after the initialization is complete.

  • Cause 2: The initialization of the backup vault fails. The status of the backuplocation resource in the current cluster is Unavailable.

  • Cause 3: The backup task is not complete or the backup task fails.

Solution

  • Solution 1:

On the Create Restore Task page, click Initialize Vault on the right side of Backup Vault. After the backup vault is initialized, select the task that you want to restore.

  • Solution 2:

Run the following command to check the status of the backuplocation resource:

kubectl get -n csdr backuplocation <backuplocation-name> 

Expected result:

NAME                    PHASE       LAST VALIDATED   AGE
<backuplocation-name>   Available   3m36s            38m

If the status is Unavailable, see the solution in The status of the task is Failed and the "VaultError: xxx" error is returned.

  • Solution 3:

In the console of the backup cluster, check whether the backup task is successful, which means that the status of the backup task is Completed. If the status of the backup task is abnormal, troubleshoot the issue. For more information, see Table of contents.
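If you prefer to use kubectl, you can also run the following command in the backup cluster to check the status of the backup task:

kubectl -n csdr get applicationbackup <backup-name>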

The console displays "The service role required by the current component has not been authorized"

Issue

When you access the application backup console, the console displays "The service role required by the current component has not been authorized" and the error code AddonRoleNotAuthorized is returned.

Cause

The cloud resource authentication logic of the migrate-controller component in ACK managed clusters is optimized in migrate-controller 1.8.0. When you install or upgrade the component to this version for the first time, the Alibaba Cloud account must complete cloud resource authorization.

Solution

  • If you are logged on with an Alibaba Cloud account, click Authorize to complete the authorization.

  • If you are logged on with a RAM user, click Copy Authorization Link and send the link to the Alibaba Cloud account to complete the authorization.

The console displays "The current account has not been granted the cluster RBAC permissions required for this operation"

Issue

When you access the application backup console, the console displays "The current account has not been granted the cluster RBAC permissions required for this operation. Contact the primary account or permission administrator for authorization." The error code is APISERVER.403.

Cause

The console interacts with the API server to submit backup and restore tasks and obtain real-time task status. The default permission list for cluster O&M personnel and developers lacks some permissions required by the backup center component. The primary account or permission administrator needs to grant these permissions.

Solution

Refer to Use custom RBAC roles to restrict resource operations in a cluster and grant the following ClusterRole permissions to backup center operators:

kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: csdr-console
rules:
  - apiGroups: ["csdr.alibabacloud.com","velero.io"]
    resources: ["*"]
    verbs: ["get","create","delete","update","patch","watch","list","deletecollection"]
  - apiGroups: [""]
    resources: ["namespaces"]
    verbs: ["get","list"]
  - apiGroups: ["storage.k8s.io"]
    resources: ["storageclasses"]
    verbs: ["get","list"]
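For example, the following ClusterRoleBinding is a minimal sketch of how the csdr-console ClusterRole could be bound to a backup center operator. The subject name is a placeholder; use the identity that your cluster maps the operator to.

kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: csdr-console-binding
subjects:
  # Placeholder subject; replace with the identity of the backup center operator
  - kind: User
    name: "<operator-identity>"
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: csdr-console
  apiGroup: rbac.authorization.k8s.io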

The backup center component fails to be upgraded or uninstalled

Issue

The backup center component fails to be upgraded or uninstalled, and the csdr namespace remains in the Terminating state.

Cause

The backup center component exits abnormally during operation, leaving tasks in the InProgress state in the csdr namespace. The finalizers field of these tasks may prevent resources from being deleted smoothly, causing the csdr namespace to remain in the Terminating state.

Solution

  • Run the following command to check why the csdr namespace is in the Terminating state:

    kubectl describe ns csdr

    Confirm that the stuck tasks are no longer needed and delete their corresponding finalizers, as shown in the sketch after this list.

  • After confirming that the csdr namespace is deleted:

    • For component upgrade scenarios, you can reinstall the migrate-controller component of the backup center.

    • For component uninstallation scenarios, the component should already be uninstalled.
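The following commands are a minimal sketch of how the stuck tasks and their finalizers could be cleared. The resource types follow the task resources named earlier in this topic; use the type of the stuck task that you found:

kubectl -n csdr get applicationbackups,converttosnapshots,applicationrestores
kubectl -n csdr patch applicationbackup <task-name> --type merge -p '{"metadata":{"finalizers":[]}}'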

The status of the task is failed and the "internal error" error is returned

Issue

The status of the task is Failed and the "internal error" error is returned.

Cause

The component or underlying cloud service encounters an unexpected exception, such as when the cloud service is not available in the current region.

Solution

If the error message is "HBR backup/restore internal error", check whether the container backup feature is available in the Cloud Backup console.

For more questions of this type, please submit a ticket for processing.

The status of the task is failed and the "create cluster resources timeout" error is returned

Issue

The status of the task is Failed and the "create cluster resources timeout" error is returned.

Cause

During StorageClass conversion or restoration, temporary pods, persistent volume claims (PVCs), and persistent volumes (PVs) may be created. If these resources remain unavailable for a long time after they are created, the "create cluster resources timeout" error is returned.

Solution

  1. Run the following command to locate the abnormal resource and find the cause based on the events:

    kubectl -n csdr describe <applicationbackup/converttosnapshot/applicationrestore> <task-name> 

    Expected result:

    ……wait for created tmp pvc default/demo-pvc-for-convert202311151045 for convertion bound time out

    This indicates that the PVC used for StorageClass conversion remains unbound for a long time. The PVC is in the default namespace and is named demo-pvc-for-convert202311151045.

  2. Run the following command to check the status of the PVC and identify the cause of the issue:

    kubectl -ndefault describe pvc demo-pvc-for-convert202311151045 

    The following are common causes of issues in the backup center. For more information, see Storage troubleshooting.

    • The cluster or node resources are insufficient or abnormal.

    • The restore cluster does not have the corresponding StorageClass. Use the StorageClass conversion feature to select an existing StorageClass in the restore cluster and then restore the application.

    • The underlying storage associated with the StorageClass is unavailable. For example, the specified disk type is not supported in the current zone.

    • The Container Network File System (CNFS) associated with alibabacloud-cnfs-nas is abnormal. For more information, see Use CNFS to manage NAS file systems (recommended).

    • When you restore applications in a multi-zone cluster, you select a StorageClass whose volumeBindingMode is set to Immediate.
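    To check the volumeBindingMode of a StorageClass before you restore applications in a multi-zone cluster, you can run a command similar to the following:

      kubectl get storageclass <storageclass-name> -o jsonpath='{.volumeBindingMode}'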

The status of the task is failed and the "addon status is abnormal" error is returned

Issue

The status of the task is Failed and the "addon status is abnormal" error is returned.

Cause

The components in the csdr namespace are abnormal.

Solution

See Cause 1 and solution: The components in the csdr namespace are abnormal.

The status of the task is failed and the "VaultError: xxx" error is returned

Issue

The status of the backup, restore, or StorageClass conversion task is Failed and the error message VaultError: backup vault is unavailable: xxx is returned.

Cause

  • The specified OSS bucket does not exist.

  • The cluster does not have the permissions to access OSS.

  • The network of the OSS bucket is unreachable.

Solution

  1. Log on to the OSS console and check whether the OSS bucket associated with the backup vault exists.

    If the OSS bucket does not exist, create a bucket and re-associate it. For more information, see Create buckets.

  2. Check whether the cluster has the permissions to access OSS.

    • ACK Pro cluster: You do not need to configure OSS permissions. Make sure that the name of the OSS bucket associated with the backup vault of the cluster is in the cnfs-oss-* format.

    • ACK dedicated cluster and registered cluster: You must configure OSS permissions. For more information, see Install migrate-controller and grant permissions.

    For ACK managed clusters in which the component has not been installed or upgraded to v1.8.0 or later by using the console, OSS-related permissions may be missing. You can run the following command to check whether the cluster has the permissions to access OSS:

    kubectl get secret -n kube-system | grep addon.aliyuncsmanagedbackuprestorerole.token

    Expected result:

    addon.aliyuncsmanagedbackuprestorerole.token          Opaque                      1      62d

    If the returned content is the same as the preceding expected output, the cluster has permissions to access OSS. You only need to specify an OSS bucket that is named in the cnfs-oss-* format for the cluster.

    If the returned content is not the same as the expected output, use one of the following methods to grant permissions:

    • Refer to the ACK dedicated cluster and registered cluster section to configure OSS permissions. For more information, see Install migrate-controller and grant permissions.

    • Use an Alibaba Cloud account to click Authorize to complete the authorization. You need to perform this operation only once for each Alibaba Cloud account.

    Note

    You cannot create a backup vault that uses the same name as a deleted one. You cannot associate a backup vault with an OSS bucket whose name is not in the cnfs-oss-* format. If you have associated a backup vault with an OSS bucket whose name is not in the cnfs-oss-* format, create a backup vault with a different name and associate it with an OSS bucket whose name is in the cnfs-oss-* format.

  3. Run the following command to check the network configurations of the cluster:

    kubectl get backuplocation <backuplocation-name> -n csdr -o yaml | grep network

    The output is similar to the following content:

    network: internal
    • If the value of network is internal, the backup vault accesses the OSS bucket over the internal network.

    • If the value of network is public, the backup vault accesses the OSS bucket over the Internet. If the backup vault accesses the OSS bucket over the Internet and the error message indicates a timeout, check whether the cluster can access the Internet. For more information, see Enable an existing ACK cluster to access the Internet.

    In the following scenarios, the backup vault must access the OSS bucket over the Internet:

    • The cluster and the OSS bucket are deployed in different regions.

    • The current cluster is an ACK Edge cluster.

    • The current cluster is a registered cluster and is not connected to a virtual private cloud (VPC) by using Cloud Enterprise Network (CEN), Express Connect, or VPN Gateway. Alternatively, the cluster is connected to a VPC but no route is configured to point to the internal OSS endpoint of the region. You must configure a route to point to the internal OSS endpoint of the region.

    If the backup vault must access the OSS bucket over the Internet, run the following commands to change the access method to Internet access. In the following code, <backuplocation-name> specifies the name of the backup vault and <region-id> specifies the region where the OSS bucket is deployed, such as cn-hangzhou.

    kubectl patch -n csdr backuplocation/<backuplocation-name> --type='json' -p   '[{"op":"add","path":"/spec/config","value":{"network":"public","region":"<region-id>"}}]'
    kubectl patch -n csdr backupstoragelocation/<backuplocation-name> --type='json' -p   '[{"op":"add","path":"/spec/config","value":{"network":"public","region":"<region-id>"}}]'

The status of the task is Failed and the "HBRError: check HBR vault error" error is returned

Issue

The status of the backup, restore, or StorageClass conversion task is Failed and the "HBRError: check HBR vault error" error is returned.

Cause

Cloud Backup is not activated or does not have the required permissions.

Solution

  1. Check whether Cloud Backup is activated. For more information, see Activate Cloud Backup.

  2. If your cluster resides in China (Ulanqab), China (Heyuan), or China (Guangzhou), you must grant Cloud Backup the permissions to access API Gateway after you activate Cloud Backup. For more information, see Step 3 (optional): Authorize Cloud Backup to access API Gateway .

  3. If your cluster is an ACK dedicated cluster or registered cluster, make sure that the Resource Access Management (RAM) user you use has the permissions to access Cloud Backup. For more information about how to perform the authorization, see Install migrate-controller and grant permissions.

The status of the task is Failed and the "hbr task finished with unexpected status: FAILED, errMsg ClientNotExist" error is returned

Issue

The status of the backup, restore, or StorageClass conversion task is Failed and the error message hbr task finished with unexpected status: FAILED, errMsg ClientNotExist is returned.

Cause

The Cloud Backup client is abnormally deployed on the corresponding node, which means that the replica of the hbr-client DaemonSet on the node in the csdr namespace is abnormal.

Solution

  1. Run the following command to check whether abnormal hbr-client pods exist in the cluster:

    kubectl -n csdr get pod -lapp=hbr-client
  2. If pods are in an abnormal state, first check whether the issue is caused by insufficient pod IP addresses, memory, or CPU resources. If the status of a pod is CrashLoopBackOff, run the following command to view the logs of the pod:

    kubectl -n csdr logs -p <hbr-client-pod-name>

    If the output contains "SDKError:\n StatusCode: 403\n Code: MagpieBridgeSlrNotExist\n Message: code: 403, AliyunServiceRoleForHbrMagpieBridge doesn't exist, please create this role. ", see Step 3 (optional): Authorize Cloud Backup to access API Gateway to grant permissions to Cloud Backup.

  3. If the log output contains other types of SDK errors, submit a ticket for processing.

The task remains in the InProgress state for a long period of time

Cause 1 and solution: The components in the csdr namespace are abnormal

Check the status of the components and identify the cause of the anomaly.

  1. Run the following command to check whether the components in the csdr namespace are restarting or cannot be started:

    kubectl get pod -n csdr
  2. Run the following command to check why the components are restarting or cannot be started:

    kubectl describe pod <pod-name> -n csdr

If the cause is an OOM restart

  • If the OOM exception occurs in the csdr-velero-*** pod during restoration and many applications are running in the restore cluster, such as dozens of production namespaces, the OOM exception may occur because Velero uses an informer cache to accelerate the restore process by default, and the cache occupies memory.

    If the number of resources to be restored is small or you can accept some performance impact during restoration, you can run the following command to disable the Informer Cache feature:

    kubectl -nkube-system edit deploy migrate-controller

    Add the parameter --disable-informer-cache=true to the args of the migrate-controller container:

            name: migrate-controller
            args:
            - --disable-informer-cache=true
  • For other cases, or if you do not want to reduce the speed of cluster resource restoration, run the following command to adjust the Limit value of the corresponding Deployment.

    For csdr-controller-***, <deploy-name> is csdr-controller. For csdr-velero-***, <deploy-name> is csdr-velero. Replace <container-name> with the name of the container in the Deployment and <new-limit-memory> with the new memory limit, such as 1Gi.

    kubectl -n csdr patch deploy <deploy-name> -p '{"spec":{"template":{"spec":{"containers":[{"name":"<container-name>","resources":{"limits":{"memory":"<new-limit-memory>"}}}]}}}}'
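    After the patch is applied, you can verify the new memory limit. For example:

    kubectl -n csdr get deploy <deploy-name> -o jsonpath='{.spec.template.spec.containers[*].resources.limits.memory}'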

If the cause is that Cloud Backup (HBR) permissions are not configured, which causes the launch to fail

  1. Confirm that the cluster has activated the Cloud Backup service.

    • If Cloud Backup is not activated, activate the Cloud Backup service. For more information, see Cloud Backup.

    • If Cloud Backup is activated, proceed with the next step.

  2. If your cluster is an ACK dedicated cluster or a registered cluster, confirm that Cloud Backup permissions are configured.

  3. Run the following command to check whether the token required by the Cloud Backup client component exists.

    kubectl -n csdr describe pod <hbr-client-***>

    If the event error message couldn't find key HBR_TOKEN is returned, the token is missing. Perform the following steps to resolve the issue:

    1. Run the following command to query the node where hbr-client-*** is located:

      kubectl get pod <hbr-client-***> -n csdr -owide
    2. Run the following command to change the csdr.alibabacloud.com/agent-enable label of the corresponding node from true to false:

      kubectl label node <node-name> csdr.alibabacloud.com/agent-enable=false --overwrite
      Important
      • When you back up and restore applications again, the token is automatically created and hbr-client is launched.

      • If you copy the token from another cluster to the current cluster, the hbr-client pod that it starts will not work. Delete the copied token and the hbr-client-*** pod that was started by this token, and then perform the preceding steps again.

Cause 2 and solution: Cluster snapshot permissions are not configured for disk volume backup

When you back up the disk volume that is mounted to your application, if the backup task remains in the InProgress state for a long period of time, run the following command to query the newly created VolumeSnapshot resources in the cluster:

kubectl get volumesnapshot -n <backup-namespace>

Sample output:

NAME                    READYTOUSE      SOURCEPVC         SOURCESNAPSHOTCONTENT         ...
<volumesnapshot-name>   true                              <volumesnapshotcontent-name>  ...

If the READYTOUSE field of all volumesnapshot resources remains false for a long time, perform the following steps:

  1. Log on to the ECS console and check whether the disk snapshot feature is enabled.

    • If the disk snapshot feature is not enabled, enable the disk snapshot feature in the corresponding region. For more information, see Activate ECS Snapshot.

    • If the disk snapshot feature is enabled, proceed with the next step.

  2. Check whether the Container Storage Interface (CSI) component of the cluster runs as normal.

    kubectl -nkube-system get pod -l app=csi-provisioner
  3. Check whether permissions to use disk snapshots are configured.

    ACK managed cluster

    1. Log on to the RAM console as a RAM user who has administrative rights.

    2. In the left-side navigation pane, choose Identities > Roles.

    3. On the Roles page, search for AliyunCSManagedBackupRestoreRole in the search box and check whether the authorization policy of the role contains the following policy content:

      {
        "Statement": [
          {
            "Effect": "Allow",
            "Action": [
              "hbr:CreateVault",
              "hbr:CreateBackupJob",
              "hbr:DescribeVaults",
              "hbr:DescribeBackupJobs2",
              "hbr:DescribeRestoreJobs",
              "hbr:SearchHistoricalSnapshots",
              "hbr:CreateRestoreJob",
              "hbr:AddContainerCluster",
              "hbr:DescribeContainerCluster",
              "hbr:CancelBackupJob",
              "hbr:CancelRestoreJob",
              "hbr:DescribeRestoreJobs2"
            ],
            "Resource": "*"
          },
          {
            "Effect": "Allow",
            "Action": [
              "ecs:CreateSnapshot",
              "ecs:DeleteSnapshot",
              "ecs:DescribeSnapshotGroups",
              "ecs:CreateAutoSnapshotPolicy",
              "ecs:ApplyAutoSnapshotPolicy",
              "ecs:CancelAutoSnapshotPolicy",
              "ecs:DeleteAutoSnapshotPolicy",
              "ecs:DescribeAutoSnapshotPolicyEX",
              "ecs:ModifyAutoSnapshotPolicyEx",
              "ecs:DescribeSnapshots",
              "ecs:DescribeInstances",
              "ecs:CopySnapshot",
              "ecs:CreateSnapshotGroup",
              "ecs:DeleteSnapshotGroup"
            ],
            "Resource": "*"
          },
          {
            "Effect": "Allow",
            "Action": [
              "oss:PutObject",
              "oss:GetObject",
              "oss:DeleteObject",
              "oss:GetBucket",
              "oss:ListObjects",
              "oss:ListBuckets",
              "oss:GetBucketStat"
            ],
            "Resource": "acs:oss:*:*:cnfs-oss*"
          }
        ],
        "Version": "1"
      }

    ACK dedicated cluster

    1. Log on to the Container Service for Kubernetes (ACK) console. In the left-side navigation pane, click Clusters.

    2. On the Clusters page, find the cluster that you want to manage and click its name. In the left-side pane, click Cluster Information.

    3. On the Cluster Information page, find the Master RAM Role parameter and click the link on the right.

    4. On the Permissions tab, check whether the disk snapshot permissions are normal.

      If the k8sMasterRolePolicy-Csi-*** policy does not exist or does not include the following permissions, attach the following disk snapshot policy to the master RAM role. For more information, see Create custom policies and Grant permissions to a RAM role.

    • {
        "Statement": [
          {
            "Effect": "Allow",
            "Action": [
              "hbr:CreateVault",
              "hbr:CreateBackupJob",
              "hbr:DescribeVaults",
              "hbr:DescribeBackupJobs2",
              "hbr:DescribeRestoreJobs",
              "hbr:SearchHistoricalSnapshots",
              "hbr:CreateRestoreJob",
              "hbr:AddContainerCluster",
              "hbr:DescribeContainerCluster",
              "hbr:CancelBackupJob",
              "hbr:CancelRestoreJob",
              "hbr:DescribeRestoreJobs2"
            ],
            "Resource": "*"
          },
          {
            "Effect": "Allow",
            "Action": [
              "ecs:CreateSnapshot",
              "ecs:DeleteSnapshot",
              "ecs:DescribeSnapshotGroups",
              "ecs:CreateAutoSnapshotPolicy",
              "ecs:ApplyAutoSnapshotPolicy",
              "ecs:CancelAutoSnapshotPolicy",
              "ecs:DeleteAutoSnapshotPolicy",
              "ecs:DescribeAutoSnapshotPolicyEX",
              "ecs:ModifyAutoSnapshotPolicyEx",
              "ecs:DescribeSnapshots",
              "ecs:DescribeInstances",
              "ecs:CopySnapshot",
              "ecs:CreateSnapshotGroup",
              "ecs:DeleteSnapshotGroup"
            ],
            "Resource": "*"
          },
          {
            "Effect": "Allow",
            "Action": [
              "oss:PutObject",
              "oss:GetObject",
              "oss:DeleteObject",
              "oss:GetBucket",
              "oss:ListObjects",
              "oss:ListBuckets",
              "oss:GetBucketStat"
            ],
            "Resource": "acs:oss:*:*:cnfs-oss*"
          }
        ],
        "Version": "1"
      }
    • After the permission configuration is complete, if the problem is still not resolved, please submit a ticket for processing.

    Registered cluster

    Only registered clusters whose nodes are all Alibaba Cloud Elastic Compute Service (ECS) instances can use the disk snapshot feature. Check whether the related permissions are granted when you install the CSI storage plug-in. For more information, see Step 1: Grant a RAM user the permissions to manage the CSI plug-in.

Cause 3 and solution: Storage volumes other than disk volumes are used

In migrate-controller 1.7.7 and later versions, backups of disk volumes can be restored across regions. Backups of other volume types cannot be restored across regions. If you are using a storage service that supports Internet access, such as OSS, you can create a statically provisioned PV and PVC and then restore the application. For more information, see Mount a statically provisioned ossfs 1.0 volume.

The status of the backup task is failed and the "backup already exists in OSS bucket" error is returned

Issue

The status of the backup task is Failed and the "backup already exists in OSS bucket" error is returned.

Cause

A backup with the same name is stored in the OSS bucket associated with the backup vault.

A backup may be invisible in the current cluster due to the following reasons:

  • Backups in ongoing backup tasks and failed backup tasks are not synchronized to other clusters.

  • If you delete a backup in a cluster other than the backup cluster, the backup file in the OSS bucket is labeled but not deleted. The labeled backup file will not be synchronized to newly associated clusters.

  • The current cluster is not associated with the backup vault that stores the backup, which means that the backup vault is not initialized.

Solution

Create a backup vault with a new name.

The status of the backup task is failed and the "get target namespace failed" error is returned

Issue

The status of the backup task is Failed and the "get target namespace failed" error is returned.

Cause

In most cases, this error occurs in backup tasks that are created on a schedule. The cause varies based on how you select namespaces.

  • If you select Include, all the selected namespaces are deleted.

  • If you select Exclude, no namespace other than the selected namespaces exists in the cluster.

Solution

Modify the backup plan to change the method that is used to select namespaces and change the namespaces that you have selected.

The status of the backup task is failed and the "velero backup process timeout" error is returned

Issue

The status of the backup task is Failed and the "velero backup process timeout" error is returned.

Cause

  • Cause 1: The subtask of the application backup times out. The duration of a subtask varies based on the amount of cluster resources and the response latency of the API server. In migrate-controller 1.7.7 and later, the default timeout period of subtasks is 60 minutes.

  • Cause 2: The storage class of the bucket used by the backup vault is Archive Storage, Cold Archive, or Deep Cold Archive. To ensure data consistency during the backup process, files that record metadata must be updated by the backup center component on the OSS server. The backup center component cannot update files that are not restored.

Solution

  • Solution 1: Modify the global configuration of the subtask timeout period in the backup cluster.

    Run the following command to add the velero_timeout_minutes configuration item to applicationBackup. The unit is minutes.

    kubectl edit -n csdr cm csdr-config

    For example, the following code block sets the timeout period to 100 minutes:

    apiVersion: v1
    data:
      applicationBackup: |
        ... #Details not shown.
        velero_timeout_minutes: 100

    After you modify the timeout period, run the following command to restart csdr-controller for the modification to take effect:

    kubectl -n csdr delete pod -l control-plane=csdr-controller
  • Solution 2: Change the storage class of the bucket used by the backup vault to Standard.

    If you want to store backup data in Archive Storage, you can configure a lifecycle rule to automatically convert the storage class and restore the data before restoration. For more information, see Convert storage classes.

The status of the backup task is failed and the "HBR backup request failed" error is returned

Issue

The status of the backup task is Failed and the "HBR backup request failed" error is returned.

Cause

  • Cause 1: The storage plug-in used by the cluster is not compatible.

  • Cause 2: Cloud Backup does not support backing up volumes whose volumeMode is Block. For more information, see Volume Mode.

  • Cause 3: The Cloud Backup client is abnormal, which causes the backup or restore task for file system volumes, such as OSS volumes, File Storage NAS (NAS) volumes, Cloud Parallel File Storage (CPFS) volumes, or local volumes, to time out or fail.

Solution

  • Solution 1: If your cluster uses a non-Alibaba Cloud CSI storage plug-in, or if the persistent volume (PV) is not a common Kubernetes storage volume such as NFS or LocalVolume, and you encounter compatibility issues, please submit a ticket for assistance.

  • Solution 2: Please submit a ticket for processing.

  • Solution 3: Perform the following steps:

    1. Log on to the Cloud Backup console.

    2. In the left-side navigation pane, choose Backup > Container Backup. On the Container Backup page, click the Backup Jobs tab.

    3. In the top navigation bar, select a region.

  4. On the Backup Jobs tab, select Job Name from the drop-down list next to the search box, enter <backup-name>-hbr in the search box, and click the search icon. After the backup task is displayed, you can view the task status. If the task is in an abnormal state, the cause of the anomaly is displayed. For more information, see Back up ACK clusters.

      Note

      If you want to query a StorageClass conversion task or backup task, search for the corresponding backup name.

The status of the backup task is failed and the "HBR get empty backup info" error is returned

Issue

The status of the backup task is Failed and the "HBR get empty backup info" error is returned.

Cause

In hybrid cloud scenarios, the backup center uses the standard Kubernetes volume mount path as the data backup path by default. For example, for the standard CSI storage driver, the default mount path is /var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~csi/<pv-name>/mount. The same applies to storage drivers that are officially supported by Kubernetes, such as NFS and FlexVolume.

In this case, /var/lib/kubelet is the default kubelet root path. If you modify this path in your Kubernetes cluster, Cloud Backup may not be able to access the data to be backed up.

Solution

Log on to the node where the volume is mounted and perform the following steps to troubleshoot the issue:

  1. Check whether the kubelet root path of the node is changed

    1. Run the following command to query the kubelet startup command:

      ps -elf | grep kubelet

      If the startup command contains the --root-dir parameter, the value of this parameter is the kubelet root path.

      If the startup command contains the --config parameter, the value of this parameter is the kubelet configuration file. If the file contains the root-dir field, the value of this field is the kubelet root path.

    2. If the startup command does not contain root path information, query the content of the kubelet service startup file /etc/systemd/system/kubelet.service. If the file contains the EnvironmentFile field, such as:

      EnvironmentFile=-/etc/kubernetes/kubelet

      The environment variable configuration file is /etc/kubernetes/kubelet. Query the content of the configuration file. If the file contains the following content:

      ROOT_DIR="--root-dir=/xxx"

      The kubelet root path is /xxx.

    3. If you cannot find any changes, the kubelet root path is the default path /var/lib/kubelet.

  2. Run the following command to check whether the kubelet root path is a symbolic link to another path:

    ls -al <root-dir>

    If the output is similar to the following content:

    lrwxrwxrwx   1 root root   26 Dec  4 10:51 kubelet -> /var/lib/container/kubelet

    The actual root path is /var/lib/container/kubelet.

  3. Verify that the data of the target storage volume exists under the root path.

    Make sure that the volume mount path <root-dir>/pods/<pod-uid>/volumes exists and that the subpath of the target type of storage volume exists under the path, such as kubernetes.io~csi or kubernetes.io~nfs.

  4. Add the environment variable KUBELET_ROOT_PATH=/var/lib/container/kubelet/pods to the csdr-controller Deployment in the csdr namespace. In this example, /var/lib/container/kubelet is the actual kubelet root path that you obtained by checking the kubelet configuration and the symbolic link.
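    For example, assuming /var/lib/container/kubelet is the actual root path, the environment variable could be added as follows:

    kubectl -n csdr set env deployment/csdr-controller KUBELET_ROOT_PATH=/var/lib/container/kubelet/pods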

The status of the backup task is failed and the "check backup files in OSS bucket failed" or "upload backup files to OSS bucket failed" or "download backup files from OSS bucket failed" error is returned

Issue

The status of the backup task is Failed and the "upload backup files to OSS bucket failed" error is returned.

Cause

The OSS server returns an error when the component checks, uploads, or downloads backup files in the OSS bucket associated with the backup vault. The issue may arise due to one of the following causes:

  • Cause 1: Data encryption is enabled for the OSS bucket, but the related KMS permissions are not granted.

  • Cause 2: Some read and write permissions are missing when you install the component and configure permissions for ACK dedicated clusters and registered clusters.

  • Cause 3: The authentication credential of the RAM user that is used to configure permissions for ACK dedicated clusters and registered clusters is revoked.

Solution

  • Cause 1: Grant the backup center component the permissions to use the KMS key that is used to encrypt the OSS bucket.

  • Cause 2 and Cause 3: Check the OSS read and write permissions that are configured for the ACK dedicated cluster or registered cluster and make sure that the AccessKey pair that you use is valid. For more information, see Install migrate-controller and grant permissions.

If the issue persists, submit a ticket for processing.

The status of the backup task is PartiallyFailed and the "PROCESS velero partially completed" error is returned

Issue

The status of the backup task is PartiallyFailed and the "PROCESS velero partially completed" error is returned.

Cause

When you use the velero component to back up applications (resources in the cluster), the component fails to back up some resources.

Solution

Run the following command to identify the resources that fail to be backed up and the cause of the failure:

 kubectl -n csdr exec -it $(kubectl -n csdr get pod -l component=csdr | tail -n 1 | cut -d ' ' -f1) -- ./velero describe backup <backup-name>

Fix the issue based on the content in the Errors and Warnings fields in the output.

If no direct cause of the failure is displayed, run the following command to obtain the related exception logs:

 kubectl -n csdr exec -it $(kubectl -n csdr get pod -l component=csdr | tail -n 1 | cut -d ' ' -f1) -- ./velero backup logs <backup-name>

If you cannot fix the problem based on the failure reason or abnormal logs, submit a ticket for processing.

The status of the backup task is PartiallyFailed and the "PROCESS hbr partially completed" error is returned

Issue

The status of the backup task is PartiallyFailed and the error message "PROCESS hbr partially completed" is returned.

Cause

When you use Cloud Backup to back up file system volumes, such as OSS volumes, NAS volumes, CPFS volumes, or local volumes, Cloud Backup fails to back up some resources. The issue may arise due to one of the following causes:

  • Cause 1: The storage plug-in used by some volumes is not supported.

  • Cause 2: Cloud Backup does not guarantee data consistency. If files are deleted during backup, the backup may fail.

Solution

  1. Log on to the Cloud Backup console.

  2. In the left-side navigation pane, choose Backup > Container Backup. On the Container Backup page, click the Backup Jobs tab.

  3. In the top navigation bar, select a region.

  4. On the Backup Jobs tab, select Job Name from the drop-down list next to the search box, enter <backup-name>-hbr in the search box, and click the search icon. If the volume backup fails or partially fails, you can view the cause. For more information, see Back up ACK clusters.

The status of the StorageClass conversion task is failed and the "storageclass xxx not exists" error is returned

Issue

The status of the StorageClass conversion task is Failed and the "storageclass xxx not exists" error is returned.

Cause

The target StorageClass that you select for StorageClass conversion does not exist in the current cluster.
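You can run the following command to list the StorageClasses that exist in the current cluster:

kubectl get storageclass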

Solution

  1. Run the following command to reset the StorageClass conversion task:

    cat << EOF | kubectl apply -f -
    apiVersion: csdr.alibabacloud.com/v1beta1
    kind: DeleteRequest
    metadata:
      name: reset-convert
      namespace: csdr
    spec:
      deleteObjectName: "<backup-name>"
      deleteObjectType: "Convert"
    EOF
  2. Create the desired StorageClass in the current cluster.

  3. Run the restore task again and configure StorageClass conversion.

The status of the StorageClass conversion task is failed and the "only support convert to storageclass with CSI diskplugin or nasplugin provisioner" error is returned

Issue

The status of the StorageClass conversion task is Failed and the error message "only support convert to storageclass with CSI diskplugin or nasplugin provisioner" is returned.

Cause

The target StorageClass that you select for StorageClass conversion is not an Alibaba Cloud CSI disk volume or NAS volume.

Solution

  • The current version supports snapshot-based conversion and restoration only for disk and NAS volumes by default. If you have other restoration requirements, submit a ticket.

  • If you are using a storage service that supports Internet access, such as OSS, you can create a statically provisioned PV and PVC and then restore the application without the StorageClass conversion step. For more information, see Mount a statically provisioned ossfs 1.0 volume.

The status of the StorageClass conversion task is failed and the "current cluster is multi-zoned" error is returned

Issue

The status of the StorageClass conversion task is Failed and the "current cluster is multi-zoned" error is returned.

Cause

The current cluster is a multi-zone cluster, and the target StorageClass of the conversion is a disk type whose volumeBindingMode is set to Immediate. With Immediate binding, the disk volume is created before the pod is scheduled. In a multi-zone cluster, the disk may be created in a zone that differs from the zone of the node to which the pod is scheduled. As a result, the pod cannot be scheduled and remains in the Pending state. For more information about the volumeBindingMode field, see StorageClass.

Solution

  1. Run the following command to reset the StorageClass conversion task:

    cat << EOF | kubectl apply -f -
    apiVersion: csdr.alibabacloud.com/v1beta1
    kind: DeleteRequest
    metadata:
      name: reset-convert
      namespace: csdr
    spec:
      deleteObjectName: "<backup-name>"
      deleteObjectType: "Convert"
    EOF
  2. If you want to convert to a disk StorageClass:

    • If you use the console, select alicloud-disk. alicloud-disk uses the alicloud-disk-topology-alltype StorageClass by default.

    • If you use the command line, select the alicloud-disk-topology-alltype type. alicloud-disk-topology-alltype is the default StorageClass provided by the CSI storage plug-in. You can also use another StorageClass whose volumeBindingMode is set to WaitForFirstConsumer.

  3. Run the restore task again and configure StorageClass conversion.

The status of the restore task is failed and the "multi-node writing is only supported for block volume" error is returned

Issue

The status of the restore or StorageClass conversion task is Failed and the error message "multi-node writing is only supported for block volume. For Kubernetes users, if unsure, use ReadWriteOnce access mode in PersistentVolumeClaim for disk volume" is returned.

Cause

To prevent the risk of a disk being forcibly detached when it is mounted to another node, CSI checks the AccessModes configuration of disk volumes during mounting and prohibits the ReadWriteMany and ReadOnlyMany configurations.

The application to be backed up mounts a volume whose AccessMode is ReadWriteMany or ReadOnlyMany (mostly network storage that supports multiple mounts, such as OSS or NAS). When you restore the application to Alibaba Cloud disk storage that does not support multiple mounts by default, CSI may throw the preceding error.

Specifically, the following three scenarios may cause this error:

Scenario 1: The CSI version of the backup cluster is earlier (or the cluster uses the FlexVolume storage plug-in). Earlier CSI versions do not check the AccessModes field of Alibaba Cloud disk volumes during mounting, which causes the original disk volume to report an error when it is restored in a cluster with a later CSI version.

Scenario 2: The custom StorageClass used by the backup volume does not exist in the restore cluster. Based on the default matching rules of the component, the volume is restored as an Alibaba Cloud disk volume in the new cluster.

Scenario 3: During restoration, you use the StorageClass conversion feature to manually specify that the backup volume is restored as an Alibaba Cloud disk volume.

Solution

Scenario 1: Starting from v1.8.4, the backup component supports automatic conversion of the AccessModes field of disk volumes to ReadWriteOnce. Upgrade the backup center component and then restore the application again.

Scenario 2: Automatic restoration of the StorageClass by the component in the target cluster may risk data inaccessibility or data overwriting. Create a StorageClass with the same name in the target cluster before restoration, or use the StorageClass conversion feature to specify the StorageClass to be used during restoration.

Scenario 3: When you restore a network storage volume as a disk volume, configure the convertToAccessModes parameter to convert AccessModes to ReadWriteOnce. For more information, see convertToAccessModes: the list of target AccessModes.

The status of the restore task is failed and the "only disk type PVs support cross-region restore in current version" error is returned

Issue

The status of the restore task is Failed and the error message "only disk type PVs support cross-region restore in current version" is returned.

Cause

In migrate-controller 1.7.7 and later versions, backups of disk volumes can be restored across regions. Backups of other volume types cannot be restored across regions.

Solution

  • If you are using a storage service that supports Internet access, such as OSS, you can create a statically provisioned PV and PVC and then restore the application. For more information, see Mount a statically provisioned ossfs 1.0 volume.

  • If you need to recover other types of storage data across regions, please submit a ticket.

The status of the restore task is failed and the "ECS snapshot cross region request failed" error is returned

Issue

The status of the restore task is Failed and the "ECS snapshot cross region request failed" error is returned.

Cause

In migrate-controller 1.7.7 and later versions, backups of disk volumes can be restored across regions, but the permissions to use Elastic Compute Service (ECS) disk snapshots are not granted.

Solution

If your cluster is an ACK dedicated cluster or a registered cluster that is connected to a self-managed Kubernetes cluster deployed on ECS instances, you must grant the permissions to use ECS disk snapshots. For more information, see Registered cluster.

The status of the restore task is failed and the "accessMode of PVC xxx is xxx" error is returned

Issue

The status of the restore task is Failed and the "accessMode of PVC xxx is xxx" error is returned.

Cause

The AccessMode of the disk volume to be restored is set to ReadOnlyMany (read-only multi-mount) or ReadWriteMany (read-write multi-mount).

When you restore the disk volume, the new volume is mounted by using CSI. Take note of the following items when you use the current version of CSI:

  • Only volumes with the multiAttach feature enabled can be mounted to multiple instances.

  • Volumes whose VolumeMode is set to Filesystem (mounted by using a file system such as ext4 or xfs) can only be mounted to multiple instances in read-only mode.

For more information about disk storage, see Use a dynamically provisioned disk volume.
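For example, you can check the AccessModes of the volume to be restored in the backup cluster with a command similar to the following:

kubectl -n <namespace> get pvc <pvc-name> -o jsonpath='{.spec.accessModes}'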

Solution

  • If you are using the StorageClass conversion feature to convert a volume that supports multiple mounts, such as an OSS or NAS volume, to a disk volume, and you want to ensure that different replicas of your business can normally share data on the volume, we recommend that you create a new restore task and select alibabacloud-cnfs-nas as the target type for StorageClass conversion. This way, a NAS volume managed by CNFS is used. For more information, see Use CNFS to manage NAS file systems (recommended).

  • If the backup cluster used an earlier CSI version that did not check AccessModes when the disk PV was backed up, the backed-up persistent volume may not meet the creation requirements of the current CSI version. In this case, we recommend that you migrate your original workloads to dynamically provisioned disk volumes. This helps avoid the risk of forced disk detachment when pods are scheduled to other nodes. If you have more questions or requirements about multi-mount scenarios, submit a ticket for assistance.

The status of the restore task is completed but some resources are not created in the restore cluster

Issue

The status of the restore task is Completed but some resources are not created in the restore cluster.

Cause

  • Cause 1: The resource is not backed up.

  • Cause 2: The resource is excluded during restoration based on the configuration.

  • Cause 3: The application restore subtask partially fails.

  • Cause 4: The resource is successfully restored but is recycled due to the ownerReferences configuration or business logic.

Solution

Solution 1:

Run the following command to view the backup details:

 kubectl -n csdr exec -it $(kubectl -n csdr get pod -l component=csdr | tail -n 1 | cut -d ' ' -f1) -- ./velero describe backup <backup-name> --details

Check whether the target resource is backed up. If the target resource is not backed up, check whether it is excluded due to the namespace, resource, or other configurations specified in the backup task, and then back up the resource again. By default, cluster-level resources of running applications (pods) in namespaces that are not selected are not backed up. If you want to back up all cluster-level resources, see Cluster-level backup.

Solution 2:

If the target resource is not restored, check whether it is excluded due to the namespace, resource, or other configurations specified in the restore task, and then restore the resource again.

Solution 3:

Run the following command to identify the resources that fail to be restored and the cause of the failure:

 kubectl -n csdr exec -it $(kubectl -n csdr get pod -l component=csdr | tail -n 1 | cut -d ' ' -f1) -- ./velero describe restore <restore-name> 

Fix the issues according to the prompts in the Errors and Warnings fields in the outputs. If you cannot fix the issues based on the failure reasons, submit a ticket for processing.

Solution 4:

Check the audit of the corresponding resource to determine whether it is abnormally deleted after it is created.
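In addition to checking the audit logs, you can check whether the restored resource carries an ownerReferences configuration that may cause it to be recycled. For example:

kubectl -n <namespace> get <resource-type> <resource-name> -o jsonpath='{.metadata.ownerReferences}'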

The migrate-controller component in a cluster that uses FlexVolume cannot be launched

migrate-controller does not support clusters that use FlexVolume. To use the backup center feature, migrate from FlexVolume to CSI first.

If you need to back up applications in a cluster that uses FlexVolume and restore the applications in a cluster that uses CSI during the migration from FlexVolume to CSI, see Use the backup center to migrate applications in an ACK cluster that runs an old Kubernetes version.

Can I modify the backup vault?

You cannot modify the backup vault. If you want to modify the backup vault, you can only delete the current one and create a backup vault with another name.

Because the backup vault is shared, it may be in the Backup or Restore state at any time. If you modify a parameter of the backup vault, the system may fail to find the required data when backing up or restoring an application. Therefore, you cannot modify the backup vault or create backup vaults that use the same name.

Can I associate a backup vault with an OSS bucket whose name is not in the cnfs-oss-* format?

For clusters other than ACK dedicated clusters and registered clusters, the backup center component has read and write permissions on OSS buckets whose names are in the cnfs-oss-* format by default. To prevent backups from overwriting existing data in the bucket, we recommend that you create a dedicated OSS bucket whose name is in the cnfs-oss-* format for the backup center.

  1. If you want to associate a backup vault with an OSS bucket whose name is not in the cnfs-oss-* format, you must configure permissions for the component. For more information, see ACK dedicated cluster.

  2. After you grant permissions, run the following command to restart the backup service component:

    kubectl -n csdr delete pod -l control-plane=csdr-controller
    kubectl -n csdr delete pod -l component=csdr

    If you have created a backup vault that is associated with an OSS bucket whose name is not in the cnfs-oss-* format, wait until the connectivity check is complete and the status changes to Available before you attempt to back up or restore applications. The interval of connectivity checks is about five minutes. You can run the following command to query the status of the backup vault:

    kubectl -n csdr get backuplocation

    Expected result:

    NAME                    PHASE       LAST VALIDATED   AGE
    a-test-backuplocation   Available   7s               6d1h

How do I specify the backup cycle when I create a backup plan?

The backup cycle supports crontab expressions (such as 1 4 * * *) and interval-based backup (such as 6h30m, which means that a backup is created every 6 hours and 30 minutes).

The following describes how crontab expressions are parsed. The optional values of the fields are the same as those in standard crontab expressions, and the first field specifies the minute (0 to 59). * indicates any available value for the given field. Sample crontab expressions:

  • 1 4 * * *: Create a backup at 4:01 AM every day.

  • 0 2 15 * *: Create a backup at 2:00 AM on the 15th day of each month.

 *  *  *  *  *
 |  |  |  |  |
 |  |  |  |  ·----- day of week (0 - 6) (Sun to Sat)
 |  |  |  ·-------- month (1 - 12)
 |  |  ·----------- day of month (1 - 31)
 |  ·-------------- hour (0 - 23)
 ·----------------- minute (0 - 59)

What changes are made to the YAML files of resources when I run a restore task?

When you restore resources, the following changes are made to the YAML files of resources:

Change 1:

If the size of a disk volume is less than 20 GiB, the volume size is changed to 20 GiB.

Change 2:

Services are restored based on Service types:

  • NodePort Services: By default, Service ports are retained when you restore Services across clusters.

  • LoadBalancer Services: When ExternalTrafficPolicy is set to Local, HealthCheckNodePort uses a random port by default. If you want to retain the port number, set spec.preserveNodePorts: true when you create a restore task, as shown in the fragment after this list.

    • If you restore a Service that uses an existing Server Load Balancer (SLB) instance in the backup cluster, the restored Service uses the same SLB instance and disables the listeners by default. You need to log on to the SLB console to configure the listeners.

    • If you restore a Service whose SLB instance is managed by CCM in the backup cluster, CCM creates a new SLB instance. For more information, see Considerations for configuring a LoadBalancer Service.
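For example, if you create the restore task with kubectl, the field described above would be set under spec. The following fragment is a minimal sketch and omits the other fields of the restore task:

  spec:
    preserveNodePorts: true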

How do I view backup resources?

Resources in cluster application backups

The YAML files in the cluster are stored in the OSS bucket associated with the backup vault. You can use one of the following methods to view backup resources:

  • Run the following commands in a cluster to which backup files are synchronized to view backup resources. The first command returns the name of the csdr-velero pod, which you use in the second command:

    kubectl -n csdr get pod -l component=csdr | tail -n 1 | cut -d ' ' -f1
    kubectl -n csdr exec -it <csdr-velero-pod-name> -c velero -- ./velero describe backup <backup-name> --details
  • View backup resources in the ACK console:

    1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

    2. On the Clusters page, find the cluster that you want to manage and click its name. In the left-side pane, choose Operations > Application Backup.

    3. On the Application Backup page, click the Backup Records tab. In the Backup Records column, click the backup record that you want to view.

Resources in disk volume backups

  1. Log on to the ECS console.

  2. In the left-side navigation pane, choose Storage & Snapshots > Snapshots.

  3. In the top navigation bar, select the region and resource group of the resource that you want to manage.

  4. On the Snapshots page, query snapshots based on the disk ID.

Resources in non-disk volume backups

  1. Log on to the Cloud Backup console.

  2. In the left-side navigation pane, choose Backup > Container Backup.

  3. In the top navigation bar, select a region.

  4. View the basic information of cluster backups.

    • Clusters: The list of clusters that have been backed up and protected. Click ACK Cluster ID to view the protected persistent volume claims (PVCs). For more information about PVCs, see Persistent volume claim (PVC).

      If Client Status is abnormal, Cloud Backup is not running as expected in the ACK cluster. Go to the DaemonSets page in the ACK console to troubleshoot the issue.

    • Backup Jobs: The status of backup jobs.

Can I back up applications in a cluster that runs an earlier Kubernetes version and restore the applications in a cluster that runs a later Kubernetes version?

Yes, you can.

By default, when you back up resources, all API versions supported by the resources are backed up. For example, a Deployment in a cluster that runs Kubernetes 1.16 supports extensions/v1beta1, apps/v1beta1, apps/v1beta2, and apps/v1. When you back up the Deployment, the backup vault stores all four API versions regardless of which version you use when you create the Deployment. The KubernetesConvert feature is used for API version conversion.

When you restore resources, the API version recommended by the restore cluster is used for restoration. For example, if you restore the preceding Deployment in a cluster that runs Kubernetes 1.28 and the recommended API version is apps/v1, the restored Deployment will use apps/v1.

Important

If no API version is supported by both clusters, you must manually deploy the resource. For example, Ingresses in clusters that run Kubernetes 1.16 support extensions/v1beta1 and networking.k8s.io/v1beta1. You cannot restore the Ingresses in clusters that run Kubernetes 1.22 or later because Ingresses in these clusters support only networking.k8s.io/v1. For more information about Kubernetes API version migration, see official documentation. Due to API version compatibility issues, we recommend that you do not use the backup center to migrate applications from clusters with newer Kubernetes versions to clusters with older Kubernetes versions. We also recommend that you do not migrate applications from clusters with Kubernetes versions earlier than 1.16 to clusters with newer Kubernetes versions.
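
Before you migrate applications, you can check which API versions the restore cluster supports. For example:

# List the API versions supported by the restore cluster.
kubectl api-versions | grep -E 'apps|networking'
# Check the supported API version of a specific resource type, such as Ingress.
kubectl api-resources | grep -i ingress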

Is traffic automatically switched to SLB instances during restoration?

No, traffic is not automatically switched.

Services are restored based on Service types:

  • NodePort Services: By default, Service ports are retained when you restore Services across clusters.

  • LoadBalancer Services: When ExternalTrafficPolicy is set to Local, HealthCheckNodePort uses a random port by default. If you want to retain the port number, set spec.preserveNodePorts: true when you create a restore task.

    • If you restore a Service that uses an existing Server Load Balancer (SLB) instance in the backup cluster, the restored Service uses the same SLB instance and disables the listeners by default. You need to log on to the SLB console to configure the listeners.

    • If you restore a Service whose SLB instance is managed by CCM in the backup cluster, CCM creates a new SLB instance. For more information, see Considerations for configuring a LoadBalancer Service.

By default, after listeners are disabled or new SLB instances are used, traffic is not automatically switched to the new SLB instances. If you use other cloud services or third-party service discovery and do not want automatic service discovery to switch traffic to the new SLB instances, you can exclude Service resources during backup and manually deploy them when you need to switch traffic.
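
For example, if you exclude Service resources during backup, you can export a Service from the backup cluster and deploy it in the restore cluster when you are ready to switch traffic. The following is a minimal sketch; replace the namespace and Service name with your own:

# Export the Service from the backup cluster.
kubectl -n <namespace> get service <service-name> -o yaml > service.yaml
# Remove cluster-specific fields such as status, resourceVersion, and uid, then deploy the Service in the restore cluster.
kubectl -n <namespace> apply -f service.yaml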

Why are resources in the csdr, ack-csi-fuse, kube-system, kube-public, and kube-node-lease namespaces not backed up by default?

  • csdr is the namespace of the backup center. If you directly back up and restore this namespace, components will fail to work in the restore cluster. Additionally, the backup center has a backup synchronization logic, which means you do not need to manually migrate backups to a new cluster.

  • ack-csi-fuse is the namespace of the CSI storage component and is used to run the FUSE client pods maintained by CSI. When you restore storage in a new cluster, the CSI component of the new cluster automatically creates the corresponding client pods. You do not need to manually back up and restore this namespace.

  • kube-system, kube-public, and kube-node-lease are the default system namespaces of Kubernetes clusters. Due to differences in cluster parameters and configurations, you cannot restore these namespaces across clusters. Additionally, the backup center is used to back up and restore applications. Before you run a restore task, you must install and configure system components in the restore cluster, such as:

    • Container Registry password-free image pulling component: You need to grant permissions to and configure acr-configuration in the restore cluster.

    • Application Load Balancer (ALB) Ingresses: You need to configure ALBConfigs.

    If you directly back up system components in the kube-system namespace to a new cluster, the system components may fail to run in the new cluster.

Does the backup center use ECS disk snapshots to back up disk volumes? What is the default type of snapshots?

In the following scenarios, the backup center uses ECS disk snapshots to back up disk volumes by default:

  1. The cluster is an ACK managed cluster or ACK dedicated cluster.

  2. The cluster runs Kubernetes 1.18 or later and uses CSI 1.18 or later.

In other scenarios, the backup center uses Cloud Backup to back up disk volumes by default.

Disk snapshots created by the backup center have the instant access feature enabled by default. The validity period of disk snapshots is the same as the validity period specified in the backup configuration by default. Starting from October 12, 2023, 11:00, Alibaba Cloud no longer charges for snapshot instant access storage or snapshot instant access operations in all regions. For more information, see Use the instant access feature.

Why is the validity period of ECS disk snapshots created from backups different from the validity period specified in the backup configuration?

The creation of disk snapshots depends on the csi-provisioner component or managed-csiprovisioner component of a cluster. If the version of the csi-provisioner component is earlier than 1.20.6, you cannot specify the validity period or enable the instant access feature when you create VolumeSnapshots. In this case, the validity period in the backup configuration does not affect disk snapshots.

Therefore, when you use the volume data backup feature for disk volumes, you must upgrade the csi-provisioner component to 1.20.6 or later.
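
You can check the installed version first. The following is a minimal check, assuming that your cluster uses the csi-provisioner Deployment in the kube-system namespace rather than the managed-csiprovisioner component:

# Query the image version of the csi-provisioner component.
kubectl -n kube-system get deployment csi-provisioner -o jsonpath='{.spec.template.spec.containers[*].image}'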

If csi-provisioner cannot be upgraded to this version, you can configure the default snapshot validity period in the following ways:

  1. Update the backup center component migrate-controller to v1.7.10 or later.

  2. Run the following command to check whether a VolumeSnapshotClass whose retentionDays is 30 exists in the cluster:

    kubectl get volumesnapshotclass csdr-disk-snapshot-with-default-ttl
    • If the VolumeSnapshotClass exists, set the retentionDays parameter of the csdr-disk-snapshot-with-default-ttl VolumeSnapshotClass to 30.

    • If the VolumeSnapshotClass does not exist, create a VolumeSnapshotClass named csdr-disk-snapshot-with-default-ttl by using the following YAML:

      apiVersion: snapshot.storage.k8s.io/v1
      deletionPolicy: Retain
      driver: diskplugin.csi.alibabacloud.com
      kind: VolumeSnapshotClass
      metadata:
        name: csdr-disk-snapshot-with-default-ttl
      parameters:
        retentionDays: "30"
  3. After the configuration is complete, all disk volume backups created in the cluster will create disk snapshots with the same validity period as the retentionDays field.

    Important

    If you want the validity period of ECS disk snapshots created from backups to be the same as the validity period specified in the backup configuration, we recommend that you upgrade the csi-provisioner component to 1.20.6 or later.
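
After you complete the preceding configuration, you can verify that the retentionDays parameter is set as expected:

kubectl get volumesnapshotclass csdr-disk-snapshot-with-default-ttl -o jsonpath='{.parameters.retentionDays}'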

In what scenarios do I need to back up volumes when I back up applications?

What is volume data backup?

Volume data is backed up to cloud storage using ECS disk snapshots or the Cloud Backup service. When you restore the application, the data is stored in a new disk or NAS file system for the restored application to use. The restored application and the original application do not share data sources and do not affect each other.

If you do not need a copy of the data, or if you want the backup application and the restored application to share the same data source, you can choose not to back up volume data. In this case, make sure that the list of excluded resources in the backup does not include PVC and PV resources. During restoration, the volumes are deployed to the new cluster based on the original YAML files.

In what scenarios do I need to back up volumes?

  • You want to implement data replication and disaster recovery.

  • The storage type is disk volume, because a disk can be attached to only a single node at a time.

  • You want to implement cross-region backup and restoration. In most cases, storage types other than OSS do not support cross-region access.

  • You want to isolate data between the backup application and the restored application.

  • The storage plug-ins or versions of the backup cluster and the restore cluster are significantly different, and the YAML files cannot be directly restored.

What are the risks of not backing up volumes for stateful applications?

If you do not back up volumes when you back up stateful applications, the following behaviors occur during restoration:

  • For volumes whose reclaim policy is Delete:

    Similar to when you deploy a PVC for the first time, if the restore cluster has a corresponding StorageClass, CSI automatically creates a new PV. For example, for disk storage, a new empty disk is mounted to the restored application. For static volumes that do not have a StorageClass specified or if the restore cluster does not have a corresponding StorageClass, the restored PVC and pod remain in the Pending state until you manually create a corresponding PV or StorageClass.

  • For volumes whose reclaim policy is Retain:

    During restoration, resources are restored in the order of PV first and then PVC based on the original YAML files. For storage that supports multiple mounts, such as NAS and OSS, the original file system or bucket can be directly reused. For disks, there may be a risk of forced disk detachment.

You can run the following command to query the reclaim policy of volumes:

kubectl get pv -o=custom-columns=CLAIM:.spec.claimRef.name,NAMESPACE:.spec.claimRef.namespace,NAME:.metadata.name,RECLAIMPOLICY:.spec.persistentVolumeReclaimPolicy

Expected result:

CLAIM               NAMESPACE           NAME                                       RECLAIMPOLICY
www-web-0           default             d-2ze53mvwvrt4o3xxxxxx                     Delete
essd-pvc-0          default             d-2ze5o2kq5yg4kdxxxxxx                     Delete
www-web-1           default             d-2ze7plpd4247c5xxxxxx                     Delete
pvc-oss             default             oss-e5923d5a-10c1-xxxx-xxxx-7fdf82xxxxxx   Retain
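
If you need to change the reclaim policy of a volume before you back it up, you can use a standard kubectl patch command. For example, the following command sets the policy of a PV to Retain; replace <pv-name> with the name of your PV:

kubectl patch pv <pv-name> -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'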

How do I select nodes that can be used to back up file systems in data protection?

By default, when you back up storage volumes other than Alibaba Cloud disk volumes, Cloud Backup is used for data backup and restoration. In this case, a Cloud Backup task must be executed on a node. The default scheduling policy of ACK Scheduler is the same as that of the Kubernetes scheduler. You can also configure tasks to be scheduled only to specific nodes based on your business requirements.

Note
  • Cloud Backup tasks cannot be scheduled to virtual nodes.

  • By default, backup tasks are low-priority tasks. For the same backup task, a maximum of one volume backup task can be executed on a node.

Node scheduling policies of the backup center

  • exclude policy (default): By default, all nodes can be used for backup and restoration. If you do not want Cloud Backup tasks to be scheduled to specific nodes, add the csdr.alibabacloud.com/agent-excluded="true" label to the nodes.

    kubectl label node <node-name-1> <node-name-2>  csdr.alibabacloud.com/agent-excluded="true"
  • include policy: By default, nodes without labels cannot be used for backup and restoration. Add the csdr.alibabacloud.com/agent-included="true" label to nodes that are allowed to execute Cloud Backup tasks.

    kubectl label node <node-name-1> <node-name-2>  csdr.alibabacloud.com/agent-included="true"
  • prefer policy: By default, all nodes can be used for backup and restoration. The scheduling priority is as follows:

    1. Nodes with the csdr.alibabacloud.com/agent-included="true" label have the highest priority.

    2. Nodes without special labels have the second highest priority.

    3. Nodes with the csdr.alibabacloud.com/agent-excluded="true" label have the lowest priority.
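
To revert a node to the default behavior of a policy, remove the corresponding label from the node. For example:

kubectl label node <node-name> csdr.alibabacloud.com/agent-excluded-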

Change the node selection policy

  1. Run the following command to edit the csdr-config ConfigMap:

    kubectl -n csdr edit cm csdr-config

    Add the node_schedule_policy configuration to the applicationBackup configuration. Example:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: csdr-config
      namespace: csdr
    data:
      applicationBackup: |
        backup_max_worker_num: 15
        restore_max_worker_num: 5
        delete_max_worker_num: 30
        schedule_max_worker_num: 20
        convert_max_worker_num: 15
        node_schedule_policy: include  # Add this configuration. Valid values: include, exclude, and prefer.
      pvBackup: |
        batch_snapshot_max_num: 20
        enable_ecs_snapshot: "true"
  2. Run the following command to restart the csdr-controller Deployment for the configuration to take effect:

    kubectl -n csdr delete pod -lapp=csdr-controller
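
    You can then confirm that the csdr-controller pod is recreated and running:

    kubectl -n csdr get pod -l app=csdr-controller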

What are the scenarios for application backup and data protection?

Application backup:

  • You want to back up your business in your cluster, including applications, Services, and configuration files.

  • Optional: When you back up an application, you want to also back up the volumes mounted to the application.

    Note

    The application backup feature does not back up volumes that are not mounted to pods.

    If you want to back up applications and all volumes, you can create data protection backup tasks.

  • You want to migrate applications between clusters and quickly restore applications for disaster recovery.

Data protection:

  • You want to back up only volumes, that is, only PVC and PV resources.

  • You want to restore PVCs. The restored PVCs are independent of the backed-up data: when you use the backup center to restore a deleted PVC, a new disk is created and the data on the disk is identical to the data in the backup file. The mount parameters of the new PVC remain unchanged, so the new PVC can be directly mounted to applications.

  • You want to implement data replication and disaster recovery.

Does the backup center support data encryption for associated OSS buckets? How do I grant the permissions to use Key Management Service (KMS) for server-side encryption?

OSS buckets support both server-side encryption and client-side encryption. However, the backup center supports only server-side encryption for OSS buckets. You can manually enable server-side encryption for the OSS bucket that you associate with the backup center and configure the encryption method in the OSS console. For more information about server-side encryption for OSS buckets and how to enable it, see Server-side encryption.

  • If you use a customer master key (CMK) managed by KMS for encryption and decryption and bring your own key (BYOK), which means that you specify a CMK ID, you must grant the backup center permissions to access KMS. Perform the following steps:

    • Create a custom policy. For more information, see Create custom policies.

      {
        "Version": "1",
        "Statement": [
          {
            "Effect": "Allow",
            "Action": [
              "kms:List*",
              "kms:DescribeKey",
              "kms:GenerateDataKey",
              "kms:Decrypt"
            ],
            "Resource": [
              "acs:kms:*:141661496593****:*"
            ]
          }
        ]
      }

      The preceding policy allows the backup center to use all KMS keys that belong to the specified Alibaba Cloud account. If you need a more fine-grained Resource configuration, see Authorization information.

    • For ACK dedicated clusters and registered clusters, grant permissions to the RAM user that is used during installation. For more information, see Grant permissions to a RAM user. For other clusters, grant permissions to the AliyunCSManagedBackupRestoreRole role. For more information, see Grant permissions to a RAM role.

  • If you use a KMS key managed by OSS or use a key fully managed by OSS for encryption and decryption, you do not need to grant additional permissions.

How do I change the images used by applications during restoration?

Assume that the image used by the application in the backup is: docker.io/library/app1:v1

  • Change the image repository address (registry)

    In hybrid cloud scenarios, you may need to deploy an application across the clouds of multiple cloud service providers or you may need to migrate an application from the data center to the cloud. In this case, you must upload the image used by the application to an image repository on Container Registry.

    Use the imageRegistryMapping field to map the original registry address to the new registry address. For example, the following configuration changes the image to registry.cn-beijing.aliyuncs.com/my-registry/app1:v1.

    docker.io/library/: registry.cn-beijing.aliyuncs.com/my-registry/
  • Change the image repository (repository) and version

    Changing the image repository and version is an advanced feature. Before you create a restore task, you must specify the change details in a ConfigMap.

    If you want to change the image from app1:v1 to app2:v2, create the following ConfigMap:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: <ConfigMap name>
      namespace: csdr
      labels:
        velero.io/plugin-config: ""
        velero.io/change-image-name: RestoreItemAction
    data:
      "case1":"app1:v1,app2:v2"
      # If you want to change only the image repository, use the following setting.
      # "case1": "app1,app2"
      # If you want to change only the image version, use the following setting.
      # "case1": "v1:v2"
      # If you want to change only an image in an image repository, use the following setting.
      # "case1": "docker.io/library/app1:v1,registry.cn-beijing.aliyuncs.com/my-registry/app2:v2"

    If you have multiple change requirements, you can continue to configure case2, case3, and so on in the data field.

    After the ConfigMap is created, create a restore task as normal and leave the imageRegistryMapping field empty.

    Note

    The changes take effect on all restore tasks in the cluster. We recommend that you configure fine-grained modifications based on the preceding description. For example, configure image changes within a single repository. If the ConfigMap is no longer required, delete it.
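
    For example, assuming the ConfigMap uses the placeholder name shown above, run the following command to delete it:

    kubectl -n csdr delete configmap <ConfigMap name>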