Container Service for Kubernetes: FAQ about the backup center

Last Updated: Sep 04, 2024

This topic provides answers to some frequently asked questions about the backup center.

Table of contents

Category and issue

  • Common operations: Obtain error messages.

  • Console

  • Backup

  • StorageClass conversion (FKA snapshot creation)

  • Restoration

  • Others

Common operations

Note

If you use the kubectl CLI to access the backup center, you need to update the backup component named migrate-controller to the latest version before troubleshooting issues. The update does not affect existing backups. For more information about how to update the component, see Manage components.

If the status of your backup, StorageClass conversion (FKA snapshot creation), or restore task is Failed or PartiallyFailed, you can use the following methods to obtain the error message.

  • Move the pointer over Failed or PartiallyFailed in the Status column to view the brief error information, such as RestoreError: snapshot cross region request failed.

  • To view the detailed error information, run one of the following commands to query the resource events of the task. The events contain messages such as RestoreError: process advancedvolumesnapshot failed avs: snapshot-hz, err: transition canceled with error: the ECS-snapshot related ram policy is missing.

    • Backup tasks

      kubectl -n csdr describe applicationbackup <backup-name> 
    • StorageClass conversion (FKA snapshot creation) tasks

      kubectl -n csdr describe converttosnapshot <backup-name>
    • Restore tasks

      kubectl -n csdr describe applicationrestore <restore-name>
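The three commands above differ only in the resource kind. As a minimal sketch, the helper below maps a task type to the kind used above and prints the matching command instead of running it. The function names csdr_kind and csdr_describe_cmd and the task type keywords are illustrative assumptions, not part of the backup center.

```shell
# Map a task type to the csdr resource kind used in the commands above.
# The function names and the keywords backup/convert/restore are illustrative.
csdr_kind() {
  case "$1" in
    backup)  echo "applicationbackup" ;;
    convert) echo "converttosnapshot" ;;
    restore) echo "applicationrestore" ;;
    *)       echo "unknown task type: $1" >&2; return 1 ;;
  esac
}

# Print (do not run) the describe command for a task of the given type and name.
csdr_describe_cmd() {
  kind=$(csdr_kind "$1") || return 1
  echo "kubectl -n csdr describe $kind $2"
}
```

For example, csdr_describe_cmd restore my-restore prints kubectl -n csdr describe applicationrestore my-restore, which you can review before you run it.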

What do I do if the console prompts "The components are abnormal." or "Failed to retrieve the current data."?

Symptom

The console prompts "The components are abnormal." or "Failed to retrieve the current data."

Causes

The installation of the backup center component is abnormal.

Solutions

What do I do if the console displays the following error: The name is already used. Change the name and try again?

Symptom

When you create or delete a backup, StorageClass conversion (FKA snapshot creation), or restore task, the console displays The name is already used. Change the name and try again.

Causes

When you delete a task in the console, a deleterequest resource is created in the cluster. The corresponding component performs multiple delete operations, including deleting the backup resources. For more information about how to use kubectl to perform relevant operations, see Use kubectl to back up and restore data.

If errors occur when the component performs delete operations or processes the deleterequest resource, some resources in the cluster are retained. Consequently, a resource with the same name may exist.

Solutions

  • Delete the resource with the same name as prompted. For example, if the deleterequests.csdr.alibabacloud.com "xxxxx-dbr" already exists error is displayed, you can run the following command to delete the resource:

    kubectl -n csdr delete deleterequests xxxxx-dbr
  • Create a task with a new name.

What do I do if no existing backup can be selected when I restore an application across clusters?

Symptom

When you restore an application across clusters, no existing backup can be selected to restore the application.

Causes

  • Cause 1: The backup vault is not associated with the current cluster, which means that the backup vault is not initialized.

    When the system initializes the backup vault, it synchronizes the basic information about the backup vault, including the Object Storage Service (OSS) bucket information, to the cluster. Then, the system initializes the backup files from the backup vault in the cluster. You can select a backup file from the backup vault to restore the application only after the backup vault is initialized.

  • Cause 2: The initialization of the backup vault fails, which means that the backuplocation resource in the current cluster is in the Unavailable state.

  • Cause 3: The backup task has not been completed or the backup task failed.

Solutions

  • Solution 1:

In the Create Restoration Task panel, click Initialize Backup Vault to the right of Backup Vaults, wait until the backup vault is initialized, and then select a backup file.

  • Solution 2:

Run the following command to query the status of the backuplocation resource:

kubectl get -ncsdr backuplocation <backuplocation-name> 

Expected output:

NAME                    PHASE       LAST VALIDATED   AGE
<backuplocation-name>   Available   3m36s            38m

If the status is Unavailable, refer to What do I do if the status of the task is Failed and the "VaultError: xxx" error is returned?.

  • Solution 3:

Log on to the console of the backup cluster and check whether the status of the backup task is Completed. If the status of the backup task is abnormal, troubleshoot the issue. For more information, see Table of contents.
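The check in Solution 2 can also be scripted. The following sketch reads the table printed by the kubectl get command in Solution 2 and extracts the PHASE column. It assumes the column layout shown in the expected output. The function names are hypothetical.

```shell
# Read the table printed by `kubectl get -n csdr backuplocation NAME` on stdin
# and print the PHASE column of the first data row.
# Assumes the column layout shown in the expected output of Solution 2.
backuplocation_phase() {
  awk 'NR == 2 { print $2 }'
}

# Succeed only when the phase read from stdin is Available.
backuplocation_is_available() {
  [ "$(backuplocation_phase)" = "Available" ]
}
```

Pipe the output of the command in Solution 2 into backuplocation_is_available. A non-zero exit status means that the backup vault is not in the Available state.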

What do I do if the console prompts that the dependent service-linked role of the current component is not assigned?

Symptom

When you access the application backup page, the console prompts that the dependent service-linked role of the current component is not assigned. The error code AddonRoleNotAuthorized is displayed.

Causes

The cloud resource authentication logic is optimized in the backup center component migrate-controller 1.8.0 for ACK managed clusters. If this is the first time you install or update to this component version with your Alibaba Cloud account, you must complete cloud resource authorization for your account.

Solutions

  • If you log on to the console with an Alibaba Cloud account, click Copy Authorization Link and click Authorize to grant permissions to your Alibaba Cloud account.

  • If you log on to the console with a RAM user, click Copy Authorization Link and send the link to the corresponding Alibaba Cloud account to complete authorization.

What do I do if the status of the task is Failed and the "internal error" error is returned?

To troubleshoot this issue, submit a ticket.

What do I do if the status of the task is Failed and the "create cluster resources timeout" error is returned?

Symptom

The status of the task is Failed and the "create cluster resources timeout" error is returned.

Causes

When the system runs a StorageClass conversion (FKA snapshot creation) or restore task, it may create temporary pods, persistent volume claims (PVCs), and persistent volumes (PVs). The "create cluster resources timeout" error is returned if these resources remain unavailable for a long period of time.

Solutions

  1. Run the following command to locate the abnormal resource and find the cause based on the events:

    kubectl -ncsdr describe <applicationbackup/converttosnapshot/applicationrestore> <task-name> 

    Expected output:

    ……wait for created tmp pvc default/demo-pvc-for-convert202311151045 for convertion bound time out

    The output indicates that the PVC used to convert the StorageClass remains in a state other than Bound for a long period of time. The namespace of the PVC is default and the name of the PVC is demo-pvc-for-convert202311151045.

  2. Run the following command to query the status of the PVC and locate the cause:

    kubectl -ndefault describe pvc demo-pvc-for-convert202311151045 

    The following list describes common causes of this issue that are related to the backup center. For more information, see Storage troubleshooting.

    • Cluster or node resources are insufficient or abnormal.

    • The StorageClass does not exist in the restore cluster. In this case, create a StorageClass conversion (FKA snapshot creation) task to convert the current StorageClass to an existing StorageClass in the restore cluster.

    • The storage resource associated with the StorageClass is unavailable. For example, the specified disk type is not supported in the current zone.

    • The CNFS system associated with alibabacloud-cnfs-nas is abnormal. For more information, see Use CNFS to manage NAS file systems (recommended).

    • You selected a StorageClass whose volumeBindingMode is Immediate when you restore an application in a multi-zone cluster.
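The event message in step 1 embeds the temporary PVC as namespace/name. As a rough sketch, the hypothetical helper below pulls both parts out of such a message so that they can be passed to the command in step 2. It assumes the message wording shown in the sample output above, which may differ across component versions.

```shell
# Extract "NAMESPACE NAME" of the temporary PVC from event text on stdin, e.g.
#   wait for created tmp pvc default/demo-pvc for convertion bound time out
# The pattern relies on the wording of the sample output in step 1 (an assumption).
tmp_pvc_from_event() {
  sed -n 's|.*tmp pvc \([^/ ]*\)/\([^ ]*\) .*|\1 \2|p'
}
```

Lines that do not contain the phrase tmp pvc are ignored, so the full output of the describe command in step 1 can be piped into the function.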

What do I do if the status of the task is Failed and the "addon status is abnormal" error is returned?

Symptom

The status of the task is Failed and the "addon status is abnormal" error is returned.

Causes

The components in the csdr namespace are abnormal.

Solutions

For more information, see Cause 1 and solution: The components in the csdr namespace are abnormal.

What do I do if the status of the task is Failed and the "VaultError: xxx" error is returned?

Symptom

The status of the backup, restore, or snapshot conversion task is Failed, and the VaultError: backup vault is unavailable: xxx error is displayed.

Causes

  • The specified OSS bucket does not exist.

  • The cluster does not have permissions to access OSS.

  • The network of the OSS bucket is unreachable.

Solutions

  1. Log on to the OSS console. Check whether the OSS bucket that is associated with the backup vault exists.

    If the OSS bucket does not exist, create one and associate it with the backup vault. For more information, see Create a bucket.

  2. Check whether the cluster has permissions to access OSS.

    • Container Service for Kubernetes (ACK) Pro clusters: No OSS permissions are required. Make sure that the OSS buckets associated with the backup vault are named in the cnfs-oss-** format.

    • ACK dedicated clusters and registered clusters: OSS permissions are required. For more information, see Install migrate-controller and grant permissions.

    If you use methods other than the console to install migrate-controller 1.8.0 or update to this version in an ACK managed cluster, OSS permissions may be missing. You can run the following command to check whether a cluster has permissions to access OSS:

    kubectl get secret -n kube-system | grep addon.aliyuncsmanagedbackuprestorerole.token

    Expected output:

    addon.aliyuncsmanagedbackuprestorerole.token          Opaque                      1      62d

    If the returned content is the same as the preceding expected output, the cluster has permissions to access OSS. You only need to specify an OSS bucket that is named in the cnfs-oss-** format for the cluster.

    If the preceding content is not returned, use the following method to complete authorization.

    • Grant OSS permissions to ACK dedicated clusters and registered clusters. For more information, see Install migrate-controller and grant permissions.

    • If you use an Alibaba Cloud account, click Authorize to complete authorization. You only need to perform the authorization once for each Alibaba Cloud account.

    Note

    You cannot create a backup vault that uses the same name as a deleted one. You cannot associate a backup vault with an OSS bucket that is not named in the cnfs-oss-** format. If your backup vault is already associated with an OSS bucket that is not named in the cnfs-oss-** format, create another backup vault that uses a different name and associate the backup vault with an OSS bucket whose name meets the requirement.

  3. Run the following command to check the network configuration of the cluster:

    kubectl get backuplocation <backuplocation-name> -n csdr -o yaml | grep network

    Expected output:

    network: internal
    • When network is set to internal, the backup vault accesses the OSS bucket over the internal network.

    • When network is set to public, the backup vault accesses the OSS bucket over the Internet. If you access an OSS bucket over the Internet and a connection timeout error is returned, check if the cluster has access to the Internet. For more information, see Enable an existing ACK cluster to access the Internet.

    In the following scenarios, you must configure the backup vault to access the OSS bucket over the Internet:

    • The cluster and OSS bucket are deployed in different regions.

    • The cluster is an ACK Edge cluster.

    • The cluster is a registered cluster and is not connected to a virtual private cloud (VPC) through Cloud Enterprise Network (CEN), Express Connect, or VPN connections, or the cluster is a registered cluster connected to a VPC but no route points to the internal network of the region where the OSS bucket resides. In this case, you must configure a route that points to the internal network of the region where the OSS bucket resides.

    To configure the backup vault to access the OSS bucket over the Internet, run the following commands to set network to public. Replace <backuplocation-name> with the actual backup vault name and <region-id> with the region ID of the OSS bucket, such as cn-hangzhou.

    kubectl patch -ncsdr backuplocation/<backuplocation-name> --type='json' -p   '[{"op":"add","path":"/spec/config","value":{"network":"public","region":"<region-id>"}}]'
    kubectl patch -ncsdr backupstoragelocation/<backuplocation-name> --type='json' -p   '[{"op":"add","path":"/spec/config","value":{"network":"public","region":"<region-id>"}}]'
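Both commands above send the same JSON patch. To avoid typing the patch twice, it can be generated once for a given region ID, as in the following sketch. The function name public_network_patch is illustrative. The patch body mirrors the commands above.

```shell
# Print the JSON patch used in the two commands above for the region ID in $1.
# The function name is illustrative; the patch body is copied from this topic.
public_network_patch() {
  printf '[{"op":"add","path":"/spec/config","value":{"network":"public","region":"%s"}}]\n' "$1"
}

# Example (not executed here): store the patch and pass it to both commands with -p "$patch".
#   patch=$(public_network_patch cn-hangzhou)
```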

What do I do if the status of the backup, restore, or snapshot conversion task is Failed and the "backup location is not ok, please maybe check access oss" error is returned?

Symptom

The status of the backup, restore, or snapshot conversion task is Failed and the "backup location is not ok, please maybe check access oss" error is returned.

Causes and solutions

Kubernetes 1.20 and later

Cause

The version of migrate-controller is outdated.

Solutions

Update migrate-controller to the latest version. For more information, see Manage components.

Kubernetes versions earlier than 1.20

Cause

  • The OSS subdirectory that is associated with a backup vault cannot be a parent or child directory of the OSS subdirectory that is associated with another backup vault. For example, you cannot use directories / and /A or directories /A and /A/B at the same time. In addition, the OSS subdirectories that are associated with backup vaults can store only backups generated by the backup center. If you store other data in the OSS subdirectories, the backup vault becomes unavailable.

  • The same cause described in What do I do if the status of the task is Failed and the "VaultError: xxx" error is returned?.
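The parent and child rule in the first cause can be checked before you associate a subdirectory with a backup vault. The following sketch compares two subdirectory paths written in the same form as the examples above (/, /A, /A/B). The function names are hypothetical, and identical directories are treated as non-conflicting because this topic does not cover that case.

```shell
# Normalize a subdirectory path: drop trailing slashes, then append exactly one.
norm_dir() {
  d=$1
  while [ "${d%/}" != "$d" ]; do d=${d%/}; done
  printf '%s/\n' "$d"
}

# Succeed when one directory is a parent or child of the other.
dirs_conflict() {
  a=$(norm_dir "$1")
  b=$(norm_dir "$2")
  # Identical directories are outside the rule stated in this topic.
  if [ "$a" = "$b" ]; then return 1; fi
  case "$a" in "$b"*) return 0 ;; esac
  case "$b" in "$a"*) return 0 ;; esac
  return 1
}
```

With this sketch, dirs_conflict / /A and dirs_conflict /A /A/B succeed, whereas dirs_conflict /A /B and dirs_conflict /A /AB fail because neither directory contains the other.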

Solutions

Run the following command to check whether the OSS subdirectory that is associated with the backup vault contains data other than backups generated by the backup center. Replace <backuplocation-name> with the actual backup vault name.

kubectl describe backupstoragelocation <backuplocation-name> -n csdr | grep message

Expected output:

Backup store contains invalid top-level directories: ****

The output indicates that other data is stored in the OSS directories associated with the backup vault. To resolve this issue, use one of the following methods:

  • Update the Kubernetes version of the cluster to Kubernetes 1.20 or later and update migrate-controller to the latest version.

  • Create a backup vault that uses a different name and is not associated with an OSS subdirectory. Do not delete the existing backup vault and then create one with the same name.

What do I do if the backup, restore, or snapshot conversion task remains in the Inprogress state for a long period of time?

Cause 1 and solution: The components in the csdr namespace are abnormal

Check the status of the components and identify the cause.

  1. Run the following command to check whether the components in the csdr namespace are restarted or cannot be launched:

    kubectl get pod -n csdr
  2. Run the following command to locate the cause:

    kubectl describe pod <pod-name> -n csdr
  • If the components are restarted due to an out of memory (OOM) error, perform the following steps:

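    Run the following command to modify the memory limit of the Deployment. For csdr-controller-*** pods, set <deploy-name> to csdr-controller. For csdr-velero-*** pods, set <deploy-name> to csdr-velero. Replace <container-name> with the name of the container in the Deployment.

    kubectl -n csdr patch deploy <deploy-name> -p '{"spec":{"template":{"spec":{"containers":[{"name":"<container-name>","resources":{"limits":{"memory":"<new-limit-memory>"}}}]}}}}'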
  • If the components cannot be launched due to insufficient Cloud Backup permissions, perform the following steps:

    1. Make sure that Cloud Backup is activated for the cluster.

      • If Cloud Backup is not activated, activate it. For more information, see Cloud Backup.

      • If Cloud Backup is activated, proceed to the next step.

    2. Make sure that the ACK dedicated cluster or registered cluster has Cloud Backup permissions.

    3. Run the following command to check whether the token required by the Cloud Backup client exists:

      kubectl -n csdr describe pod <hbr-client-***>

      If a couldnt find key HBR_TOKEN event is generated, the token does not exist. Perform the following steps to resolve the issue:

      1. Run the following command to query the node that hosts hbr-client-***:

        kubectl get pod <hbr-client-***> -n csdr -owide
      2. Run the following command to change the value of the csdr.alibabacloud.com/agent-enable label from true to false for the node.

        kubectl label node <node-name> csdr.alibabacloud.com/agent-enable=false --overwrite
        Important
        • When the system reruns the backup or restore task, the system automatically creates a token and launches hbr-client.

        • You cannot launch hbr-client by copying a token from another cluster to the current cluster. You need to delete the copied token and the corresponding hbr-client-*** pod and repeat the preceding steps.

Cause 2 and solution: The cluster does not have permissions to create disk snapshots

If you back up the disk volume that is mounted to your application but the backup task remains in the Inprogress state for a long period of time, run the following command to query the newly created VolumeSnapshots in the cluster:

kubectl get volumesnapshot -n <backup-namespace>

Expected output:

NAME                    READYTOUSE      SOURCEPVC         SOURCESNAPSHOTCONTENT         ...
<volumesnapshot-name>   true                              <volumesnapshotcontent-name>  ...

If the READYTOUSE state of all VolumeSnapshots remains false for a long period of time, perform the following steps:

  1. Log on to the Elastic Compute Service (ECS) console and check whether the disk snapshot feature is enabled.

    • If the feature is disabled, enable the feature in the corresponding region. For more information, see Activate ECS Snapshot.

    • If the feature is enabled, proceed to the next step.

  2. Check whether the CSI component of the cluster runs as normal.

    kubectl -nkube-system get pod -l app=csi-provisioner
  3. Check whether the permissions to use disk snapshots are granted.

    ACK dedicated clusters

    1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

    2. On the Clusters page, find the cluster that you want to manage and click its name. In the left-side pane, click Cluster Information.

    3. On the Cluster Information page, click the Cluster Resources tab. Click the link next to Master RAM Role to go to the permission management page.

    4. On the Policies page, check whether the permissions to use disk snapshots are granted.

      • If the k8sMasterRolePolicy-Csi-*** policy exists and provides the permissions to use disk snapshots, the required permissions are granted. In this case, submit a ticket.

      • If the k8sMasterRolePolicy-Csi-*** policy does not exist, attach the following policy to the master RAM role to grant the permissions to use disk snapshots. For more information, see Create custom policies and Grant permissions to a RAM role.

        {
            "Version": "1",
            "Statement": [
                {
                    "Action": [
                        "ecs:DescribeDisks",
                        "ecs:DescribeInstances",
                        "ecs:DescribeAvailableResource",
                        "ecs:DescribeInstanceTypes",
                        "nas:DescribeFileSystems",
                        "ecs:CreateSnapshot",
                        "ecs:DeleteSnapshot",
                        "ecs:DescribeSnapshotGroups",
                        "ecs:CreateAutoSnapshotPolicy",
                        "ecs:ApplyAutoSnapshotPolicy",
                        "ecs:CancelAutoSnapshotPolicy",
                        "ecs:DeleteAutoSnapshotPolicy",
                        "ecs:DescribeAutoSnapshotPolicyEX",
                        "ecs:ModifyAutoSnapshotPolicyEx",
                        "ecs:DescribeSnapshots",
                        "ecs:CopySnapshot",
                        "ecs:CreateSnapshotGroup",
                        "ecs:DeleteSnapshotGroup"
                    ],
                    "Resource": [
                        "*"
                    ],
                    "Effect": "Allow"
                }
            ]
        }
    5. If the issue persists after you perform the preceding steps, submit a ticket.

    ACK managed clusters

    1. Log on to the RAM console as a RAM user who has administrative rights.

    2. In the left-side navigation pane, choose Identities > Roles.

    3. On the Roles page, enter AliyunCSManagedCsiRole in the search box. Check whether the policy of the role contains the following content:

      {
          "Version": "1",
          "Statement": [
              {
                  "Action": [
                      "ecs:DescribeDisks",
                      "ecs:DescribeInstances",
                      "ecs:DescribeAvailableResource",
                      "ecs:DescribeInstanceTypes",
                      "nas:DescribeFileSystems",
                      "ecs:CreateSnapshot",
                      "ecs:DeleteSnapshot",
                      "ecs:DescribeSnapshotGroups",
                      "ecs:CreateAutoSnapshotPolicy",
                      "ecs:ApplyAutoSnapshotPolicy",
                      "ecs:CancelAutoSnapshotPolicy",
                      "ecs:DeleteAutoSnapshotPolicy",
                      "ecs:DescribeAutoSnapshotPolicyEX",
                      "ecs:ModifyAutoSnapshotPolicyEx",
                      "ecs:DescribeSnapshots",
                      "ecs:CopySnapshot",
                      "ecs:CreateSnapshotGroup",
                      "ecs:DeleteSnapshotGroup"
                  ],
                  "Resource": [
                      "*"
                  ],
                  "Effect": "Allow"
              }
          ]
      }

    Registered clusters

    The disk snapshot feature is available only for registered clusters that contain only ECS nodes. Check whether you have the permissions to use the CSI plug-in. For more information, see Grant a RAM user the permissions to manage the CSI plug-in.
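To recheck the READYTOUSE state described at the start of Cause 2 after you complete the preceding steps, the VolumeSnapshots that are still not ready can be listed with a small filter. The sketch below reads one "name readyToUse" pair per line. The function name is hypothetical, and the jsonpath query in the comment relies on the standard status.readyToUse field of VolumeSnapshot objects.

```shell
# Read "NAME READYTOUSE" lines on stdin and print the names that are not ready.
# A missing second field (status not populated yet) is treated as not ready.
snapshots_not_ready() {
  awk 'NF > 0 && $2 != "true" { print $1 }'
}

# Example input source (not executed here):
#   kubectl get volumesnapshot -n <backup-namespace> \
#     -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.readyToUse}{"\n"}{end}'
```

If the filter prints nothing, all VolumeSnapshots in the namespace are ready.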

Cause 3 and solution: Volume types other than disk volumes are used

In migrate-controller 1.7.7 and later versions, backups of disk volumes can be restored across regions. Backups of other volume types cannot be restored across regions. If you are using a storage service that supports public access, such as OSS, you can create a statically provisioned PV and PVC and then restore the application. For more information, see Mount a statically provisioned OSS volume.

What do I do if the status of the backup task is Failed and the "backup already exists in OSS bucket" error is returned?

Symptom

The status of the backup task is Failed and the "backup already exists in OSS bucket" error is returned.

Causes

A backup with the same name is stored in the OSS bucket associated with the backup vault.

Reasons why the backup is invisible in the current cluster:

  • Backups in ongoing backup tasks and failed backup tasks are not synchronized to other clusters.

  • If you delete a backup in a cluster other than the backup cluster, the backup file in the OSS bucket is labeled but not deleted. The labeled backup file will not be synchronized to newly associated clusters.

  • The current cluster is not associated with the backup vault that stores the backup, which means that the backup vault is not initialized.

Solutions

Recreate a backup vault with another name.

What do I do if the status of the backup task is Failed and the "get target namespace failed" error is returned?

Symptom

The status of the backup task is Failed and the "get target namespace failed" error is returned.

Causes

In most cases, this error occurs in backup tasks that are created at a scheduled time. The cause varies based on how you select namespaces.

  • If you created an include list, the cause is that all selected namespaces are deleted.

  • If you created an exclude list, the cause is that no namespace other than the excluded namespaces exists in the cluster.
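The two cases above amount to the same condition: the set of namespaces to back up is empty when the scheduled task runs. The following sketch models that condition so that a backup plan can be checked against the namespaces that currently exist. It is an illustration of the logic described here, not the component's implementation, and the function name and arguments are hypothetical.

```shell
# Compute the namespaces that a scheduled backup would target.
#   $1: mode, "include" or "exclude"
#   $2: space-separated namespaces selected in the backup plan
#   $3: space-separated namespaces that currently exist in the cluster
# Prints the targets and fails when none remain, which corresponds to the
# situation reported as "get target namespace failed".
target_namespaces() {
  mode=$1
  selected=" $2 "
  targets=""
  for ns in $3; do
    case "$selected" in
      *" $ns "*) hit=yes ;;
      *)         hit=no ;;
    esac
    if [ "$mode" = "include" ] && [ "$hit" = "yes" ]; then targets="$targets $ns"; fi
    if [ "$mode" = "exclude" ] && [ "$hit" = "no" ]; then targets="$targets $ns"; fi
  done
  targets=${targets# }
  [ -n "$targets" ] || return 1
  echo "$targets"
}
```

For example, target_namespaces include "app db" "default app kube-system" prints app, whereas target_namespaces exclude "default kube-system" "default kube-system" fails because no namespace is left to back up.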

Solutions

Modify the backup plan to change the method that is used to select namespaces and change the namespaces that you have selected.

What do I do if the status of the backup task is Failed and the "velero backup process timeout" error is returned?

Symptom

The status of the backup task is Failed and the "velero backup process timeout" error is returned.

Causes

  • Cause 1: A subtask of the backup task times out. The duration of a subtask varies based on the amount of cluster resources and the response latency of the API server. In migrate-controller 1.7.7 and later, the default timeout period of subtasks is 60 minutes.

  • Cause 2: The storage class of the bucket used by the backup vault is Archive, Cold Archive, or Deep Cold Archive. To keep the backup process consistent, the component must update the files that record metadata on the OSS server. Files in these storage classes cannot be updated before they are restored.

Solutions

  • Solution 1: Modify the global timeout period of subtasks in the backup cluster.

    Run the following command to add velero_timeout_minutes to applicationBackup. Unit: minutes.

    kubectl edit -ncsdr cm csdr-config

    For example, the following code block sets the timeout period to 100 minutes:

    apiVersion: v1
    data:
      applicationBackup: |
        ... #Details not shown.
        velero_timeout_minutes: 100

    After you modify the timeout period, run the following command to restart csdr-controller for the modification to take effect:

    kubectl -ncsdr delete pod -l control-plane=csdr-controller
  • Solution 2: Change the storage class of the bucket used by the backup vault to Standard.

    If you want to store backup data in Archive, you can configure lifecycle rules to automatically convert storage classes and restore the data before recovery. For more information, see Convert storage classes.

What do I do if the status of the backup task is Failed and the "HBR backup request failed" error is returned?

Symptom

The status of the backup task is Failed and the "HBR backup request failed" error is returned.

Causes

  • Cause 1: The volume plug-in used by the cluster is incompatible with Cloud Backup.

  • Cause 2: Cloud Backup does not support creating backups for volumes whose volume mode is Block. For more information, see Volume Mode.

  • Cause 3: The Cloud Backup client encounters an exception. In this case, tasks that back up or restore file system volumes, such as OSS volumes, NAS volumes, CPFS volumes, or local volumes, will time out or fail.

Solutions

  • Solution 1: If your cluster does not use the CSI plug-in or the cluster does not use common Kubernetes volumes, such as NFS volumes or local volumes, submit a ticket.

  • Solution 2: Submit a ticket.

  • Solution 3: Perform the following steps:

    1. Log on to the Cloud Backup console.

    2. In the left-side navigation pane, choose Backup > Container Backup.

    3. In the top navigation bar, select a region.

    4. On the Backup Jobs tab, search for <backup-name>-hbr in the Job Name search box, check the status of the backup task, and identify the cause. For more information, see Back up ACK clusters.

      Note

      To query StorageClass conversion or backup tasks, search for the corresponding backup names.

What do I do if the status of the backup task is PartiallyFailed and the "PROCESS velero partially completed" error is returned?

Symptom

The status of the backup task is PartiallyFailed and the "PROCESS velero partially completed" error is returned.

Causes

When you use the velero component to back up applications (resources in the cluster), the component fails to back up some resources.

Solutions

Run the following command to query the resources that the component fails to back up and identify the cause:

 kubectl -ncsdr exec -it $(kubectl -ncsdr get pod -l component=csdr | tail -n 1 | cut -d ' ' -f1) -- ./velero describe backup <backup-name>

Fix the issue based on the content of the Errors and Warnings fields in the output.

If no errors are returned in the output, run the following command to obtain error logs:

 kubectl -ncsdr exec -it $(kubectl -ncsdr get pod -l component=csdr | tail -n 1 | cut -d ' ' -f1) -- ./velero backup logs <backup-name>

If you cannot identify the cause, submit a ticket.

What do I do if the status of the backup task is PartiallyFailed and the "PROCESS hbr partially completed" error is returned?

Symptom

The status of the backup task is PartiallyFailed and the "PROCESS hbr partially completed" error is returned.

Causes

When you use Cloud Backup to back up file system volumes, such as OSS volumes, NAS volumes, CPFS volumes, or local volumes, Cloud Backup fails to back up some resources. Possible causes:

  • Cause 1: The volume plug-in does not support Cloud Backup.

  • Cause 2: Cloud Backup cannot guarantee data consistency. If files are deleted during the backup process, the backup task fails.

Solutions

  1. Log on to the Cloud Backup console.

  2. In the left-side navigation pane, choose Backup > Container Backup.

  3. In the top navigation bar, select a region.

  4. On the Backup Jobs tab, search for <backup-name>-hbr in the Job Name search box and identify the cause of the backup failure. For more information, see Back up ACK clusters.

What do I do if the status of the StorageClass conversion (FKA snapshot creation) task is ConvertFailed and the "storageclass xxx not exists" error is returned?

Symptom

The status of the StorageClass conversion (FKA snapshot creation) task is ConvertFailed and the "storageclass xxx not exists" error is returned.

Causes

The StorageClass to which the current StorageClass is converted does not exist in the current cluster.

Solutions

  1. Run the following command to reset the StorageClass conversion (FKA snapshot creation) task:

    cat << EOF | kubectl apply -f -
    apiVersion: csdr.alibabacloud.com/v1beta1
    kind: DeleteRequest
    metadata:
      name: reset-convert
      namespace: csdr
    spec:
      deleteObjectName: "<backup-name>"
      deleteObjectType: "Convert"
    EOF
  2. Create the desired StorageClass in the current cluster.

  3. Rerun the StorageClass conversion (FKA snapshot creation) task.
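If you reset conversion tasks often, the manifest in step 1 can be generated from the task name. The sketch below prints the same DeleteRequest manifest with the name filled in. The function name reset_convert_manifest is illustrative, and all fields are copied from the example in step 1.

```shell
# Print a DeleteRequest manifest that resets the conversion task named in $1.
# All fields mirror the manifest in step 1; only the function name is made up.
reset_convert_manifest() {
  cat <<EOF
apiVersion: csdr.alibabacloud.com/v1beta1
kind: DeleteRequest
metadata:
  name: reset-convert
  namespace: csdr
spec:
  deleteObjectName: "$1"
  deleteObjectType: "Convert"
EOF
}
```

To apply the result, pipe it to kubectl, for example reset_convert_manifest my-backup | kubectl apply -f -.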

What do I do if the status of the StorageClass conversion (FKA snapshot creation) task is ConvertFailed and the "only support convert to storageclass with CSI diskplugin or nasplugin provisioner" error is returned?

Symptom

The status of the StorageClass conversion (FKA snapshot creation) task is ConvertFailed and the "only support convert to storageclass with CSI diskplugin or nasplugin provisioner" error is returned.

Causes

The StorageClass to which the current StorageClass is converted is not supported by the CSI component.

Solutions

  • The current CSI version supports snapshots of only disk volumes and NAS volumes. If you want to use snapshots of other volume types, submit a ticket.

  • If you are using a storage service that supports public access, such as OSS, you need to create a statically provisioned PVC and PV and then directly restore the application. No StorageClass conversion is needed. For more information, see Mount a statically provisioned OSS volume.

What do I do if the status of the StorageClass conversion (FKA snapshot creation) task is ConvertFailed and the "current cluster is multi-zoned" error is returned?

Symptom

The status of the StorageClass conversion (FKA snapshot creation) task is ConvertFailed and the "current cluster is multi-zoned" error is returned.

Causes

The current cluster is a multi-zone cluster. The target StorageClass of the conversion is a disk StorageClass whose volumeBindingMode is set to Immediate. In Immediate mode, the disk is created before the pod is scheduled, so in a multi-zone cluster the disk may reside in a zone that differs from the zone of the node selected for the pod. In this case, the pod cannot be scheduled to a node that can mount the disk and remains in the Pending state. For more information about the volumeBindingMode, see Disk volume overview.

Solutions

  1. Run the following command to reset the StorageClass conversion (FKA snapshot creation) task:

    cat << EOF | kubectl apply -f -
    apiVersion: csdr.alibabacloud.com/v1beta1
    kind: DeleteRequest
    metadata:
      name: reset-convert
      namespace: csdr
    spec:
      deleteObjectName: "<backup-name>"
      deleteObjectType: "Convert"
    EOF
  2. To convert to a disk StorageClass, use one of the following methods:

    • Console: Select alicloud-disk as the target. By default, alicloud-disk uses the alicloud-disk-topology-alltype StorageClass.

    • CLI: We recommend that you select alicloud-disk-topology-alltype, which is the default StorageClass provided by the CSI plug-in. You can also use a StorageClass whose volumeBindingMode is set to WaitForFirstConsumer.

  3. Rerun the StorageClass conversion (FKA snapshot creation) task.
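If you prefer a custom StorageClass over alicloud-disk-topology-alltype, the following sketch shows what a disk StorageClass with delayed binding could look like. The render_disk_storageclass function and the name example-disk-wffc are hypothetical, and the parameters.type value is an assumption that you should adjust to the disk category you use. The provisioner name is the disk driver name that appears elsewhere in this topic.

```shell
# Prints a StorageClass manifest for disk volumes that delays volume binding
# until a pod is scheduled. $1 is the StorageClass name (hypothetical).
# The parameters.type value is an assumption; adjust it to your disk category.
render_disk_storageclass() {
  cat << EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: $1
provisioner: diskplugin.csi.alibabacloud.com
parameters:
  type: cloud_essd
volumeBindingMode: WaitForFirstConsumer
EOF
}

render_disk_storageclass example-disk-wffc
```

To create the StorageClass, pipe the output to kubectl apply -f - in the same way as the reset command in step 1.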

What do I do if the status of the restore task is Failed and the "only disk type PVs support cross-region restore in current version" error is returned?

Symptom

The status of the restore task is Failed and the "only disk type PVs support cross-region restore in current version" error is returned.

Causes

In migrate-controller 1.7.7 and later versions, backups of disk volumes can be restored across regions. Backups of other volume types cannot be restored across regions.

Solutions

  • If you are using a storage service that supports public access, such as OSS, you can create a statically provisioned PV and PVC and then restore the application. For more information, see Mount a statically provisioned OSS volume.

  • If you want to restore backups of other volume types across regions, submit a ticket.

What do I do if the status of the restore task is Failed and the "ECS snapshot cross region request failed" error is returned?

Symptom

The status of the restore task is Failed and the "ECS snapshot cross region request failed" error is returned.

Causes

In migrate-controller 1.7.7 and later versions, backups of disk volumes can be restored across regions. However, your cluster is unauthorized to use ECS disk snapshots.

Solutions

If your cluster is an ACK dedicated cluster or a registered cluster that is connected to a self-managed Kubernetes cluster, you need to authorize the cluster to use ECS disk snapshots. For more information, see Grant permissions to a registered cluster.

What do I do if the status of the restore task is Completed but some resources are not created in the restore cluster?

Symptom

The status of the restore task is Completed but some resources are not created in the restore cluster.

Causes

  • Cause 1: No backups are created for the resources.

  • Cause 2: The resources are excluded from the restore list.

  • Cause 3: Some subtasks in the application restoration task failed.

  • Cause 4: The resources are restored but then reclaimed due to ownerReferences or business logic.

Solutions

Solution 1:

Run the following command to query backup details:

 kubectl -ncsdr exec -it $(kubectl -ncsdr get pod -l component=csdr | tail -n 1 | cut -d ' ' -f1) -- ./velero describe backup <backup-name> --details

Check whether backups are created for the resources. If no backups are created, make sure that the resources and the namespaces of the resources are specified in the include list of the backup task, or the resources and namespaces are not specified in the exclude list of the backup task. Then, rerun the backup task. Cluster-level pod resources are not backed up if the backup task is not configured to back up the namespace of the pods. To back up all cluster-level resources, see Create backups in the backup cluster.

Solution 2:

If the resources are not restored, make sure that the resources and the namespaces of the resources are specified in the include list of the restore task, or the resources and namespaces are not specified in the exclude list of the restore task. Then, rerun the restore task.

Solution 3:

Run the following command to query the resources and identify the cause:

 kubectl -ncsdr exec -it $(kubectl -ncsdr get pod -l component=csdr | tail -n 1 | cut -d ' ' -f1) -- ./velero describe restore <restore-name> 

Fix the issue based on the content of the Errors and Warnings fields in the output. If you cannot identify the cause, submit a ticket.

Solution 4:

Query the auditing records of the resources and check whether the resources are accidentally deleted.

What do I do if the migrate-controller component in a cluster that uses FlexVolume cannot be launched?

migrate-controller does not support clusters that use FlexVolume. To use the backup center feature, use one of the following methods to migrate from FlexVolume to CSI:

If you want to create backups in the FlexVolume cluster and restore the backups in the CSI cluster when migrating from FlexVolume to CSI, refer to Use the backup center to migrate applications from an ACK cluster that runs an earlier Kubernetes version.

Can I modify the backup vault?

You cannot modify the backup vault of the backup center. You can only delete the current one and create a backup vault with another name.

Backup vaults are shared resources. Existing backup vaults may be in the Backing Up or Restoring state. If you modify a parameter of the backup vault, the system may fail to find the required data when backing up or restoring an application. Therefore, you cannot modify the backup vault or create backup vaults that use the same name.

Can I associate the backup vault with an OSS bucket that is not named in the "cnfs-oss-*" format?

For clusters other than ACK dedicated clusters and registered clusters, the backup center component has read and write permissions on OSS buckets named in the cnfs-oss-* format by default. To prevent backups from overwriting the original data in the OSS buckets, we recommend that you create an OSS bucket named in the cnfs-oss-* format in the backup center.

  1. To associate an OSS bucket that is not named in the cnfs-oss-* format with the backup vault, you need to grant permissions to the backup center component. For more information, see ACK dedicated clusters.

  2. After you grant permissions, run the following commands to restart the component:

    kubectl -ncsdr delete pod -l control-plane=csdr-controller
    kubectl -ncsdr delete pod -l component=csdr

    If an OSS bucket that is not named in the cnfs-oss-* format is already associated with the backup vault, rerun the backup or restore task after the connectivity test is complete and the status of the backup vault changes to Available. The connectivity test is performed at intervals of 5 minutes. You can run the following command to query the status of the backup vault:

    kubectl -ncsdr get backuplocation

    Expected output:

    NAME                    PHASE       LAST VALIDATED   AGE
    a-test-backuplocation   Available   7s               6d1h
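To wait for the vault in a script rather than reading the table by eye, you could extract the PHASE column. The following is a minimal sketch. backuplocation_phase is a hypothetical helper that assumes the column layout of the expected output shown above.

```shell
# Hypothetical helper: prints the PHASE column for the backup vault named $1.
# stdin is expected to be the table printed by
# `kubectl -ncsdr get backuplocation`.
backuplocation_phase() {
  awk -v name="$1" 'NR > 1 && $1 == name { print $2 }'
}

# Example using the expected output shown above.
sample_table='NAME                    PHASE       LAST VALIDATED   AGE
a-test-backuplocation   Available   7s               6d1h'

printf '%s\n' "$sample_table" | backuplocation_phase a-test-backuplocation
```

Against a live cluster, the input would come from kubectl -ncsdr get backuplocation instead of the sample table.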

How do I specify the backup cycle when creating a backup plan?

You can specify the backup cycle by using a crontab expression, such as 1 4 * * *. You can also directly specify an interval. For example, if you set the backup cycle to 6h30m, the backup operation is performed every 6 hours and 30 minutes.

The asterisks (*) in the crontab expression represent any valid values of the corresponding fields. Valid values of the minute field are 0 to 59. Sample crontab expressions:

  • 1 4 * * *: The backup operation is performed at 4:01 am each day.

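  • 0 2 15 * 1: Under standard crontab semantics, when both the day of month and day of week fields are restricted, a match on either field triggers the job. The backup operation is therefore performed at 2:00 am on the 15th day of each month and at 2:00 am on every Monday.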

 *  *  *  *  * 
 |  |  |  |  |
 |  |  |  |  ·----- day of week (0 - 6) (Sun to Sat)
 |  |  |  ·-------- month (1 - 12) 
 |  |  ·----------- day of month (1 - 31)
 |  ·-------------- hour (0 - 23) 
 ·----------------- minute (0 - 59)  
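The two ways of specifying a backup cycle can be illustrated with a small sketch. Both helpers below are hypothetical and only mirror the field layout and the 6h30m example described above. They are not how the backup center parses the value. The interval helper handles only the h and m units.

```shell
# Hypothetical helper: labels the five fields of a crontab expression.
# Fails if the expression does not have exactly five fields.
describe_cron() {
  printf '%s\n' "$1" | awk '
    NF != 5 { exit 1 }
    { printf "minute=%s hour=%s day-of-month=%s month=%s day-of-week=%s\n", $1, $2, $3, $4, $5 }'
}

# Hypothetical helper: converts an interval such as 6h30m to minutes.
# Only the h and m units are handled in this sketch.
interval_to_minutes() {
  printf '%s\n' "$1" | awk '
    length($0) > 0 && /^([0-9]+h)?([0-9]+m)?$/ {
      h = 0; m = 0; s = $0
      if (match(s, /^[0-9]+h/)) { h = substr(s, 1, RLENGTH - 1); s = substr(s, RLENGTH + 1) }
      if (match(s, /^[0-9]+m/)) { m = substr(s, 1, RLENGTH - 1) }
      print h * 60 + m
      ok = 1
    }
    END { exit !ok }'
}

describe_cron "1 4 * * *"   # minute=1 hour=4 day-of-month=* month=* day-of-week=*
interval_to_minutes 6h30m   # 390
```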
 

What are the default changes in resource YAML files when I run a restore task?

When you restore resources, the following changes are made to the YAML files of resources:

Change 1:

If the size of a disk volume is less than 20 GiB, the volume size is changed to 20 GiB.
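As a quick illustration of Change 1, the following sketch applies the 20 GiB minimum to a requested size. restored_disk_size_gib is a hypothetical helper that takes a whole number of GiB. It only restates the rule above and does not query the cluster.

```shell
# Hypothetical helper: prints the size, in GiB, that a restored disk volume
# would request after the 20 GiB minimum described above is applied.
restored_disk_size_gib() {
  if [ "$1" -lt 20 ]; then
    echo 20
  else
    echo "$1"
  fi
}

restored_disk_size_gib 5     # prints 20
restored_disk_size_gib 100   # prints 100
```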

Change 2:

Services are restored based on Service types:

  • NodePort Services: The ports of NodePort Services are retained by default during cross-cluster restoration.

  • LoadBalancer Services: When externalTrafficPolicy is set to Local, healthCheckNodePort uses a random port by default. To retain the port, specify spec.preserveNodePorts: true when you create the restore task.

    • If a Service in the backup cluster uses an existing Server Load Balancer (SLB) instance, the Service restored in the restore cluster still uses the original SLB instance but has all listeners disabled by default. You need to configure the listeners in the SLB console.

    • LoadBalancer Services in the backup cluster are managed by the cloud controller manager (CCM). When the system restores these Services, the CCM will create SLB instances. For more information, see Considerations for configuring a LoadBalancer type Service.
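The spec.preserveNodePorts setting mentioned above could be placed in a restore manifest as sketched below. This is a partial fragment under stated assumptions: render_restore_fragment and the task name are hypothetical, the kind spelling ApplicationRestore and the apiVersion are assumed to match the other csdr resources in this topic, and the fields that identify the backup to restore are intentionally omitted.

```shell
# Prints a partial ApplicationRestore manifest that keeps node ports.
# $1 is the restore task name (hypothetical). The apiVersion and kind are
# assumptions. Fields that identify the backup to restore are omitted and
# must be added before the manifest is applied.
render_restore_fragment() {
  cat << EOF
apiVersion: csdr.alibabacloud.com/v1beta1
kind: ApplicationRestore
metadata:
  name: $1
  namespace: csdr
spec:
  preserveNodePorts: true
EOF
}

render_restore_fragment example-restore
```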

How do I view backup resources?

Application-related backup resources

The YAML files in the cluster are stored in the OSS bucket associated with the backup vault. You can use one of the following methods to view backup resources.

  • Run the following command in a cluster to which backup files are synchronized to view backup resources:

    kubectl -ncsdr get pod -l component=csdr | tail -n 1 | cut -d ' ' -f1
    kubectl -ncsdr exec -it csdr-velero-xxx -cvelero -- ./velero describe backup <backup-name> --details
  • View backup resources in the ACK console.

    1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

    2. On the Clusters page, find the cluster that you want to manage and click its name. In the left-side pane, choose Operations > Application Backup.

    3. On the Application Backup page, click the Backup Records tab. In the Backup Records column, click the backup record that you want to view.

Disk volume-related backup resources

  1. Log on to the ECS console.

  2. In the left-side navigation pane, choose Storage & Snapshots > Snapshots.

  3. In the top navigation bar, select the region and resource group to which the resource belongs.

  4. On the Snapshots page, query snapshots based on the disk ID.

Other backup resources

  1. Log on to the Cloud Backup console.

  2. In the left-side navigation pane, choose Backup > Container Backup.

  3. In the top navigation bar, select a region.

  4. View the basic information of cluster backups.

    • Clusters: The list of clusters that have been backed up and protected. Click ACK Cluster ID to view the protected persistent volume claims (PVCs). For more information about PVCs, see Persistent volume claim (PVC).

      If Client Status is abnormal, Cloud Backup is not running as expected in the ACK cluster. Go to the DaemonSets page in the ACK console to troubleshoot the issue.

    • Backup Jobs: The status of backup jobs.

If I back up data in a cluster that runs an earlier Kubernetes version, can I restore the data in a cluster that runs a later Kubernetes version?

Yes.

By default, when you back up resources, all API versions supported by the resources are backed up. For example, a Deployment in a cluster that runs Kubernetes 1.16 supports extensions/v1beta1, apps/v1beta1, apps/v1beta2, and apps/v1. When you back up the Deployment, the backup vault stores all four API versions regardless of which version you use when you create the Deployment. The KubernetesConvert feature is used for API version conversion.

When you restore resources, the API version recommended by the restore cluster is used to restore the resources. For example, if you restore the preceding Deployment in a cluster that runs Kubernetes 1.28 and the recommended API version is apps/v1, the restored Deployment uses apps/v1.

Important

If no API version is supported by both clusters, you must manually deploy the resource. For example, Ingresses in clusters that run Kubernetes 1.16 support extensions/v1beta1 and networking.k8s.io/v1beta1. You cannot restore these Ingresses in clusters that run Kubernetes 1.22 or later because Ingresses in these clusters support only networking.k8s.io/v1. For more information about API version migration, see the official Kubernetes documentation. Because of API version compatibility issues, we recommend that you do not use the backup center to migrate applications from clusters that run later Kubernetes versions to clusters that run earlier Kubernetes versions. We also recommend that you do not migrate applications from clusters that run Kubernetes versions earlier than 1.16 to clusters that run later Kubernetes versions.
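The compatibility rule can be checked by intersecting the API versions on each side. The following is a minimal sketch that uses the version lists from the examples above. common_api_versions is a hypothetical helper and does not reflect how the backup center selects a version internally.

```shell
# Hypothetical helper: prints every API version that appears in both
# space-separated lists. $1 holds the versions stored in the backup and
# $2 holds the versions served by the restore cluster.
common_api_versions() {
  for v in $1; do
    for w in $2; do
      [ "$v" = "$w" ] && printf '%s\n' "$v"
    done
  done
  return 0
}

# Deployment backed up on 1.16, restore cluster serves apps/v1: prints apps/v1.
common_api_versions "extensions/v1beta1 apps/v1beta1 apps/v1beta2 apps/v1" "apps/v1"

# Ingress backed up on 1.16, restore cluster serves networking.k8s.io/v1 only:
# prints nothing, so the resource must be deployed manually.
common_api_versions "extensions/v1beta1 networking.k8s.io/v1beta1" "networking.k8s.io/v1"
```

On a live restore cluster, kubectl api-versions lists the group/version pairs that the API server serves.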

Is traffic automatically switched to new SLB instances when I run a restore task?

No.

Services are restored based on Service types:

  • NodePort Services: The ports of NodePort Services are retained by default during cross-cluster restoration.

  • LoadBalancer Services: When externalTrafficPolicy is set to Local, healthCheckNodePort uses a random port by default. To retain the port, specify spec.preserveNodePorts: true when you create the restore task.

    • If a Service in the backup cluster uses an existing Server Load Balancer (SLB) instance, the Service restored in the restore cluster still uses the original SLB instance but has all listeners disabled by default. You need to configure the listeners in the SLB console.

    • LoadBalancer Services in the backup cluster are managed by the cloud controller manager (CCM). When the system restores these Services, the CCM will create SLB instances. For more information, see Considerations for configuring a LoadBalancer type Service.

By default, after listeners are disabled or new SLB instances are used, traffic is not automatically switched to the new SLB instances. If you use other cloud services or third-party service discovery and do not want service discovery to switch traffic to new SLB instances, you can exclude Services when you back up resources. You can manually deploy Services when you want to switch traffic.

Why are resources in the csdr, kube-system, kube-public, and kube-node-lease namespaces not backed up by default?

csdr is the namespace of the backup center. If you directly back up and restore the namespace, components fail to work in the restore cluster. In addition, the backup and synchronization logic of the backup center does not require you to manually migrate backups to a new cluster.

kube-system, kube-public, and kube-node-lease are the default system namespaces of Kubernetes clusters. Due to the differences in cluster parameters and configurations, you cannot restore the namespaces across clusters. The backup center is used to back up and restore applications. Before you run a restore task, you must install and configure system components in the restore cluster. For example, the following add-ons are automatically installed when the system creates a cluster:

  • Container Registry password-free image pulling component: You need to grant permissions to and configure acr-configuration in the restore cluster.

  • ALB Ingresses: You need to configure ALBConfigs.

You cannot directly restore kube-system components in the new cluster. Otherwise, the system components cannot work as expected.

Does the backup center use ECS disk snapshots to back up disks? What is the default snapshot type?

In the following scenarios, the backup center uses ECS disk snapshots to back up disks.

  1. The cluster is an ACK managed cluster or ACK dedicated cluster.

  2. The Kubernetes version of the cluster is 1.18 or later, and the cluster CSI plug-in version is 1.18 or later.

In other scenarios, the backup center uses Cloud Backup to back up disks.
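The selection rule can be sketched as a small decision function. This sketch assumes that both conditions in the list above must hold at the same time, which the text does not state explicitly. The cluster type labels managed and dedicated, and both function names, are hypothetical.

```shell
# Succeeds if version $1 is greater than or equal to version $2.
# Only the major and minor parts are compared; a leading "v" is ignored.
version_ge() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    sub(/^v/, "", a); sub(/^v/, "", b)
    split(a, x, "."); split(b, y, ".")
    if (x[1] + 0 != y[1] + 0) exit !(x[1] + 0 > y[1] + 0)
    exit !(x[2] + 0 >= y[2] + 0)
  }'
}

# Hypothetical helper: prints the disk backup method for a cluster.
# $1 = cluster type label, $2 = Kubernetes version, $3 = CSI plug-in version.
# Assumption: both conditions listed above must be met for ECS snapshots.
disk_backup_method() {
  case "$1" in
    managed|dedicated) ;;
    *) echo cloud-backup; return ;;
  esac
  if version_ge "$2" 1.18 && version_ge "$3" 1.18; then
    echo ecs-snapshot
  else
    echo cloud-backup
  fi
}

disk_backup_method managed 1.28 v1.26.5      # prints ecs-snapshot
disk_backup_method registered 1.28 v1.26.5   # prints cloud-backup
```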

By default, the instant access feature is enabled for disk snapshots created by the backup center. The validity period of the snapshots is the same as the validity period specified in the backup configuration. Starting 11:00 (UTC+8) on October 12, 2023, you are no longer charged storage fees and feature usage fees for the instant access feature. For more information, see Use the instant access feature.

Why is the validity period of a disk snapshot created by the backup center different from the validity period specified in the backup configuration?

The creation of disk snapshots depends on the csi-provisioner component or managed-csiprovisioner component of a cluster. If the version of the csi-provisioner component is earlier than 1.20.6, you cannot specify the validity period or enable the instant access feature when you create VolumeSnapshots. In this case, the validity period in the backup configuration does not take effect on disk snapshots.

Therefore, when you back up disk volumes, you need to update the csi-provisioner component to 1.20.6 or later.

If csi-provisioner cannot be updated to this version, you can configure the default snapshot validity period in the following ways:

  1. Update the backup center component migrate-controller to v1.7.10 or later.

  2. Run the following command to check whether a VolumeSnapshotClass whose retentionDays is 30 exists in the cluster.

    kubectl get volumesnapshotclass csdr-disk-snapshot-with-default-ttl
    • If the VolumeSnapshotClass does not exist, use the following YAML to create a VolumeSnapshotClass named csdr-disk-snapshot-with-default-ttl.

    • If the VolumeSnapshotClass exists, set retentionDays to 30.

      apiVersion: snapshot.storage.k8s.io/v1
      deletionPolicy: Retain
      driver: diskplugin.csi.alibabacloud.com
      kind: VolumeSnapshotClass
      metadata:
        name: csdr-disk-snapshot-with-default-ttl
      parameters:
        retentionDays: "30"
  3. After the configuration is complete, when you back up disk volumes, disk snapshots whose validity period is the same as the value of retentionDays are created.

    Important

    To ensure that the validity period of ECS disk snapshots created by the backup center is the same as the validity period specified in the backup configuration, we recommend that you update the csi-provisioner component to v1.20.6 or later.

What scenarios are suitable for backing up volumes and what do I do if I want to back up volumes?

Volume backup

You can use ECS disk snapshots or Cloud Backup to back up data stored in volumes to datastores in the cloud. Then, you can restore the data from the backup files to disks or NAS file systems used by your application. The original application and restored application do not share the data source.

If you do not need a separate copy of the data, or if the original and restored applications are meant to share the data source, you can skip the volume backup step. Make sure that the exclude list of the backup task does not include PVCs or PVs. When you restore the application, directly deploy the YAML file of the original volume in the restore cluster.

Important

If the backup and restore clusters use different volume plug-ins, you cannot use the YAML file of the original volume. You must migrate from FlexVolume to CSI. For more information, see Use the backup center to migrate applications in an ACK cluster that runs an old Kubernetes version.

What scenarios are suitable for backing up volumes?

  • Disaster recovery and version recording.

  • Only disk volumes are used. Each basic disk can be mounted to only one node. If you use the YAML file of the original disk volume, the disk volume may be unmounted from the original node.

  • Cross-region backup and restoration. In most cases, only OSS supports inter-region communication.

  • The data of the application in the backup cluster must be isolated from the data of the application in the restore cluster.

  • The backup and restore clusters use different volume plug-ins or the version difference is great. In this case, you cannot directly use the YAML file of the original volume.

What do I do if I want to back up volumes?

  • When you create a backup task in the console, select Volume Backup.

  • When you use kubectl to create a backup task, set spec.pvBackup.defaultPvBackup to true.
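For the kubectl method, the setting sits under spec.pvBackup. The following sketch prints a partial manifest that contains only the fields named in this topic. render_backup_fragment and the task name are hypothetical, and the kind spelling ApplicationBackup and the apiVersion are assumed to match the other csdr resources. Fields that select namespaces and the backup vault are intentionally omitted.

```shell
# Prints a partial ApplicationBackup manifest with volume backup enabled.
# $1 is the backup task name (hypothetical). The apiVersion and kind are
# assumptions. Fields that select namespaces and the backup vault are
# omitted and must be added before the manifest is applied.
render_backup_fragment() {
  cat << EOF
apiVersion: csdr.alibabacloud.com/v1beta1
kind: ApplicationBackup
metadata:
  name: $1
  namespace: csdr
spec:
  pvBackup:
    defaultPvBackup: true
EOF
}

render_backup_fragment example-backup
```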

In what scenarios are application backup and data protection applicable respectively?

Application backup:

  • The backup target can be any resource running in the cluster, such as applications, services, or configuration files.

  • You can also select to back up data from volumes mounted by applications.

    Note

    Data in volumes that are not mounted to any pod is not backed up.

    To back up both the application and all associated data from volumes, we recommend that you create backups by using the data protection type.

  • Cluster migration and rapid recovery of applications in disaster recovery scenarios.

Data Protection (New):

  • The backup target is data in volumes. Only persistent volume claims (PVCs) and persistent volumes (PVs) are included as resources.

  • The recovery target is a PVC that can be directly mounted by applications. The data that the restored PVC points to is independent of the backup data. If the PVC is accidentally deleted, restoring from the backup center creates a new disk that contains the data as it was backed up. All mounting parameters of the PVC remain unchanged except for the disk instance that it points to. This allows applications to directly mount the restored PVC.

  • Data replication and data disaster recovery scenarios.