Container Service for Kubernetes:FAQ about the backup center

Last Updated:Nov 21, 2023

This topic provides answers to some frequently asked questions about the backup center.

Note

If you want to use the CLI to access the backup center, we recommend that you update the backup component migrate-controller to the latest version before you perform troubleshooting. The update does not affect the existing backups.

What do I do if the migrate-controller component in a cluster that uses FlexVolume cannot be launched?

The migrate-controller component does not support clusters that use FlexVolume. If you want to use the backup center feature, you must switch the storage plug-in of the cluster from FlexVolume to Container Storage Interface (CSI).

What do I do if the status of the backup, restore, or snapshot conversion task is Failed, and the VaultError: backup vault is unavailable: xxx error is displayed?

Issue

The status of the backup, restore, or snapshot conversion task is Failed, and the VaultError: backup vault is unavailable: xxx error is displayed.

Cause

  • The specified Object Storage Service (OSS) bucket does not exist.

  • The cluster does not have permissions to access OSS.

  • The network of the OSS bucket is unreachable.

Solution

  1. Log on to the OSS console. Check whether the OSS bucket that is associated with the backup vault exists.

    If the OSS bucket does not exist, create one and associate it with the backup vault. For more information, see Create buckets.

  2. Check whether the cluster has permissions to access OSS.

    • Container Service for Kubernetes (ACK) Pro clusters: No OSS permissions are required. Make sure that the OSS bucket that is associated with the backup vault is named in the cnfs-oss-** format.

    • ACK dedicated clusters and registered clusters: OSS permissions are required. For more information, see Install migrate-controller and grant permissions.

    ACK managed clusters of earlier versions may not have permissions to access OSS. You can run the following command to check whether a cluster has permissions to access OSS:

    kubectl get secret -n kube-system | grep addon.csi.token

    Expected output:

    addon.csi.token          Opaque                      1      62d

    If the returned content is the same as the preceding expected output, the cluster has permissions to access OSS. You only need to specify an OSS bucket that is named in the cnfs-oss-* format for the cluster.

    If the returned content is different from the preceding expected output, the cluster does not have permissions to access OSS. You must grant the cluster permissions to access OSS. For more information, see Install migrate-controller and grant permissions.

    Note

    You cannot create a backup vault that uses the same name as a deleted one. You cannot associate a backup vault with an OSS bucket that is not named in the cnfs-oss-** format. If your backup vault is already associated with an OSS bucket that is not named in the cnfs-oss-** format, create another backup vault that uses a different name and associate the backup vault with an OSS bucket whose name meets the requirement.

  3. Run the following command to check the network configuration of the cluster:

    kubectl get backuplocation <backuplocation-name> -n csdr -o yaml | grep network

    Expected output:

    network: internal
    • If the value of the network field in the output is internal, the backup vault accesses the OSS bucket over an internal network.

    • If the value of the network field in the output is public, the backup vault accesses the OSS bucket over the Internet.

    In the following scenarios, you must configure the backup vault to access the OSS bucket over the Internet:

    • The cluster and OSS bucket are deployed in different regions.

    • The cluster is an ACK Edge cluster.

    • The cluster is a registered cluster and is not connected to a virtual private cloud (VPC) through Cloud Enterprise Network (CEN), Express Connect, or VPN connections, or the cluster is a registered cluster connected to a VPC but no route points to the internal network of the region where the OSS bucket resides. In this case, you must configure a route that points to the internal network of the region where the OSS bucket resides.

    To configure the cluster to access the OSS bucket over the Internet, run the following command to enable Internet access for the OSS bucket. Replace <backuplocation-name> with the actual backup vault name and <region-id> with the region ID of the OSS bucket, such as cn-hangzhou.

    kubectl patch -ncsdr backuplocation/<backuplocation-name> --type='json' -p   '[{"op":"add","path":"/spec/config","value":{"network":"public","region":"<region-id>"}}]'
    kubectl patch -ncsdr backupstoragelocation/<backuplocation-name> --type='json' -p   '[{"op":"add","path":"/spec/config","value":{"network":"public","region":"<region-id>"}}]'
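
    After you apply the patches, run the check from the beginning of this step again to confirm the change. The following is a minimal verification sketch:

    kubectl get backuplocation <backuplocation-name> -n csdr -o yaml | grep network

    Expected output:

    network: public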

What do I do if the status of the backup, restore, or snapshot conversion task is Failed, and the backup location is not ok, please maybe check access oss error is returned?

Issue

The status of the backup, restore, or snapshot conversion task is Failed and the backup location is not ok, please maybe check access oss error is returned.

Cause

Kubernetes versions earlier than 1.20

The OSS subdirectory that is associated with the backup vault is a parent or child directory of the OSS subdirectory that is associated with another backup vault, or the OSS directory contains data that was not generated by the backup center.

Kubernetes 1.20 and later

The version of migrate-controller is outdated.

Solution

Kubernetes versions earlier than 1.20

  • The OSS subdirectory that is associated with a backup vault cannot be a parent or child directory of the OSS subdirectory that is associated with another backup vault. In addition, the OSS subdirectories that are associated with backup vaults can store only backups generated by the backup center. Run the following command to check the data in the OSS directories. Replace <backuplocation-name> with the actual backup vault name.

    kubectl describe backupstoragelocation <backuplocation-name> -n csdr | grep message

    Expected output:

    Backup store contains invalid top-level directories: ****

    The output indicates that other data is stored in the OSS directories associated with the backup vault. Solutions:

    • Update the Kubernetes version of the cluster to Kubernetes 1.20 or later and update migrate-controller to the latest version.

    • Create a new backup vault that uses a different name and is not associated with an OSS subdirectory. Do not delete the existing backup vault and then create another backup vault with the same name.

Kubernetes 1.20 and later

Update migrate-controller to the latest version. For more information, see Manage components.
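
Before you update, you can check the image versions that the backup components currently use. The following command is a minimal sketch and assumes that the component Deployments, such as csdr-controller and csdr-velero, run in the csdr namespace as described in this topic:

kubectl -n csdr get deploy -o custom-columns=NAME:.metadata.name,IMAGE:.spec.template.spec.containers[*].image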

What do I do if the backup, restore, or snapshot conversion task remains in the Inprogress state for a long period of time?

Cause 1: The components in the csdr namespace cannot run as expected

Check the status of the components and identify the cause of the anomaly.

  1. Run the following command to check whether the components in the csdr namespace are restarted or cannot be launched:

    kubectl get pod -n csdr
  2. Run the following command to identify the cause of the restart or launch failure.

    kubectl describe pod <pod-name> -n csdr
  • If the components are restarted due to an out of memory (OOM) error, perform the following steps:

    Run the following command to increase the memory limit of the Deployment. Set <deploy-name> to csdr-controller for csdr-controller-*** pods and to csdr-velero for csdr-velero-*** pods. Replace <container-name> with the name of the container in the Deployment and <new-limit-memory> with the new memory limit, such as 1Gi.

    kubectl -n csdr patch deploy <deploy-name> -p '{"spec":{"template":{"spec":{"containers":[{"name":"<container-name>","resources":{"limits":{"memory":"<new-limit-memory>"}}}]}}}}'
  • If the components cannot be launched due to insufficient Hybrid Backup Recovery (HBR) permissions, perform the following steps:

    1. Check whether HBR is activated for the cluster.

      • If HBR is not activated, activate the service. For more information, see HBR.

      • If HBR is activated, proceed with the next step.

    2. Check whether the ACK Pro cluster or registered cluster has HBR permissions.

    3. Run the following command to check whether the token required by the HBR client exists.

      kubectl describe pod <hbr-client-***> -n csdr

      If a couldn't find key HBR_TOKEN event is generated, the token does not exist. Perform the following steps to resolve the issue:

      1. Run the following command to query the node that hosts hbr-client-***:

        kubectl get pod <hbr-client-***> -n csdr -owide
      2. Run the following command to change the csdr.alibabacloud.com/agent-enable label on the node from true to false:

        kubectl label node <node-name> csdr.alibabacloud.com/agent-enable=false --overwrite
        Important
        • When the system reruns the backup or restore task, the system automatically creates a token and launches hbr-client.

        • You cannot launch hbr-client by copying a token from another cluster to the current cluster. You need to delete the copied token and the corresponding hbr-client-*** pod and repeat the preceding steps.
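
        After the backup or restore task is rerun, you can check whether the token and the hbr-client-*** pod are re-created. A minimal check (a sketch):

        kubectl -n csdr get pod | grep hbr-client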

Cause 2: No permissions are granted to use disk snapshots in disk backup scenarios

If you back up the disk volume that is mounted to your application but the backup task remains in the Inprogress state for a long period of time, run the following command to query the newly created VolumeSnapshots in the cluster:

kubectl get volumesnapshot -n <backup-namespace>

Expected output:

NAME                    READYTOUSE      SOURCEPVC         SOURCESNAPSHOTCONTENT         ...
<volumesnapshot-name>   true                              <volumesnapshotcontent-name>  ...

If the READYTOUSE state of all VolumeSnapshots remains false for a long period of time, perform the following steps:

  1. Log on to the Elastic Compute Service (ECS) console and check whether the disk snapshot feature is enabled.

    • If the feature is disabled, enable the feature in the corresponding region. For more information, see Activate ECS Snapshot.

    • If the feature is enabled, proceed with the next step.

  2. Check whether the permissions to use disk snapshots are granted.

    1. Log on to the ACK console and click Clusters in the left-side navigation pane.

    2. On the Clusters page, click the name of the cluster that you want to manage and click Cluster Information in the left-side navigation pane.

    3. On the Cluster Information page, click the Cluster Resources tab and click the hyperlink to the right of Master RAM Role to go to the permission management page.

    4. On the Policies page, check whether the permissions to use disk snapshots are granted.

      • If the k8sMasterRolePolicy-Csi-*** policy exists and it includes the snapshot-related permissions, such as ecs:CreateSnapshot and ecs:DeleteSnapshot, the required permissions are granted. In this case, submit a ticket.

      • If the k8sMasterRolePolicy-Csi-*** policy does not exist, attach the following policy to the master RAM role to grant the permissions to use disk snapshots. For more information, see Create custom policies and Grant permissions to a RAM role.

        {
            "Version": "1",
            "Statement": [
                {
                    "Action": [
                        "ecs:DescribeDisks",
                        "ecs:DescribeInstances",
                        "ecs:DescribeAvailableResource",
                        "ecs:DescribeInstanceTypes",
                        "nas:DescribeFileSystems",
                        "ecs:AttachDisk",
                        "ecs:CreateDisk",
                        "ecs:CreateSnapshot",
                        "ecs:DeleteDisk",
                        "ecs:DeleteSnapshot",
                        "ecs:DetachDisk"
                    ],
                    "Resource": [
                        "*"
                    ],
                    "Effect": "Allow"
                }
            ]
        }
    5. If the issue persists after you perform the preceding steps, submit a ticket.

What do I do if the console displays the following error: Failed to retrieve the data. Refresh and try again. 404 page not found?

Issue

The console displays the following error: Failed to retrieve the data. Refresh and try again. 404 page not found.

Cause

The relevant custom resource definitions (CRDs) fail to be deployed.

Solution
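
The following check helps confirm whether the CRDs exist. It is a minimal sketch; the backup center CRDs, such as deleterequests.csdr.alibabacloud.com that is mentioned later in this topic, belong to the csdr.alibabacloud.com API group:

kubectl get crd | grep csdr.alibabacloud.com

If no CRDs are returned, updating the migrate-controller component to the latest version typically redeploys them. For more information, see the Note at the beginning of this topic.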

What do I do if the console displays the following error: The name is already used. Change the name and try again?

Issue

When a backup, restore, or snapshot conversion task is created or deleted, the console displays the following error: The name is already used. Change the name and try again.

Cause

When you delete a task in the console, a deleterequest resource is created in the cluster. The corresponding component then performs multiple deletion operations, including deleting the backup resources. For more information about how to use kubectl to perform these operations, see Use kubectl to back up and restore data.

If an error occurs during the deletion process or while the deleterequest resource is being processed, some resources in the cluster cannot be deleted. As a result, the console displays an error message indicating that resources with the same name already exist.

Solution

  • Delete the resources with the same name as prompted. For example, if the error deleterequests.csdr.alibabacloud.com "xxxxx-dbr" already exists occurs, you can run the following command to delete the resources with the same name:

    kubectl -n csdr delete deleterequests xxxxx-dbr
  • Create a task with a different name.

What do I do if the system prompts that no backup file can be selected when the system initializes the backup vault to restore an application across clusters?

Issue

The system prompts that no backup file can be selected when the system initializes the backup vault to restore an application across clusters.

Cause

The backup vault that you selected is not associated with your cluster. The system initializes the backup vault and synchronizes the basic information about the backup vault, including the OSS bucket information, to the cluster. Then, the system initializes the backup files from the backup vault in the cluster. You can select a backup file from the backup vault to restore the application only after the backup vault is initialized.

Solution

In the Create Restoration Task panel, click Initialize Backup Vault to the right of Backup Vaults, wait until the backup vault is initialized, and then select a backup file.

How do I modify a backup vault?

You cannot modify backup vaults in the backup center. If you want to modify a backup vault, delete the current backup vault and create a new one.

Backup vaults are shared resources. Existing backup vaults may be in the Backing Up or Restoring state. If you modify a parameter of the backup vault, the system may fail to find the required data when backing up or restoring an application. Therefore, backup vaults cannot be modified.

Important
  • If a backup vault has never been used before, you can delete the backup vault and then create a new one with the same name.

  • If the backup vault has been used to back up or restore data, do not delete it and create a new backup vault with the same name. This restriction ensures that existing backup and restore tasks can be completed without errors.

What do I do if the status of the backup task is Failed and the "PROCESS velero failed err: VeleroError: xxx" error is returned?

Issue

The backup task failed and the "PROCESS velero failed err: VeleroError: xxx" error is returned.

Cause

During the backup process, the csdr-velero-**** pod in the csdr namespace encounters an error, such as an OOMKilled error. If you back up a large number of resource objects, csdr-velero may reach its memory limit, which triggers the OOMKilled error.
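
You can confirm whether the csdr-velero pod was terminated because of an OOM error. A minimal check (a sketch):

kubectl -n csdr get pod | grep csdr-velero
kubectl -n csdr describe pod <csdr-velero-pod-name> | grep -A 5 "Last State"

If the output contains Reason: OOMKilled, increase the memory limit of the Deployment as described in the FAQ that is referenced in the following solution.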

Solution

For more information about how to resolve the issue, see What do I do if the backup, restore, or snapshot conversion task remains in the Inprogress state for a long period of time?

What do I do if the status of the restore task is Completed but some resources are not created in the cluster?

Issue

The status of the restore task is Completed but some resources are not created in the cluster.

Cause

  • The resource is not backed up.

  • The cluster in which the resource is to be restored contains a resource with the same name. As a result, the restore task skipped the resource.

  • Some restore tasks failed due to port conflicts or the underlying cloud resource dependencies.

Solution

  1. Run the following command to check whether the resource is backed up:

    In the following command, <backup> specifies the name of the backup task, <ns> specifies the namespace where the resource resides, and <resource-name> specifies the name of the resource.

    kubectl -n csdr describe configmap <backup> | grep "<ns>/<resource-name>"

    If no output is returned, the resource is not backed up. You can back up the resource and restore it. You can also manually deploy the resource.

    Reasons why the resource is not backed up:

    • The resource is not created or is in an abnormal state when you perform the backup operation.

    • The resource is in the namespace or resource exclusion list that you configured, or the resource is not included in the namespace, resource, or resource label inclusion list that you configured.

    • The resource is a cluster scoped resource that is not used by any application.

  2. Check whether a resource with the same name already exists in the cluster. If you want to overwrite the resource, you can delete it and then restore it.

  3. Run the following command to query the name of the Velero component:

    kubectl -n csdr get pod | grep csdr-velero | awk '{print $1}'

    Expected output:

    csdr-velero-75996bbdb8-gddng
  4. Run the following command to query the applications that failed to be restored and the cause:

    In the following command, <restore> specifies the name of the restore task.

    kubectl -n csdr exec -it csdr-velero-75996bbdb8-gddng -- /velero describe restore <restore>

    Expected output:

    ...
    Warnings:
      Velero:     <none>
      Cluster:  could not restore, CustomResourceDefinition "volumesnapshots.snapshot.storage.k8s.io" already exists. Warning: the in-cluster version is different than the backed-up version.
      Namespaces: <none>
    
    Errors:
      Velero:     <none>
      Cluster:    <none>
      Namespaces:
        <ns>:  error restoring services/<ns>/<service>: Internal error occurred: failed to allocate requested HealthCheck NodePort 32578: provided port is already allocated
        ...
    ...

    <ns> indicates the namespace of the resource to be restored. In this example, the type of the resource that failed to be restored is services. <service> indicates the name of the resource.

You can troubleshoot this issue based on the cause. The cause of failure in this example is Internal error occurred: failed to allocate requested HealthCheck NodePort 32578: provided port is already allocated. The health check port is automatically changed if the cluster that runs the backup task and the cluster that runs the restore task are the same. If the two clusters are different, the health check port remains unchanged.
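
To find the Service in the restore cluster that already occupies the conflicting health check port, you can search the healthCheckNodePort field of all Services. The following command is a minimal sketch that uses the port 32578 from the preceding example:

kubectl get svc -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.spec.healthCheckNodePort}{"\n"}{end}' | grep 32578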

Important

If the Service that is backed up is a LoadBalancer Service created based on an existing Server Load Balancer (SLB) instance, the SLB instance will be reused after the Service is restored and the Overwrite Existing Listeners feature will be disabled. For more information, see Use an existing SLB instance to expose an application.