This topic provides answers to some frequently asked questions about the backup center.
Table of contents
Common operations
If you use the backup center with kubectl, upgrade the migrate-controller component to the latest version before you troubleshoot issues. The upgrade does not affect existing backups. For more information about how to upgrade the component, see Manage components.
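For example, you can check the version of the component that is currently installed before you upgrade it. The following command is a simple check that assumes the component runs as the migrate-controller Deployment in the kube-system namespace, as referenced later in this topic:
kubectl -n kube-system get deploy migrate-controller -o jsonpath='{.spec.template.spec.containers[0].image}'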
When the status of a backup task, StorageClass conversion task, or restore task is Failed or PartiallyFailed, you can obtain error messages by using the following methods:
Move the pointer over Failed or PartiallyFailed in the Status column to view a brief error message, such as RestoreError: snapshot cross region request failed.
To obtain more detailed error messages, run the following commands to query the events of the task. The events contain messages such as RestoreError: process advancedvolumesnapshot failed avs: snapshot-hz, err: transition canceled with error: the ECS-snapshot related ram policy is missing.
Backup task:
kubectl -n csdr describe applicationbackup <backup-name>
StorageClass conversion task:
kubectl -n csdr describe converttosnapshot <backup-name>
Restore task:
kubectl -n csdr describe applicationrestore <restore-name>
The console displays "The working component is abnormal" or "Failed to fetch current data"
Issue
The console displays "The working component is abnormal" or "Failed to fetch current data".
Cause
The installation of the backup center component is abnormal.
Solution
Check whether nodes that belong to the cluster exist. If nodes that belong to the cluster do not exist, the backup center cannot be deployed.
Check whether the cluster uses FlexVolume. If the cluster uses FlexVolume, switch to CSI. For more information, see The migrate-controller component in a cluster that uses FlexVolume cannot be launched.
If you use the backup center with kubectl, check whether the YAML configurations are correct. For more information, see Use kubectl to back up and restore applications.
If your cluster is an ACK dedicated cluster or a registered cluster, check whether the required permissions are granted. For more information, see ACK dedicated cluster and Registered cluster.
Check whether the csdr-controller and csdr-velero Deployments in the csdr namespace fail to be deployed due to resource or scheduling limits. If yes, fix the issue.
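For example, the following commands are one way to check whether the two Deployments are available and to inspect a pod that fails to be scheduled or started. The pod name is a placeholder:
kubectl -n csdr get deploy csdr-controller csdr-velero
kubectl -n csdr get pod
kubectl -n csdr describe pod <abnormal-pod-name>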
The console displays the following error: The name is already used. Change the name and try again
Issue
When you create or delete a backup task, StorageClass conversion task, or restore task, the console displays "The name is already used. Change the name and try again."
Cause
When you delete a task in the console, a deleterequest resource is created in the cluster. The working component performs a series of deletion operations, not just deleting the corresponding backup resource. The same applies to command line operations. For more information, see Use kubectl to back up and restore applications.
If the deletion operation is incorrect or an error occurs during the processing of the deleterequest resource, some resources in the cluster cannot be deleted. In this case, the error message that indicates the existence of resources with the same name is returned.
Solution
Delete the resources with the same name as prompted. For example, if the error message deleterequests.csdr.alibabacloud.com "xxxxx-dbr" already exists is returned, run the following command to delete the resource (you can verify the deletion with the command shown after these steps):
kubectl -n csdr delete deleterequests xxxxx-dbr
Create a task with a new name.
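To verify that the conflicting resource has been deleted before you create the new task, you can list the remaining deleterequest resources:
kubectl -n csdr get deleterequests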
I cannot select an existing backup when I restore an application across clusters
Issue
I cannot select a backup task when I restore an application across clusters.
Cause
Cause 1: The backup vault is not associated with the current cluster, which means that the backup vault is not initialized.
The system initializes the backup vault and synchronizes the basic information about the backup vault, including the Object Storage Service (OSS) bucket information, to the cluster. Then, the system initializes the backup files from the backup vault in the cluster. You can select a backup file from the backup vault for restoration only after the initialization is complete.
Cause 2: The initialization of the backup vault fails. The status of the backuplocation resource in the current cluster is Unavailable.
Cause 3: The backup task is not complete or the backup task fails.
Solution
Solution 1:
On the Create Restore Task page, click Initialize Vault on the right side of Backup Vault. After the backup vault is initialized, select the task that you want to restore.
Solution 2:
Run the following command to check the status of the backuplocation resource:
kubectl get -n csdr backuplocation <backuplocation-name>
Expected result:
NAME PHASE LAST VALIDATED AGE
<backuplocation-name> Available 3m36s 38m
If the status is Unavailable, see the solution in The status of the task is Failed and the "VaultError: xxx" error is returned.
Solution 3:
In the console of the backup cluster, check whether the backup task is successful, which means that the status of the backup task is Completed. If the status of the backup task is abnormal, troubleshoot the issue. For more information, see Table of contents.
The console displays "The service role required by the current component has not been authorized"
Issue
When you access the application backup console, the console displays "The service role required by the current component has not been authorized" and the error code AddonRoleNotAuthorized is returned.
Cause
The cloud resource authentication logic of the migrate-controller component in ACK managed clusters is optimized in migrate-controller 1.8.0. When you install or upgrade the component to this version for the first time, the Alibaba Cloud account must complete cloud resource authorization.
Solution
If you are logged on with an Alibaba Cloud account, click Authorize to complete the authorization.
If you are logged on with a RAM user, click Copy Authorization Link and send the link to the Alibaba Cloud account to complete the authorization.
The console displays "The current account has not been granted the cluster RBAC permissions required for this operation"
Issue
When you access the application backup console, the console displays "The current account has not been granted the cluster RBAC permissions required for this operation. Contact the primary account or permission administrator for authorization." The error code is APISERVER.403.
Cause
The console interacts with the API server to submit backup and restore tasks and obtain real-time task status. The default permission list for cluster O&M personnel and developers lacks some permissions required by the backup center component. The primary account or permission administrator needs to grant these permissions.
Solution
Refer to Use custom RBAC roles to restrict resource operations in a cluster and grant the following ClusterRole permissions to backup center operators:
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: csdr-console
rules:
- apiGroups: ["csdr.alibabacloud.com","velero.io"]
  resources: ['*']
  verbs: ["get","create","delete","update","patch","watch","list","deletecollection"]
- apiGroups: [""]
  resources: ["namespaces"]
  verbs: ["get","list"]
- apiGroups: ["storage.k8s.io"]
  resources: ["storageclasses"]
  verbs: ["get","list"]
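The preceding ClusterRole only defines the permissions. If you manage RBAC with kubectl instead of the console, you also need to bind the ClusterRole to the operator. The following ClusterRoleBinding is a minimal sketch; the subject name 26**** is a placeholder for the RAM user or identity that is actually used in your cluster:
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: csdr-console-binding
subjects:
- kind: User
  apiGroup: rbac.authorization.k8s.io
  name: "26****"   # placeholder: RAM user ID or other subject
roleRef:
  kind: ClusterRole
  apiGroup: rbac.authorization.k8s.io
  name: csdr-console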
The backup center component fails to be upgraded or uninstalled
Issue
The backup center component fails to be upgraded or uninstalled, and the csdr namespace remains in the Terminating state.
Cause
The backup center component exits abnormally during operation, leaving tasks in the InProgress state in the csdr namespace. The finalizers field of these tasks may prevent resources from being deleted smoothly, causing the csdr namespace to remain in the Terminating state.
Solution
Run the following command to check why the csdr namespace is in the Terminating state:
kubectl describe ns csdr
Confirm that the stuck tasks are no longer needed and delete their corresponding finalizers. A command sketch is provided at the end of this procedure.
After you confirm that the csdr namespace is deleted:
For component upgrade scenarios, you can reinstall the migrate-controller component of the backup center.
For component uninstallation scenarios, the component should already be uninstalled.
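The following commands are a minimal sketch of how to clear the finalizers of a stuck task. The example assumes the stuck resource is an applicationbackup; the same pattern applies to applicationrestore and converttosnapshot resources. Only clear finalizers after you confirm that the task is no longer needed:
kubectl -n csdr get applicationbackup
kubectl -n csdr patch applicationbackup <task-name> --type merge -p '{"metadata":{"finalizers":null}}'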
The status of the task is failed and the "internal error" error is returned
Issue
The status of the task is Failed and the "internal error" error is returned.
Cause
The component or underlying cloud service encounters an unexpected exception, such as when the cloud service is not available in the current region.
Solution
If the error message is "HBR backup/restore internal error", check whether the container backup feature is available in the Cloud Backup console.
For other errors of this type, submit a ticket for processing.
The status of the task is failed and the "create cluster resources timeout" error is returned
Issue
The status of the task is Failed and the "create cluster resources timeout" error is returned.
Cause
During StorageClass conversion or restoration, temporary pods, persistent volume claims (PVCs), and persistent volumes (PVs) may be created. If these resources remain unavailable for a long time after they are created, the "create cluster resources timeout" error is returned.
Solution
Run the following command to locate the abnormal resource and find the cause based on the events:
kubectl -n csdr describe <applicationbackup/converttosnapshot/applicationrestore> <task-name>
Expected result:
……wait for created tmp pvc default/demo-pvc-for-convert202311151045 for convertion bound time out
This indicates that the PVC used for StorageClass conversion remains unbound for a long time. The PVC is in the default namespace and is named demo-pvc-for-convert202311151045.
Run the following command to check the status of the PVC and identify the cause of the issue:
kubectl -n default describe pvc demo-pvc-for-convert202311151045
The following are common causes of issues in the backup center. For more information, see Storage troubleshooting.
The cluster or node resources are insufficient or abnormal.
The restore cluster does not have the corresponding StorageClass. Use the StorageClass conversion feature to select an existing StorageClass in the restore cluster and then restore the application.
The underlying storage associated with the StorageClass is unavailable. For example, the specified disk type is not supported in the current zone.
The Container Network File System (CNFS) associated with alibabacloud-cnfs-nas is abnormal. For more information, see Use CNFS to manage NAS file systems (recommended).
When you restore applications in a multi-zone cluster, you select a StorageClass whose volumeBindingMode is set to Immediate.
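You can list the StorageClasses in the restore cluster together with their provisioner and volumeBindingMode to check the last two items in the preceding list:
kubectl get sc -o custom-columns=NAME:.metadata.name,PROVISIONER:.provisioner,BINDINGMODE:.volumeBindingMode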
The status of the task is failed and the "addon status is abnormal" error is returned
Issue
The status of the task is Failed and the "addon status is abnormal" error is returned.
Cause
The components in the csdr namespace are abnormal.
Solution
See Cause 1 and solution: The components in the csdr namespace are abnormal.
The status of the task is failed and the "VaultError: xxx" error is returned
Issue
The status of the backup, restore, or StorageClass conversion task is Failed and the error message VaultError: backup vault is unavailable: xxx is returned.
Cause
The specified OSS bucket does not exist.
The cluster does not have the permissions to access OSS.
The network of the OSS bucket is unreachable.
Solution
Log on to the OSS console and check whether the OSS bucket associated with the backup vault exists.
If the OSS bucket does not exist, create a bucket and re-associate it. For more information, see Create buckets.
Check whether the cluster has the permissions to access OSS.
ACK Pro cluster: You do not need to configure OSS permissions. Make sure that the name of the OSS bucket associated with the backup vault of the cluster starts with cnfs-oss-**.
ACK dedicated cluster and registered cluster: You must configure OSS permissions. For more information, see Install migrate-controller and grant permissions.
For ACK managed clusters that are not installed or upgraded to v1.8.0 or later by using the console, OSS-related permissions may be missing. You can run the following command to check whether the cluster has the permissions to access OSS:
kubectl get secret -n kube-system | grep addon.aliyuncsmanagedbackuprestorerole.token
Expected result:
addon.aliyuncsmanagedbackuprestorerole.token Opaque 1 62d
If the returned content is the same as the preceding expected output, the cluster has permissions to access OSS. You only need to specify an OSS bucket that is named in the cnfs-oss-* format for the cluster.
If the returned content is not the same as the expected output, use one of the following methods to grant permissions:
Refer to the ACK dedicated cluster and registered cluster section to configure OSS permissions. For more information, see Install migrate-controller and grant permissions.
Use an Alibaba Cloud account to click Authorize to complete the authorization. You need to perform this operation only once for each Alibaba Cloud account.
Note: You cannot create a backup vault that uses the same name as a deleted one. You cannot associate a backup vault with an OSS bucket whose name is not in the cnfs-oss-* format. If you have associated a backup vault with an OSS bucket whose name is not in the cnfs-oss-* format, create a backup vault with a different name and associate it with an OSS bucket whose name is in the cnfs-oss-* format.
Run the following command to check the network configurations of the cluster:
kubectl get backuplocation <backuplocation-name> -n csdr -o yaml | grep network
The output is similar to the following content:
network: internal
If the value of network is internal, the backup vault accesses the OSS bucket over the internal network.
If the value of network is public, the backup vault accesses the OSS bucket over the Internet. If the backup vault accesses the OSS bucket over the Internet and the error message indicates a timeout, check whether the cluster can access the Internet. For more information, see Enable an existing ACK cluster to access the Internet.
In the following scenarios, the backup vault must access the OSS bucket over the Internet:
The cluster and the OSS bucket are deployed in different regions.
The current cluster is an ACK Edge cluster.
The current cluster is a registered cluster and is not connected to a virtual private cloud (VPC) by using Cloud Enterprise Network (CEN), Express Connect, or VPN Gateway. Alternatively, the cluster is connected to a VPC but no route is configured to point to the internal OSS endpoint of the region. You must configure a route to point to the internal OSS endpoint of the region.
For more information about how to connect an on-premises data center to a VPC, see Methods that are used to connect data centers to Alibaba Cloud.
For more information about the mapping between internal OSS endpoints and virtual IP address (VIP) CIDR blocks, see Internal OSS endpoints and VIP ranges.
If the backup vault must access the OSS bucket over the Internet, run the following commands to change the access method to Internet access. In the following code, <backuplocation-name> specifies the name of the backup vault and <region-id> specifies the region where the OSS bucket is deployed, such as cn-hangzhou.
kubectl patch -n csdr backuplocation/<backuplocation-name> --type='json' -p '[{"op":"add","path":"/spec/config","value":{"network":"public","region":"<region-id>"}}]'
kubectl patch -n csdr backupstoragelocation/<backuplocation-name> --type='json' -p '[{"op":"add","path":"/spec/config","value":{"network":"public","region":"<region-id>"}}]'
The status of the task is Failed and the "HBRError: check HBR vault error" error is returned
Issue
The status of the backup, restore, or StorageClass conversion task is Failed and the "HBRError: check HBR vault error" error is returned.
Cause
Cloud Backup is not activated or does not have the required permissions.
Solution
Check whether Cloud Backup is activated. For more information, see Activate Cloud Backup.
If your cluster resides in China (Ulanqab), China (Heyuan), or China (Guangzhou), you must grant Cloud Backup the permissions to access API Gateway after you activate Cloud Backup. For more information, see Step 3 (optional): Authorize Cloud Backup to access API Gateway .
If your cluster is an ACK dedicated cluster or registered cluster, make sure that the Resource Access Management (RAM) user you use has the permissions to access Cloud Backup. For more information about how to perform the authorization, see Install migrate-controller and grant permissions.
The status of the task is Failed and the "hbr task finished with unexpected status: FAILED, errMsg ClientNotExist" error is returned
Issue
The status of the backup, restore, or StorageClass conversion task is Failed and the error message hbr task finished with unexpected status: FAILED, errMsg ClientNotExist is returned.
Cause
The Cloud Backup client is abnormally deployed on the corresponding node, which means that the replica of the hbr-client DaemonSet on the node in the csdr namespace is abnormal.
Solution
Run the following command to check whether abnormal hbr-client pods exist in the cluster:
kubectl -n csdr get pod -lapp=hbr-client
If pods are in an abnormal state, first check whether the issue is caused by insufficient pod IP addresses, memory, or CPU resources. If the status of a pod is CrashLoopBackOff, run the following command to view the logs of the pod:
kubectl -n csdr logs -p <hbr-client-pod-name>
If the output contains "SDKError:\n StatusCode: 403\n Code: MagpieBridgeSlrNotExist\n Message: code: 403, AliyunServiceRoleForHbrMagpieBridge doesn't exist, please create this role. ", see Step 3 (optional): Authorize Cloud Backup to access API Gateway to grant permissions to Cloud Backup.
If the log output contains other types of SDK errors, submit a ticket for processing.
The task remains in the InProgress state for a long period of time
Cause 1 and solution: The components in the csdr namespace are abnormal
Check the status of the components and identify the cause of the anomaly.
Run the following command to check whether the components in the csdr namespace are restarting or cannot be started:
kubectl get pod -n csdr
Run the following command to check why the components are restarting or cannot be started:
kubectl describe pod <pod-name> -n csdr
If the cause is OOM restart
If the OOM exception occurs during restoration, the affected pod is csdr-velero-***, and the restore cluster runs many applications (for example, dozens of production namespaces), the OOM exception may be caused by the informer cache that Velero uses by default to accelerate the restore process, because the cache occupies some memory.
If the number of resources to be restored is small or you can accept some performance impact during restoration, you can run the following command to disable the Informer Cache feature:
kubectl -n kube-system edit deploy migrate-controller
Add the parameter --disable-informer-cache=true to the args of the migrate-controller container:
name: migrate-controller
args:
- --disable-informer-cache=true
For other cases, or if you do not want to reduce the speed of cluster resource restoration, run the following command to increase the memory limit of the corresponding Deployment.
For csdr-controller-***, <deploy-name> is csdr-controller. For csdr-velero-***, <deploy-name> is csdr-velero. Replace <container-name> with the name of the container in the Deployment and <new-limit-memory> with the new memory limit.
kubectl -n csdr patch deploy <deploy-name> -p '{"spec":{"template":{"spec":{"containers":[{"name":"<container-name>","resources":{"limits":{"memory":"<new-limit-memory>"}}}]}}}}'
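For example, the following hedged command raises the memory limit of the csdr-velero Deployment to 2Gi. It assumes the container in that Deployment is named velero; check the actual container name first if you are unsure:
kubectl -n csdr get deploy csdr-velero -o jsonpath='{.spec.template.spec.containers[*].name}'
kubectl -n csdr patch deploy csdr-velero -p '{"spec":{"template":{"spec":{"containers":[{"name":"velero","resources":{"limits":{"memory":"2Gi"}}}]}}}}'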
If the cause is that HBR permissions are not configured, which causes the launch to fail
Confirm that the cluster has activated the Cloud Backup service.
Not activated: Please activate Cloud Backup service. For more information, see Cloud Backup.
If Cloud Backup is activated, proceed with the next step.
Confirm that the ACK dedicated cluster and registered cluster have Cloud Backup permissions configured.
Not configured: Configure Cloud Backup permissions. For more information, see Install migrate-controller and grant permissions.
If Cloud Backup permissions are configured, proceed with the next step.
Run the following command to confirm whether the token required by the Cloud Backup client component exists.
kubectl -n csdr describe pod <hbr-client-***>
If the event error message couldn't find key HBR_TOKEN is returned, the token is missing. Perform the following steps to resolve the issue:
Run the following command to query the node where hbr-client-*** is located:
kubectl get pod <hbr-client-***> -n csdr -owide
Run the following command to change the csdr.alibabacloud.com/agent-enable label of the corresponding node from true to false:
kubectl label node <node-name> csdr.alibabacloud.com/agent-enable=false --overwrite
Important: When you back up and restore applications again, the token is automatically created and hbr-client is launched.
If you copy the token from another cluster to the current cluster, the hbr-client that is started will not be active. You need to delete the copied token and the hbr-client-*** pod that is started by this token, and then perform the preceding steps.
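If you need to clean up a copied token, the following commands are a sketch that assumes the token is stored as a Secret in the csdr namespace. The Secret name is a placeholder that you must look up first:
kubectl -n csdr get secret    # locate the copied token Secret; the name varies
kubectl -n csdr delete secret <copied-token-secret-name>
kubectl -n csdr delete pod <hbr-client-pod-name>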
Cause 2 and solution: Cluster snapshot permissions are not configured for disk volume backup
When you back up the disk volume that is mounted to your application, if the backup task remains in the InProgress state for a long period of time, run the following command to query the newly created VolumeSnapshot resources in the cluster:
kubectl get volumesnapshot -n <backup-namespace>
Sample output:
NAME READYTOUSE SOURCEPVC SOURCESNAPSHOTCONTENT ...
<volumesnapshot-name> true <volumesnapshotcontent-name> ...
If the READYTOUSE field of all volumesnapshot resources remains false for a long time, perform the following steps:
Log on to the ECS console and check whether the disk snapshot feature is enabled.
If the disk snapshot feature is not enabled, enable the disk snapshot feature in the corresponding region. For more information, see Activate ECS Snapshot.
If the disk snapshot feature is enabled, proceed with the next step.
Check whether the Container Storage Interface (CSI) component of the cluster runs normally.
kubectl -nkube-system get pod -l app=csi-provisioner
Check whether permissions to use disk snapshots are configured.
ACK managed cluster
Log on to the RAM console as a RAM user who has administrative rights.
In the left-side navigation pane, choose .
On the Roles page, search for AliyunCSManagedBackupRestoreRole in the search box and check whether the authorization policy of the role contains the following policy content:
{ "Statement": [ { "Effect": "Allow", "Action": [ "hbr:CreateVault", "hbr:CreateBackupJob", "hbr:DescribeVaults", "hbr:DescribeBackupJobs2", "hbr:DescribeRestoreJobs", "hbr:SearchHistoricalSnapshots", "hbr:CreateRestoreJob", "hbr:AddContainerCluster", "hbr:DescribeContainerCluster", "hbr:CancelBackupJob", "hbr:CancelRestoreJob", "hbr:DescribeRestoreJobs2" ], "Resource": "*" }, { "Effect": "Allow", "Action": [ "ecs:CreateSnapshot", "ecs:DeleteSnapshot", "ecs:DescribeSnapshotGroups", "ecs:CreateAutoSnapshotPolicy", "ecs:ApplyAutoSnapshotPolicy", "ecs:CancelAutoSnapshotPolicy", "ecs:DeleteAutoSnapshotPolicy", "ecs:DescribeAutoSnapshotPolicyEX", "ecs:ModifyAutoSnapshotPolicyEx", "ecs:DescribeSnapshots", "ecs:DescribeInstances", "ecs:CopySnapshot", "ecs:CreateSnapshotGroup", "ecs:DeleteSnapshotGroup" ], "Resource": "*" }, { "Effect": "Allow", "Action": [ "oss:PutObject", "oss:GetObject", "oss:DeleteObject", "oss:GetBucket", "oss:ListObjects", "oss:ListBuckets", "oss:GetBucketStat" ], "Resource": "acs:oss:*:*:cnfs-oss*" } ], "Version": "1" }
If the AliyunCSManagedBackupRestoreRole role does not exist, go to the RAM Quick Authorization page to create the RAM role.
If the AliyunCSManagedBackupRestoreRole role exists but the policy content is incomplete, grant the preceding permissions to the role. For more information, see Create custom policies and Grant permissions to a RAM role.
ACK dedicated cluster
Log on to the Container Service for Kubernetes (ACK) console. In the left-side navigation pane, click Clusters.
On the Clusters page, find the cluster that you want to manage and click its name. In the left-side pane, click Cluster Information.
On the Cluster Information page, find the Master RAM Role parameter and click the link on the right.
On the Permissions tab, check whether the disk snapshot permissions are normal.
If the k8sMasterRolePolicy-Csi-*** policy does not exist or does not include the following permissions, attach the following disk snapshot policy to the master RAM role. For more information, see Create custom policies and Grant permissions to a RAM role.
{ "Statement": [ { "Effect": "Allow", "Action": [ "hbr:CreateVault", "hbr:CreateBackupJob", "hbr:DescribeVaults", "hbr:DescribeBackupJobs2", "hbr:DescribeRestoreJobs", "hbr:SearchHistoricalSnapshots", "hbr:CreateRestoreJob", "hbr:AddContainerCluster", "hbr:DescribeContainerCluster", "hbr:CancelBackupJob", "hbr:CancelRestoreJob", "hbr:DescribeRestoreJobs2" ], "Resource": "*" }, { "Effect": "Allow", "Action": [ "ecs:CreateSnapshot", "ecs:DeleteSnapshot", "ecs:DescribeSnapshotGroups", "ecs:CreateAutoSnapshotPolicy", "ecs:ApplyAutoSnapshotPolicy", "ecs:CancelAutoSnapshotPolicy", "ecs:DeleteAutoSnapshotPolicy", "ecs:DescribeAutoSnapshotPolicyEX", "ecs:ModifyAutoSnapshotPolicyEx", "ecs:DescribeSnapshots", "ecs:DescribeInstances", "ecs:CopySnapshot", "ecs:CreateSnapshotGroup", "ecs:DeleteSnapshotGroup" ], "Resource": "*" }, { "Effect": "Allow", "Action": [ "oss:PutObject", "oss:GetObject", "oss:DeleteObject", "oss:GetBucket", "oss:ListObjects", "oss:ListBuckets", "oss:GetBucketStat" ], "Resource": "acs:oss:*:*:cnfs-oss*" } ], "Version": "1" }
If the problem persists after you configure the permissions, submit a ticket for processing.
Registered cluster
Only registered clusters whose nodes are all Alibaba Cloud Elastic Compute Service (ECS) instances can use the disk snapshot feature. Check whether the related permissions are granted when you install the CSI storage plug-in. For more information, see Step 1: Grant a RAM user the permissions to manage the CSI plug-in.
Cause 3 and solution: Storage volumes other than disk volumes are used
In migrate-controller 1.7.7 and later versions, backups of disk volumes can be restored across regions. Backups of other volume types cannot be restored across regions. If you are using a storage service that supports Internet access, such as OSS, you can create a statically provisioned PV and PVC and then restore the application. For more information, see Mount a statically provisioned ossfs 1.0 volume.
The status of the backup task is failed and the "backup already exists in OSS bucket" error is returned
Issue
The status of the backup task is Failed and the "backup already exists in OSS bucket" error is returned.
Cause
A backup with the same name is stored in the OSS bucket associated with the backup vault.
A backup may be invisible in the current cluster due to the following reasons:
Backups in ongoing backup tasks and failed backup tasks are not synchronized to other clusters.
If you delete a backup in a cluster other than the backup cluster, the backup file in the OSS bucket is labeled but not deleted. The labeled backup file will not be synchronized to newly associated clusters.
The current cluster is not associated with the backup vault that stores the backup, which means that the backup vault is not initialized.
Solution
Create a backup task that uses a new name.
The status of the backup task is failed and the "get target namespace failed" error is returned
Issue
The status of the backup task is Failed and the "get target namespace failed" error is returned.
Cause
In most cases, this error occurs in scheduled backup tasks. The cause varies based on how you select namespaces.
If you select Include, all the selected namespaces are deleted.
If you select Exclude, no namespace other than the selected namespaces exists in the cluster.
Solution
Modify the backup plan to change the method that is used to select namespaces and change the namespaces that you have selected.
The status of the backup task is failed and the "velero backup process timeout" error is returned
Issue
The status of the backup task is Failed and the "velero backup process timeout" error is returned.
Cause
Cause 1: The subtask of the application backup times out. The duration of a subtask varies based on the amount of cluster resources and the response latency of the API server. In migrate-controller 1.7.7 and later, the default timeout period of subtasks is 60 minutes.
Cause 2: The storage class of the bucket used by the backup vault is Archive Storage, Cold Archive, or Deep Cold Archive. To ensure data consistency during the backup process, files that record metadata must be updated by the backup center component on the OSS server. The backup center component cannot update files that are not restored.
Solution
Solution 1: Modify the global configuration of the subtask timeout period in the backup cluster.
Run the following command to add the velero_timeout_minutes configuration item to applicationBackup. The unit is minutes.
kubectl edit -n csdr cm csdr-config
For example, the following code block sets the timeout period to 100 minutes:
apiVersion: v1
data:
  applicationBackup: |
    ... # Details not shown.
    velero_timeout_minutes: 100
After you modify the timeout period, run the following command to restart csdr-controller for the modification to take effect:
kubectl -n csdr delete pod -l control-plane=csdr-controller
Solution 2: Change the storage class of the bucket used by the backup vault to Standard.
If you want to store backup data in Archive Storage, you can configure a lifecycle rule to automatically convert the storage class and restore the data before restoration. For more information, see Convert storage classes.
The status of the backup task is failed and the "HBR backup request failed" error is returned
Issue
The status of the backup task is Failed and the "HBR backup request failed" error is returned.
Cause
Cause 1: The storage plug-in used by the cluster is not compatible.
Cause 2: Cloud Backup does not support backing up volumes whose volumeMode is Block. For more information, see Volume Mode.
Cause 3: The Cloud Backup client is abnormal, which causes the backup or restore task for file system volumes, such as OSS volumes, File Storage NAS (NAS) volumes, Cloud Parallel File Storage (CPFS) volumes, or local volumes, to time out or fail.
Solution
Solution 1: If your cluster uses a non-Alibaba Cloud CSI storage plug-in, or if the persistent volume (PV) is not a common Kubernetes storage volume such as NFS or LocalVolume, and you encounter compatibility issues, please submit a ticket for assistance.
Solution 2: Please submit a ticket for processing.
Solution 3: Perform the following steps:
Log on to the Cloud Backup console.
In the left-side navigation pane, choose Backup > Container Backup. On the Container Backup page, click the Backup Jobs tab.
In the top navigation bar, select a region.
On the Backup Jobs tab, select Job Name from the drop-down list next to the search box, enter <backup-name>-hbr in the search box, and click the search icon. After the backup task is displayed, you can view the task status. If the task is in an abnormal state, the cause of the anomaly is displayed. For more information, see Back up ACK clusters.
Note: If you want to query a StorageClass conversion task or backup task, search for the corresponding backup name.
The status of the backup task is failed and the "HBR get empty backup info" error is returned
Issue
The status of the backup task is Failed and the "HBR get empty backup info" error is returned.
Cause
In hybrid cloud scenarios, the backup center uses the standard Kubernetes volume mount path as the data backup path by default. For example, for the standard CSI storage driver, the default mount path is /var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~csi/<pv-name>/mount. The same applies to storage drivers that are officially supported by Kubernetes, such as NFS and FlexVolume.
In this case, /var/lib/kubelet is the default kubelet root path. If you modify this path in your Kubernetes cluster, Cloud Backup may not be able to access the data to be backed up.
Solution
Log on to the node where the volume is mounted and perform the following steps to troubleshoot the issue:
Check whether the kubelet root path of the node is changed
Run the following command to query the kubelet startup command:
ps -elf | grep kubelet
If the startup command contains the --root-dir parameter, the value of this parameter is the kubelet root path.
If the startup command contains the --config parameter, the value of this parameter is the kubelet configuration file. If the file contains the root-dir field, the value of this field is the kubelet root path.
If the startup command does not contain root path information, query the content of the kubelet service startup file /etc/systemd/system/kubelet.service. If the file contains the EnvironmentFile field, such as:
EnvironmentFile=-/etc/kubernetes/kubelet
The environment variable configuration file is /etc/kubernetes/kubelet. Query the content of the configuration file. If the file contains the following content:
ROOT_DIR="--root-dir=/xxx"
The kubelet root path is /xxx.
If you cannot find any changes, the kubelet root path is the default path /var/lib/kubelet.
Run the following command to check whether the kubelet root path is a symbolic link to another path:
ls -al <root-dir>
If the output is similar to the following content:
lrwxrwxrwx 1 root root 26 Dec 4 10:51 kubelet -> /var/lib/container/kubelet
The actual root path is /var/lib/container/kubelet.
Verify that the data of the target storage volume exists under the root path.
Make sure that the volume mount path <root-dir>/pods/<pod-uid>/volumes exists and that the subpath of the target type of storage volume exists under the path, such as kubernetes.io~csi or kubernetes.io~nfs.
Add the environment variable KUBELET_ROOT_PATH=/var/lib/container/kubelet/pods to the csdr-controller Deployment in the csdr namespace. /var/lib/container/kubelet is the actual kubelet root path that you obtained by querying the configuration and symbolic link.
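One way to add the environment variable is to use kubectl set env. The following sketch assumes the component runs as the csdr-controller Deployment in the csdr namespace and that the actual kubelet root path is /var/lib/container/kubelet, as in the preceding example:
kubectl -n csdr set env deploy/csdr-controller KUBELET_ROOT_PATH=/var/lib/container/kubelet/pods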
The status of the backup task is failed and the "check backup files in OSS bucket failed" or "upload backup files to OSS bucket failed" or "download backup files from OSS bucket failed" error is returned
Issue
The status of the backup task is Failed and the "upload backup files to OSS bucket failed" error is returned.
Cause
The OSS server returns an error when the component checks, uploads, or downloads backup files in the OSS bucket associated with the backup vault. The issue may arise due to one of the following causes:
Cause 1: Data encryption is enabled for the OSS bucket, but the related KMS permissions are not granted.
Cause 2: Some read and write permissions are missing when you install the component and configure permissions for ACK dedicated clusters and registered clusters.
Cause 3: The authentication credential of the RAM user that is used to configure permissions for ACK dedicated clusters and registered clusters is revoked.
Solution
Solution 2: Check the permission policy of the RAM user that is used to configure permissions. For more information about the permission policy required by the component, see Step 1: Configure permissions.
Solution 3: Check whether the authentication credentials of the RAM user that is used to configure permissions are still enabled. If they have been revoked, obtain new authentication credentials, update the content of the alibaba-addon-secret Secret in the csdr namespace, and then run the following command to restart the component:
kubectl -n kube-system delete pod -l app=migrate-controller
The status of the backup task is PartiallyFailed and the "PROCESS velero partially completed" error is returned
Issue
The status of the backup task is PartiallyFailed and the "PROCESS velero partially completed" error is returned.
Cause
When you use the velero component to back up applications (resources in the cluster), the component fails to back up some resources.
Solution
Run the following command to identify the resources that fail to be backed up and the cause of the failure:
kubectl -n csdr exec -it $(kubectl -n csdr get pod -l component=csdr | tail -n 1 | cut -d ' ' -f1) -- ./velero describe backup <backup-name>
Fix the issue based on the content in the Errors and Warnings fields in the output.
If no direct cause of the failure is displayed, run the following command to obtain the related exception logs:
kubectl -n csdr exec -it $(kubectl -n csdr get pod -l component=csdr | tail -n 1 | cut -d ' ' -f1) -- ./velero backup logs <backup-name>
If you cannot fix the problem based on the failure reason or abnormal logs, submit a ticket for processing.
The status of the backup task is PartiallyFailed and the "PROCESS hbr partially completed" error is returned
Issue
The status of the backup task is PartiallyFailed and the error message "PROCESS hbr partially completed" is returned.
Cause
When you use Cloud Backup to back up file system volumes, such as OSS volumes, NAS volumes, CPFS volumes, or local volumes, Cloud Backup fails to back up some resources. The issue may arise due to one of the following causes:
Cause 1: The storage plug-in used by some volumes is not supported.
Cause 2: Cloud Backup does not guarantee data consistency. If files are deleted during backup, the backup may fail.
Solution
Log on to the Cloud Backup console.
In the left-side navigation pane, choose Backup > Container Backup. On the Container Backup page, click the Backup Jobs tab.
In the top navigation bar, select a region.
On the Backup Jobs tab, select Job Name from the drop-down list next to the search box, enter <backup-name>-hbr in the search box, and click the search icon. If the volume backup fails or partially fails, you can view the cause. For more information, see Back up ACK clusters.
The status of the StorageClass conversion task is failed and the "storageclass xxx not exists" error is returned
Issue
The status of the StorageClass conversion task is Failed and the "storageclass xxx not exists" error is returned.
Cause
The target StorageClass that you select for StorageClass conversion does not exist in the current cluster.
Solution
Run the following command to reset the StorageClass conversion task:
cat << EOF | kubectl apply -f -
apiVersion: csdr.alibabacloud.com/v1beta1
kind: DeleteRequest
metadata:
  name: reset-convert
  namespace: csdr
spec:
  deleteObjectName: "<backup-name>"
  deleteObjectType: "Convert"
EOF
Create the desired StorageClass in the current cluster.
Run the restore task again and configure StorageClass conversion.
The status of the StorageClass conversion task is failed and the "only support convert to storageclass with CSI diskplugin or nasplugin provisioner" error is returned
Issue
The status of the StorageClass conversion task is Failed and the error message "only support convert to storageclass with CSI diskplugin or nasplugin provisioner" is returned.
Cause
The target StorageClass that you select for StorageClass conversion is not an Alibaba Cloud CSI disk volume or NAS volume.
Solution
The current version only supports snapshot creation and restoration for disk and NAS volumes by default. If you have other restoration requirements, submit a ticket.
If you are using a storage service that supports Internet access, such as OSS, you can create a statically provisioned PV and PVC and then restore the application without the StorageClass conversion step. For more information, see Mount a statically provisioned ossfs 1.0 volume.
The status of the StorageClass conversion task is failed and the "current cluster is multi-zoned" error is returned
Issue
The status of the StorageClass conversion task is Failed and the "current cluster is multi-zoned" error is returned.
Cause
The current cluster is a multi-zone cluster, and the target StorageClass of the conversion is a disk volume StorageClass whose volumeBindingMode is set to Immediate. With Immediate binding, the disk volume is created before the pod is scheduled, so the pod may fail to be scheduled to a matching node and remain in the Pending state in a multi-zone cluster. For more information about the volumeBindingMode field, see StorageClass.
Solution
Run the following command to reset the StorageClass conversion task:
cat << EOF | kubectl apply -f -
apiVersion: csdr.alibabacloud.com/v1beta1
kind: DeleteRequest
metadata:
  name: reset-convert
  namespace: csdr
spec:
  deleteObjectName: "<backup-name>"
  deleteObjectType: "Convert"
EOF
If you want to convert to a disk StorageClass:
If you use the console, select alicloud-disk. alicloud-disk uses the alicloud-disk-topology-alltype StorageClass by default.
If you use the command line, select the alicloud-disk-topology-alltype type. alicloud-disk-topology-alltype is the default StorageClass provided by the CSI storage plug-in. You can also set volumeBindingMode to WaitForFirstConsumer.
Run the restore task again and configure StorageClass conversion.
The status of the restore task is failed and the "multi-node writing is only supported for block volume" error is returned
Issue
The status of the restore or StorageClass conversion task is Failed and the error message "multi-node writing is only supported for block volume. For Kubernetes users, if unsure, use ReadWriteOnce access mode in PersistentVolumeClaim for disk volume" is returned.
Cause
To prevent the risk of forcibly detaching a disk that is already mounted to another node, CSI checks the AccessModes configuration of disk volumes during mounting and prohibits the ReadWriteMany and ReadOnlyMany configurations.
The application to be backed up mounts a volume whose AccessMode is ReadWriteMany or ReadOnlyMany (mostly network storage that supports multiple mounts, such as OSS or NAS). When you restore the application to Alibaba Cloud disk storage that does not support multiple mounts by default, CSI may throw the preceding error.
Specifically, the following three scenarios may cause this error:
Scenario 1: The CSI version of the backup cluster is earlier (or the cluster uses the FlexVolume storage plug-in). Earlier CSI versions do not check the AccessModes field of Alibaba Cloud disk volumes during mounting, which causes the original disk volume to report an error when it is restored in a cluster with a later CSI version.
Scenario 2: The custom StorageClass used by the backup volume does not exist in the restore cluster. According to a certain matching rule, the volume is restored as an Alibaba Cloud disk volume by default in the new cluster.
Scenario 3: During restoration, you use the StorageClass conversion feature to manually specify that the backup volume is restored as an Alibaba Cloud disk volume.
Solution
Scenario 1: Starting from v1.8.4, the backup component supports automatic conversion of the AccessModes field of disk volumes to ReadWriteOnce. Upgrade the backup center component and then restore the application again.
Scenario 2: Automatic restoration of the StorageClass by the component in the target cluster may risk data inaccessibility or data overwriting. Create a StorageClass with the same name in the target cluster before restoration, or use the StorageClass conversion feature to specify the StorageClass to be used during restoration.
Scenario 3: When you restore a network storage volume as a disk volume, configure the convertToAccessModes parameter to convert AccessModes to ReadWriteOnce. For more information, see convertToAccessModes: the list of target AccessModes.
The status of the restore task is failed and the "only disk type PVs support cross-region restore in current version" error is returned
Issue
The status of the restore task is Failed and the error message "only disk type PVs support cross-region restore in current version" is returned.
Cause
In migrate-controller 1.7.7 and later versions, backups of disk volumes can be restored across regions. Backups of other volume types cannot be restored across regions.
Solution
If you are using a storage service that supports Internet access, such as OSS, you can create a statically provisioned PV and PVC and then restore the application. For more information, see Mount a statically provisioned ossfs 1.0 volume.
If you need to recover other types of storage data across regions, please submit a ticket.
The status of the restore task is failed and the "ECS snapshot cross region request failed" error is returned
Issue
The status of the restore task is Failed and the "ECS snapshot cross region request failed" error is returned.
Cause
In migrate-controller 1.7.7 and later versions, backups of disk volumes can be restored across regions, but the permissions to use Elastic Compute Service (ECS) disk snapshots are not granted.
Solution
If your cluster is an ACK dedicated cluster or a registered cluster that is connected to a self-managed Kubernetes cluster deployed on ECS instances, you must grant the permissions to use ECS disk snapshots. For more information, see Registered cluster.
The status of the restore task is failed and the "accessMode of PVC xxx is xxx" error is returned
Issue
The status of the restore task is Failed and the "accessMode of PVC xxx is xxx" error is returned.
Cause
The AccessMode of the disk volume to be restored is set to ReadOnlyMany (read-only multi-mount) or ReadWriteMany (read-write multi-mount).
When you restore the disk volume, the new volume is mounted by using CSI. Take note of the following items when you use the current version of CSI:
Only volumes with the multiAttach feature enabled can be mounted to multiple instances.
Volumes whose VolumeMode is set to Filesystem (mounted by using a file system such as ext4 or xfs) can only be mounted to multiple instances in read-only mode.
For more information about disk storage, see Use a dynamically provisioned disk volume.
Solution
If you are using the StorageClass conversion feature to convert a volume that supports multiple mounts, such as an OSS or NAS volume, to a disk volume, and you want different replicas of your workload to keep sharing data on the volume, we recommend that you create a new restore task and select alibabacloud-cnfs-nas as the target type for StorageClass conversion. This way, a NAS volume managed by CNFS is used. For more information, see Use CNFS to manage NAS file systems (recommended).
If the CSI version used when you backed up the disk volume was earlier (without AccessMode validation) and the backed-up persistent volume does not meet the current CSI creation requirements, we recommend that you switch your original workloads to dynamically provisioned disk volumes. This helps avoid the risk of forced disk detachment when pods are scheduled to other nodes. If you have more questions or requirements about multi-mount scenarios, submit a ticket.
The status of the restore task is completed but some resources are not created in the restore cluster
Issue
The status of the restore task is Completed but some resources are not created in the restore cluster.
Cause
Cause 1: The resource is not backed up.
Cause 2: The resource is excluded during restoration based on the configuration.
Cause 3: The application restore subtask partially fails.
Cause 4: The resource is successfully restored but is recycled due to the ownerReferences configuration or business logic.
Solution
Solution 1:
Run the following command to view the backup details:
kubectl -n csdr exec -it $(kubectl -n csdr get pod -l component=csdr | tail -n 1 | cut -d ' ' -f1) -- ./velero describe backup <backup-name> --details
Check whether the target resource is backed up. If the target resource is not backed up, check whether it is excluded due to the namespace, resource, or other configurations specified in the backup task, and then back up the resource again. By default, cluster-level resources of running applications (pods) in namespaces that are not selected are not backed up. If you want to back up all cluster-level resources, see Cluster-level backup.
Solution 2:
If the target resource is not restored, check whether it is excluded due to the namespace, resource, or other configurations specified in the restore task, and then restore the resource again.
Solution 3:
Run the following command to identify the resources that fail to be restored and the cause of the failure:
kubectl -n csdr exec -it $(kubectl -n csdr get pod -l component=csdr | tail -n 1 | cut -d ' ' -f1) -- ./velero describe restore <restore-name>
Fix the issues according to the prompts in the Errors and Warnings fields in the output. If you cannot fix the issues based on the failure reasons, submit a ticket for processing.
Solution 4:
Check the audit of the corresponding resource to determine whether it is abnormally deleted after it is created.
The migrate-controller component in a cluster that uses FlexVolume cannot be launched
migrate-controller does not support clusters that use FlexVolume. To use the backup center feature, use one of the following methods to migrate from FlexVolume to CSI:
Use csi-compatible-controller to migrate from FlexVolume to CSI
Migrate statically provisioned NAS volumes from Flexvolume to CSI
For other cases, join the DingTalk user group (group ID: 35532895) for consultation.
If you need to back up applications in a cluster that uses FlexVolume and restore the applications in a cluster that uses CSI during the migration from FlexVolume to CSI, see Use the backup center to migrate applications in an ACK cluster that runs an old Kubernetes version.
Can I modify the backup vault?
You cannot modify the backup vault. If you want to modify the backup vault, you can only delete the current one and create a backup vault with another name.
Because the backup vault is shared, it may be in the Backup or Restore state at any time. If you modify a parameter of the backup vault, the system may fail to find the required data when backing up or restoring an application. Therefore, you cannot modify the backup vault or create backup vaults that use the same name.
Can I associate a backup vault with an OSS bucket whose name is not in the cnfs-oss-* format?
For clusters other than ACK dedicated clusters and registered clusters, the backup center component has read and write permissions on OSS buckets whose names are in the cnfs-oss-* format by default. To prevent backups from overwriting existing data in the bucket, we recommend that you create a dedicated OSS bucket whose name is in the cnfs-oss-* format for the backup center.
If you want to associate a backup vault with an OSS bucket whose name is not in the cnfs-oss-* format, you must configure permissions for the component. For more information, see ACK dedicated cluster.
After you grant permissions, run the following command to restart the backup service component:
kubectl -n csdr delete pod -l control-plane=csdr-controller
kubectl -n csdr delete pod -l component=csdr
If you have created a backup vault that is associated with an OSS bucket whose name is not in the cnfs-oss-* format, wait until the connectivity check is complete and the status changes to Available before you attempt to back up or restore applications. The interval of connectivity checks is about five minutes. You can run the following command to query the status of the backup vault:
kubectl -n csdr get backuplocation
Expected result:
NAME                     PHASE       LAST VALIDATED   AGE
a-test-backuplocation    Available   7s               6d1h
How do I specify the backup cycle when I create a backup plan?
The backup cycle supports crontab expressions (such as 1 4 * * *) or interval-based backup (such as 6h30m, which means that a backup is created every 6 hours and 30 minutes).
The following describes how to parse crontab expressions. The optional values of each field are the same as those of standard crontab expressions (for example, the optional values of minute are 0 to 59). * indicates any available value for the given field. Sample crontab expressions:
1 4 * * *: Create a backup at 4:01 AM every day.
0 2 15 * 1: Create a backup at 2:00 AM on the 15th day of each month and on every Monday.
* * * * *
| | | | |
| | | | ·----- day of week (0 - 6) (Sun to Sat)
| | | ·-------- month (1 - 12)
| | .----------- day of month (1 - 31)
| ·-------------- hour (0 - 23)
·----------------- minute (0 - 59)
What changes are made to the YAML files of resources when I run a restore task?
When you restore resources, the following changes are made to the YAML files of resources:
Change 1:
If the size of a disk volume is less than 20 GiB, the volume size is changed to 20 GiB.
Change 2:
Services are restored based on Service types:
NodePort Services: By default, Service ports are retained when you restore Services across clusters.
LoadBalancer Services: When ExternalTrafficPolicy is set to Local, HealthCheckNodePort uses a random port by default. If you want to retain the port number, set spec.preserveNodePorts: true when you create a restore task.
If you restore a Service that uses an existing Server Load Balancer (SLB) instance in the backup cluster, the restored Service uses the same SLB instance and disables the listeners by default. You need to log on to the SLB console to configure the listeners.
If you restore a Service whose SLB instance is managed by CCM in the backup cluster, CCM creates a new SLB instance. For more information, see Considerations for configuring a LoadBalancer Service.
How do I view backup resources?
Resources in cluster application backups
The YAML files in the cluster are stored in the OSS bucket associated with the backup vault. You can use one of the following methods to view backup resources:
Run the following commands in a cluster to which backup files are synchronized to view backup resources. The first command queries the name of the csdr-velero pod; replace csdr-velero-xxx in the second command with the returned name:
kubectl -n csdr get pod -l component=csdr | tail -n 1 | cut -d ' ' -f1
kubectl -n csdr exec -it csdr-velero-xxx -c velero -- ./velero describe backup <backup-name> --details
View backup resources in the ACK console:
Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, find the cluster that you want to manage and click its name. In the left-side pane, choose .
On the Application Backup page, click the Backup Records tab. In the Backup Records column, click the backup record that you want to view.
Resources in disk volume backups
Log on to the ECS console.
In the left-side navigation pane, choose .
In the top navigation bar, select the region and resource group of the resource that you want to manage.
On the Snapshots page, query snapshots based on the disk ID.
Resources in non-disk volume backups
Log on to the Cloud Backup console.
In the left-side navigation pane, choose .
In the top navigation bar, select a region.
View the basic information of cluster backups.
Clusters: The list of clusters that have been backed up and protected. Click ACK Cluster ID to view the protected persistent volume claims (PVCs). For more information about PVCs, see Persistent volume claim (PVC).
If Client Status is abnormal, Cloud Backup is not running as expected in the ACK cluster. Go to the DaemonSets page in the ACK console to troubleshoot the issue.
Backup Jobs: The status of backup jobs.
Can I back up applications in a cluster that runs an earlier Kubernetes version and restore the applications in a cluster that runs a later Kubernetes version?
Yes, you can.
By default, when you back up resources, all API versions supported by the resources are backed up. For example, a Deployment in a cluster that runs Kubernetes 1.16 supports extensions/v1beta1, apps/v1beta1, apps/v1beta2, and apps/v1. When you back up the Deployment, the backup vault stores all four API versions regardless of which version you use when you create the Deployment. The KubernetesConvert feature is used for API version conversion.
When you restore resources, the API version recommended by the restore cluster is used for restoration. For example, if you restore the preceding Deployment in a cluster that runs Kubernetes 1.28 and the recommended API version is apps/v1, the restored Deployment will use apps/v1.
If no API version is supported by both clusters, you must manually deploy the resource. For example, Ingresses in clusters that run Kubernetes 1.16 support extensions/v1beta1 and networking.k8s.io/v1beta1. You cannot restore the Ingresses in clusters that run Kubernetes 1.22 or later because Ingresses in these clusters support only networking.k8s.io/v1. For more information about Kubernetes API version migration, see official documentation. Due to API version compatibility issues, we recommend that you do not use the backup center to migrate applications from clusters with newer Kubernetes versions to clusters with older Kubernetes versions. We also recommend that you do not migrate applications from clusters with Kubernetes versions earlier than 1.16 to clusters with newer Kubernetes versions.
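Before you migrate, you can query the API server of the restore cluster to check which API versions it actually serves for a resource. For example, for Ingresses:
kubectl api-versions | grep networking.k8s.io
kubectl api-resources | grep -i ingress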
Is traffic automatically switched to SLB instances during restoration?
No, traffic is not automatically switched.
Services are restored based on Service types:
NodePort Services: By default, Service ports are retained when you restore Services across clusters.
LoadBalancer Services: When ExternalTrafficPolicy is set to Local, HealthCheckNodePort uses a random port by default. If you want to retain the port number, set spec.preserveNodePorts: true when you create a restore task (a hedged example appears at the end of this answer).
If you restore a Service that uses an existing Server Load Balancer (SLB) instance in the backup cluster, the restored Service uses the same SLB instance and disables the listeners by default. You need to log on to the SLB console to configure the listeners.
If you restore a Service whose SLB instance is managed by CCM in the backup cluster, CCM creates a new SLB instance. For more information, see Considerations for configuring a LoadBalancer Service.
By default, after listeners are disabled or new SLB instances are used, traffic is not automatically switched to the new SLB instances. If you use other cloud services or third-party service discovery and do not want automatic service discovery to switch traffic to the new SLB instances, you can exclude Service resources during backup and manually deploy them when you need to switch traffic.
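The following is a minimal sketch of a restore task that retains NodePorts, based on the applicationrestore resource mentioned earlier in this topic. The API version, the resource name, and the backupName field are assumptions made for illustration; check the actual schema in your cluster (for example, with kubectl explain applicationrestore) before you use it.
apiVersion: csdr.alibabacloud.com/v1beta1 # assumption: confirm the served version with kubectl api-resources
kind: ApplicationRestore
metadata:
  name: restore-keep-nodeports # hypothetical name
  namespace: csdr
spec:
  backupName: <backup-name> # assumption: the backup record to restore from
  preserveNodePorts: true # the field described in this answer; retains NodePort and HealthCheckNodePort values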
Why are resources in the csdr, ack-csi-fuse, kube-system, kube-public, and kube-node-lease namespaces not backed up by default?
csdr is the namespace of the backup center. If you directly back up and restore this namespace, components will fail to work in the restore cluster. Additionally, the backup center has a backup synchronization logic, which means you do not need to manually migrate backups to a new cluster.
ack-csi-fuse is the namespace of the CSI storage component and is used to run FUSE client pods maintained by CSI. When you restore storage in a new cluster, the CSI of the new cluster automatically synchronizes to the corresponding client. You do not need to manually back up and restore this namespace.
kube-system, kube-public, and kube-node-lease are the default system namespaces of Kubernetes clusters. Due to differences in cluster parameters and configurations, you cannot restore these namespaces across clusters. Additionally, the backup center is used to back up and restore applications. Before you run a restore task, you must install and configure system components in the restore cluster, such as:
Container Registry password-free image pulling component: You need to grant permissions to and configure acr-configuration in the restore cluster.
Application Load Balancer (ALB) Ingresses: You need to configure ALBConfigs.
If you directly back up system components in the kube-system namespace to a new cluster, the system components may fail to run in the new cluster.
Does the backup center use ECS disk snapshots to back up disk volumes? What is the default type of snapshots?
In the following scenarios, the backup center uses ECS disk snapshots to back up disk volumes by default:
The cluster is an ACK managed cluster or ACK dedicated cluster.
The cluster runs Kubernetes 1.18 or later and uses CSI 1.18 or later.
In other scenarios, the backup center uses Cloud Backup to back up disk volumes by default.
Disk snapshots created by the backup center have the instant access feature enabled by default. The validity period of disk snapshots is the same as the validity period specified in the backup configuration by default. Starting from October 12, 2023, 11:00, Alibaba Cloud no longer charges for snapshot instant access storage or snapshot instant access operations in all regions. For more information, see Use the instant access feature.
Why is the validity period of ECS disk snapshots created from backups different from the validity period specified in the backup configuration?
The creation of disk snapshots depends on the csi-provisioner component or managed-csiprovisioner component of a cluster. If the version of the csi-provisioner component is earlier than 1.20.6, you cannot specify the validity period or enable the instant access feature when you create VolumeSnapshots. In this case, the validity period in the backup configuration does not affect disk snapshots.
Therefore, when you use the volume data backup feature for disk volumes, you must upgrade the csi-provisioner component to 1.20.6 or later.
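To check the version that is currently deployed, you can inspect the component image, as in the following sketch. It assumes that the csi-provisioner Deployment runs in the kube-system namespace, which is the usual location in ACK clusters; for the managed-csiprovisioner component, check the component version on the Add-ons page of the console instead.
kubectl -n kube-system get deployment csi-provisioner -o jsonpath='{.spec.template.spec.containers[*].image}'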
If csi-provisioner cannot be upgraded to this version, you can configure the default snapshot validity period in the following ways:
Update the backup center component migrate-controller to v1.7.10 or later.
Run the following command to check whether a VolumeSnapshotClass whose retentionDays is 30 exists in the cluster:
kubectl get volumesnapshotclass csdr-disk-snapshot-with-default-ttl
If the VolumeSnapshotClass does not exist, you can use the following YAML to create a VolumeSnapshotClass named csdr-disk-snapshot-with-default-ttl.
If the VolumeSnapshotClass exists, set the retentionDays parameter of the default csdr-disk-snapshot-with-default-ttl VolumeSnapshotClass to 30.
apiVersion: snapshot.storage.k8s.io/v1
deletionPolicy: Retain
driver: diskplugin.csi.alibabacloud.com
kind: VolumeSnapshotClass
metadata:
  name: csdr-disk-snapshot-with-default-ttl
parameters:
  retentionDays: "30"
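If you saved the preceding YAML to a file, you can create the VolumeSnapshotClass by applying it. The file name below is only an example.
kubectl apply -f csdr-disk-snapshot-with-default-ttl.yaml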
After the configuration is complete, all disk volume backups created in the cluster will create disk snapshots with the same validity period as the retentionDays field.
Important: If you want the validity period of ECS disk snapshots created from backups to be the same as the validity period specified in the backup configuration, we recommend that you upgrade the csi-provisioner component to 1.20.6 or later.
In what scenarios do I need to back up volumes when I back up applications?
What is volume data backup?
Volume data is backed up to cloud storage using ECS disk snapshots or the Cloud Backup service. When you restore the application, the data is stored in a new disk or NAS file system for the restored application to use. The restored application and the original application do not share data sources and do not affect each other.
If you do not need to copy data or have shared data source requirements, you can choose not to back up volume data and ensure that the list of excluded resources in the backup does not include PVC and PV resources. During restoration, the volumes are deployed to the new cluster based on the original YAML files.
In what scenarios do I need to back up volumes?
You want to implement data replication and disaster recovery.
The storage type is disk volume, because a basic disk can be mounted to only one node and therefore cannot be shared by the original application and the restored application.
You want to implement cross-region backup and restoration. In most cases, storage types other than OSS do not support cross-region access.
You want to isolate data between the backup application and the restored application.
The storage plug-ins or versions of the backup cluster and the restore cluster are significantly different, and the YAML files cannot be directly restored.
What are the risks of not backing up volumes for stateful applications?
If you do not back up volumes when you back up stateful applications, the following behaviors occur during restoration:
For volumes whose reclaim policy is Delete:
Similar to when you deploy a PVC for the first time, if the restore cluster has a corresponding StorageClass, CSI automatically creates a new PV. For example, for disk storage, a new empty disk is mounted to the restored application. For static volumes that do not have a StorageClass specified or if the restore cluster does not have a corresponding StorageClass, the restored PVC and pod remain in the Pending state until you manually create a corresponding PV or StorageClass.
For volumes whose reclaim policy is Retain:
During restoration, resources are restored in the order of PV first and then PVC based on the original YAML files. For storage that supports multiple mounts, such as NAS and OSS, the original file system or bucket can be directly reused. For disks, there may be a risk of forced disk detachment.
You can run the following command to query the reclaim policy of volumes:
kubectl get pv -o=custom-columns=CLAIM:.spec.claimRef.name,NAMESPACE:.spec.claimRef.namespace,NAME:.metadata.name,RECLAIMPOLICY:.spec.persistentVolumeReclaimPolicy
Expected result:
CLAIM        NAMESPACE   NAME                                        RECLAIMPOLICY
www-web-0    default     d-2ze53mvwvrt4o3xxxxxx                      Delete
essd-pvc-0   default     d-2ze5o2kq5yg4kdxxxxxx                      Delete
www-web-1    default     d-2ze7plpd4247c5xxxxxx                      Delete
pvc-oss      default     oss-e5923d5a-10c1-xxxx-xxxx-7fdf82xxxxxx    Retain
How do I select nodes that can be used to back up file systems in data protection?
By default, when you back up storage volumes other than Alibaba Cloud disk volumes, Cloud Backup is used for data backup and restoration. In this case, a Cloud Backup task must be executed on a node. The default scheduling policy of ACK Scheduler is the same as that of the Kubernetes scheduler. You can also configure tasks to be scheduled only to specific nodes based on your business requirements.
Cloud Backup tasks cannot be scheduled to virtual nodes.
By default, backup tasks are low-priority tasks. For the same backup task, a maximum of one volume backup task can be executed on a node.
Node scheduling policies of the backup center
exclude policy (default): By default, all nodes can be used for backup and restoration. If you do not want Cloud Backup tasks to be scheduled to specific nodes, add the csdr.alibabacloud.com/agent-excluded="true" label to the nodes.
kubectl label node <node-name-1> <node-name-2> csdr.alibabacloud.com/agent-excluded="true"
include policy: By default, nodes without labels cannot be used for backup and restoration. Add the csdr.alibabacloud.com/agent-included="true" label to nodes that are allowed to execute Cloud Backup tasks.
kubectl label node <node-name-1> <node-name-2> csdr.alibabacloud.com/agent-included="true"
prefer policy: By default, all nodes can be used for backup and restoration. The scheduling priority is as follows:
Nodes with the csdr.alibabacloud.com/agent-included="true" label have the highest priority.
Nodes without special labels have the second highest priority.
Nodes with the csdr.alibabacloud.com/agent-excluded="true" label have the lowest priority.
Change the node selection policy
Run the following command to edit the csdr-config ConfigMap:
kubectl -n csdr edit cm csdr-config
Add the node_schedule_policy configuration to the applicationBackup configuration. A hedged example follows these steps.
Run the following command to restart the csdr-controller Deployment for the configuration to take effect:
kubectl -n csdr delete pod -lapp=csdr-controller
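Because the exact layout of the csdr-config ConfigMap can differ, the following is only a sketch of what the edited ConfigMap might look like. It assumes that the applicationBackup key stores its settings as a YAML-formatted string and that node_schedule_policy accepts exclude, include, or prefer as described above; compare it with the existing content of the ConfigMap in your cluster before you save.
apiVersion: v1
kind: ConfigMap
metadata:
  name: csdr-config
  namespace: csdr
data:
  applicationBackup: | # assumption: backup settings are stored as a YAML string under this key
    node_schedule_policy: include # one of exclude (default), include, or prefer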
What are the scenarios for application backup and data protection?
Application backup:
You want to back up your business in your cluster, including applications, Services, and configuration files.
Optional: When you back up an application, you want to also back up the volumes mounted to the application.
Note: The application backup feature does not back up volumes that are not mounted to pods.
If you want to back up applications and all volumes, you can create data protection backup tasks.
You want to migrate applications between clusters and quickly restore applications for disaster recovery.
Data protection:
You want to back up volumes, including only PVCs and PVs.
You want to restore PVCs, which are independent of the backup data. When you use the backup center to restore a deleted PVC, a new disk is created and the data on the disk is identical to the data in the backup file. In this case, the mount parameters of the new PVC remain unchanged. The new PVC can be directly mounted to applications.
You want to implement data replication and disaster recovery.
Does the backup center support data encryption for associated OSS buckets? How do I grant the permissions to use Key Management Service (KMS) for server-side encryption?
OSS buckets support both server-side encryption and client-side encryption. However, the backup center supports only server-side encryption for OSS buckets. You can manually enable server-side encryption for the OSS bucket that you associate with the backup center and configure the encryption method in the OSS console. For more information about server-side encryption for OSS buckets and how to enable it, see Server-side encryption.
If you use a customer master key (CMK) managed by KMS for encryption and decryption and use your own key (BYOK), which means that you specify a CMK ID, you need to grant the backup center permissions to access KMS. Follow these steps:
Create a custom policy. For more information, see Create custom policies.
{ "Version": "1", "Statement": [ { "Effect": "Allow", "Action": [ "kms:List*", "kms:DescribeKey", "kms:GenerateDataKey", "kms:Decrypt" ], "Resource": [ "acs:kms:*:141661496593****:*" ] } ] }
The preceding policy allows the backup center to call all KMS keys under the Alibaba Cloud account ID. If you need more fine-grained Resource configuration, see Authorization information.
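As a sketch of a more fine-grained scope, you can restrict the Resource element to a single CMK instead of all keys under the account. The region, account ID, and key ID below are placeholders; confirm the exact ARN format in the KMS authorization documentation.
"Resource": [
  "acs:kms:cn-hangzhou:141661496593****:key/<your-cmk-id>"
]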
For ACK dedicated clusters and registered clusters, grant permissions to the RAM user that is used during installation. For more information, see Grant permissions to a RAM user. For other clusters, grant permissions to the AliyunCSManagedBackupRestoreRole role. For more information, see Grant permissions to a RAM role.
If you use a KMS key managed by OSS or use a key fully managed by OSS for encryption and decryption, you do not need to grant additional permissions.
How do I change the images used by applications during restoration?
Assume that the image used by the application in the backup is: docker.io/library/app1:v1
Change the image repository address (registry)
In hybrid cloud scenarios, you may need to deploy an application across the clouds of multiple cloud service providers or you may need to migrate an application from the data center to the cloud. In this case, you must upload the image used by the application to an image repository on Container Registry.
You must use the imageRegistryMapping field to specify the image repository address. For example, the following configuration changes the image to registry.cn-beijing.aliyuncs.com/my-registry/app1:v1.
docker.io/library/: registry.cn-beijing.aliyuncs.com/my-registry/
Change the image repository (repository) and version
Changing the image repository and version is an advanced feature. Before you create a restore task, you must specify the change details in a ConfigMap.
If you want to change the image repository and version to app2:v2, create the following configuration:
apiVersion: v1
kind: ConfigMap
metadata:
  name: <ConfigMap name>
  namespace: csdr
  labels:
    velero.io/plugin-config: ""
    velero.io/change-image-name: RestoreItemAction
data:
  "case1": "app1:v1,app2:v2"
  # If you want to change only the image repository, use the following setting.
  # "case1": "app1,app2"
  # If you want to change only the image version, use the following setting.
  # "case1": "v1,v2"
  # If you want to change only an image in an image repository, use the following setting.
  # "case1": "docker.io/library/app1:v1,registry.cn-beijing.aliyuncs.com/my-registry/app2:v2"
If you have multiple change requirements, you can continue to configure case2, case3, and so on in the data field.
After the ConfigMap is created, create a restore task as normal and leave the imageRegistryMapping field empty.
Note: The changes take effect on all restore tasks in the cluster. We recommend that you configure fine-grained modifications based on the preceding description. For example, configure image changes within a single repository. If the ConfigMap is no longer required, delete it.