You can integrate the backup center with Managed Service for Prometheus to monitor the status of backup vaults and backup tasks in the current cluster. This gives you real-time insight into application migration and disaster recovery. This topic describes how to use the dashboards provided by Managed Service for Prometheus to monitor the backup center and how to configure alerting for backup task failures.
Prerequisites
The backup service component migrate-controller is installed and the version of the component is v1.7.10 or later. For more information, see Install migrate-controller and grant permissions and Manage components.
You cannot install the latest migrate-controller version in clusters that run a Kubernetes version earlier than 1.20. To use the backup center monitoring feature, update the Kubernetes version of your cluster first. For more information, see Manually update a cluster.
Managed Service for Prometheus is enabled for the cluster.
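The version requirements above can be checked from a shell before you continue. The following is a minimal sketch, not part of the product: version_ge is a hypothetical helper name, and it assumes a sort command that supports the -V (version sort) option, as GNU sort does. The version strings that you pass in are the ones you read from the component and cluster information pages.

```shell
# version_ge HAVE WANT
# Succeeds if version HAVE is greater than or equal to version WANT.
# Hypothetical helper; a leading "v" is ignored, and sort -V is assumed to exist.
version_ge() {
  local have="${1#v}" want="${2#v}"
  local lowest
  # After a version sort, the first line is the lower of the two versions.
  lowest="$(printf '%s\n%s\n' "$have" "$want" | sort -V | head -n 1)"
  [ "$lowest" = "$want" ]
}
```

For example, `version_ge v1.7.12 v1.7.10` succeeds, and `version_ge 1.19 1.20` fails, which indicates that the cluster must be updated first.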
Billing
The migrate-controller component automatically sends metrics to Managed Service for Prometheus. These metrics are considered custom metrics. Using custom metrics incurs additional fees.
We recommend that you read Metrics before you enable the backup center monitoring feature to learn the billing rules for custom metrics. The fees may vary based on the cluster size and number of applications. You can also view resource usage in Managed Service for Prometheus.
Integrate the backup center with Managed Service for Prometheus
You can use Managed Service for Prometheus to monitor the status of backup vaults associated with a cluster and the status of backup tasks in the cluster.
Log on to the ARMS console.
In the left-side navigation pane, click Integration Center. On the Infrastructure tab, search for Backupcenter or Ack Backupcenter and click Backupcenter to go to the integration page.
On the Start Integration tab, select the Container Service for Kubernetes (ACK) cluster that has the backup center installed and click OK.
After the integration is complete, you can log on to the ACK console or ARMS console to view the dashboards.
View backup center dashboards
Procedure
Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, find the cluster that you want to manage and click its name. In the left-side pane, go to the Prometheus Monitoring page.
On the Prometheus Monitoring page, click the Others tab and view the backup center dashboards below ACK BackupCenter.
For more information about how to view the backup center dashboards in the ARMS console, see View a dashboard.
Dashboard introduction
The following backup center dashboards are supported: Backup Locations (backup vault information), Backup Operation Status (backup task information), and Addon Status (working component information).
Backup Locations
This dashboard displays the detailed information (Backuplocation Detail) about backup vaults associated with the current cluster.
A backup vault stores backup files and declares the association between the backup center and an Object Storage Service (OSS) bucket. The backup center can perform backup, snapshot, and restore tasks only after a backup vault enters the Available state. The following table describes the Backuplocation Detail metrics.
Metric | Description |
Backuplocation | The name of the backup vault. |
OSS Bucket | The name of the OSS bucket associated with the backup vault. |
Region | The region of the OSS bucket, such as cn-hangzhou. |
NetworkPolicy | The type of the network connection between the backup vault and OSS bucket. Valid values:
|
Phase | The status of the backup vault, such as Available or Unavailable. |
Backup Operation Status
This dashboard displays the status of backup tasks, including an overview of all backup tasks (Backup Overview) and the details of failed backup tasks (Failed Backup Detail).
Backup Overview: displays a histogram of the number of backup tasks created in each backup vault in the current cluster. The backup tasks include both instant and scheduled backup tasks. The X axis shows the names of the backup vaults, and the Y axis shows the number of backup tasks in each backup vault. The following table describes the Backup Overview metrics.
Metric | Description |
Backup (Failed) | The red bar displays the number of failed backup tasks. |
Backup (Completed) | The green bar displays the number of successful backup tasks. |
Failed Backup Detail: displays the basic information of failed backup tasks in the current cluster. The following table describes the Failed Backup Detail metrics.
Metric | Description |
Backup | The name of the backup task. |
Backuplocation | The name of the backup vault to which the backup task belongs. |
BackupType | The backup mode of the backup task. Valid values:
AppBackup: creates only application backups (YAML backups).
AppAndPvBackup: creates application and data backups. YAML files and data stored in persistent volumes (PVs) are backed up.
|
DataType | The type of data backups. Valid values:
snapshot: The PVs are disk volumes.
hbr: The PVs are file system volumes, including HostPath local volumes, NAS volumes, and OSS volumes.
all: The PVs include disk volumes and file system volumes.
none: Data backup is enabled. However, no PV is used in the specified namespace.
|
FromSchedule | The type of the backup task. Valid values:
Empty: instant backup task.
Not empty: scheduled backup task. The name of the backup plan is displayed.
|
Addon Status
This dashboard displays the status of the csdr-controller and csdr-velero working components. Make sure that both working components run normally so that the backup center can run backup, snapshot, and restore tasks.
After the backup center component migrate-controller is installed, it runs a precheck on the cluster. After the precheck is complete, migrate-controller deploys the csdr-controller and csdr-velero working components in the csdr namespace of the backup center.
The csdr-controller and csdr-velero working components run in Deployment pods. The following table describes the Addon Status metrics.
Metric | Description |
Age | The uptime of the working component. |
Status | The status of the working component, such as Health or UnHealth. |
Pods | The detailed information of the working component pod. |
Memory Request | The amount of memory resources reserved for the working component. |
CPU Request | The amount of CPU resources reserved for the working component. |
Memory Limit | The memory upper limit of the working component. |
CPU Limit | The CPU upper limit of the working component. |
Configure alerting for backup task failures
Alerts for backup task failures are event alerts. An applicationbackups custom resource, defined by a CustomResourceDefinition (CRD) in the csdr.alibabacloud.com API group, is created for each backup task. When the backup task fails, a WARN event is generated for that resource.
Query WARN events generated for failed backup tasks
Run the following command to query WARN events generated for failed backup tasks:
kubectl -ncsdr get events --field-selector='type!=Normal'
Expected output:
VaultError: backup vault is unavailable: oss: service returned error: StatusCode=403, ErrorCode=AccessDenied, ErrorMessage="The bucket you access does not belong to you.", RequestId=668516BC35F915******
VaultError displays the cause of failure.
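When many events are returned, it can help to pull out only the error code from each message. The following is a minimal sketch that assumes the message format shown in the expected output above (an `ErrorCode=<code>` field). The function name extract_error_code is hypothetical, and other failure causes may use a different message layout.

```shell
# extract_error_code
# Reads event messages on stdin and prints the value of each ErrorCode= field.
# Hypothetical helper; lines without an ErrorCode= field are skipped.
extract_error_code() {
  sed -n 's/.*ErrorCode=\([A-Za-z0-9]*\).*/\1/p'
}

# Sample message copied from the expected output above.
msg='VaultError: backup vault is unavailable: oss: service returned error: StatusCode=403, ErrorCode=AccessDenied, ErrorMessage="The bucket you access does not belong to you.", RequestId=668516BC35F915******'
printf '%s\n' "$msg" | extract_error_code   # prints AccessDenied
```

In practice, you would pipe the output of the kubectl command above into extract_error_code instead of using the sample message.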
Configure alert rules for WARN events generated by backup task failures
Use the alerting feature of ACK clusters to configure alert rules. For more information, see Alert management.
Analyze abnormal monitoring data
Troubleshoot the issue that a working component does not exist or is in the abnormal state (UnHealth)
After the backup center is installed, the working component cannot be found or is repeatedly deployed.
Run the following command to query the status of the migrate-controller component:
kubectl -n kube-system get pod -l app=migrate-controller
If the component is in the CrashLoopBackOff state or keeps restarting, the cluster failed the precheck. Typically, this issue occurs because the cluster uses FlexVolume or the registered cluster does not have the required permissions. For more information, see FAQ about the backup center and Registered cluster.
The working component remains in the UnHealth state for a long period of time. The pod dashboard does not display any data or abnormal states.
The pod of the working component cannot be started. For more information, see Pod troubleshooting.
The working component is in the Health state but the number of restarts displayed in the pod dashboard is not 0.
The memory usage of the csdr-velero component spikes during the backup process, which can easily cause out-of-memory (OOM) errors and make the component exit unexpectedly. To resolve this issue, increase the memory limit of the component.
Note: If the pod of the working component exits unexpectedly during the backup process, the backup task fails or remains in the InProgress state for a long period of time.
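The restart check and the memory adjustment described above can also be done from the command line. The following is a minimal sketch under stated assumptions: both function names are hypothetical, the Deployment is assumed to be named csdr-velero in the csdr namespace, and the pod list is assumed to use the default `kubectl get pods` column order (NAME, READY, STATUS, RESTARTS, AGE). The backup center may manage these Deployments itself, so verify that a changed limit persists.

```shell
# list_restarted_csdr_pods
# Prints the name and restart count of each pod in the csdr namespace
# whose restart count is not 0 (RESTARTS is assumed to be column 4).
list_restarted_csdr_pods() {
  kubectl -n csdr get pods --no-headers | awk '$4 > 0 { print $1, $4 }'
}

# raise_velero_memory_limit LIMIT
# Sets the memory limit of the csdr-velero Deployment, for example 4Gi.
# Assumption: the Deployment name matches the component name.
raise_velero_memory_limit() {
  kubectl -n csdr set resources deployment csdr-velero --limits="memory=$1"
}
```

Choose the limit based on the peak memory usage that the Addon Status dashboard shows during a backup. No specific value is recommended here.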
Troubleshoot the issue that the backup vault is in the abnormal state (Unavailable)
Run the following command to view the error message. Replace <unavailable-backuplocation-name> with the name of the backup vault in the abnormal state.
kubectl -ncsdr describe backuplocation <unavailable-backuplocation-name>
For more information about troubleshooting backup vault exceptions, see FAQ about the backup center.
Troubleshoot backup task failures
Use the CLI
Run the following command to view the error message. Replace <failed-applicationbackup-name> with the name of the failed backup task.
kubectl -ncsdr describe applicationbackup <failed-applicationbackup-name>
For more information about troubleshooting backup task failures, see FAQ about the backup center.
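To see only the events that belong to one failed backup task, the event query shown earlier in this topic can be narrowed by object name. The following is a minimal sketch: backup_task_events is a hypothetical function name, and it assumes that the events are recorded against an object whose name equals the backup task name. The `involvedObject.name` and `type` field selectors are standard for Kubernetes events.

```shell
# backup_task_events NAME
# Lists events in the csdr namespace that are not of type Normal and that
# reference the object called NAME (assumed to be the backup task name).
backup_task_events() {
  kubectl -n csdr get events \
    --field-selector "involvedObject.name=$1,type!=Normal"
}
```

For example, run `backup_task_events <failed-applicationbackup-name>` with the name of the failed backup task.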
Use the console
Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, find the cluster that you want to manage and click its name. In the left-side pane, go to the Application Backup page.
On the Application Backup page, click the Backup Records tab, find the failed backup task, and click Failed in the Status column to view the error message.