Alibaba Cloud ACK Backup Center: A One-stop Disaster Recovery Solution for Kubernetes Cluster Business Applications

This article is based on Su Yashi's speech at the 2024 Apsara Conference.

Why Do Kubernetes Cluster Businesses Need Disaster Recovery?

High availability configurations for clusters and applications are the foundation of cluster stability, ensuring that applications continue to operate reliably even when there are unexpected failures in infrastructure.

However, during rapid application iteration and day-to-day cluster operation and maintenance, human errors such as the accidental deletion of cluster resources can occur. For critical business applications, we recommend performing periodic backups, plus a one-time backup before business iterations, rollbacks, and high-risk operations within the cluster, to reduce the recovery time objective (RTO) and recovery point objective (RPO) when such failures happen.

Therefore, effective disaster recovery measures serve as the last line of defense for business stability.

What Are the Disaster Recovery Features and New Requirements After Business Containerization?

For businesses running in Kubernetes clusters, the objects and goals of disaster recovery have changed:

Because the business is orchestrated by Kubernetes, the objects of disaster recovery should be the cluster resources centered on workloads (that is, Pods) and the associated cloud resource information. For stateful applications, the data within storage volumes must also be covered by disaster recovery.

The recovery after containerization focuses more on the continuity of the business. The objective is to relaunch the workload, keep the relevant configuration unchanged, and restore the external service. The recovery can be either in-place (the backup cluster and recovery cluster are the same) or cross-cluster (they are not the same cluster).

Backup Center: A Container-Native, Application-Level Disaster Recovery Solution for Kubernetes

To meet these new disaster recovery features and requirements, ACK provides the backup center, a one-stop disaster recovery solution.

Backup center overview:

https://www.alibabacloud.com/help/en/ack/ack-managed-and-ack-dedicated/user-guide/backup-center-overview

Cluster O&M personnel can use the console to create periodic backup plans or one-time application backups with a single click. Compared with etcd backup, the backup center supports selecting the applications to back up by namespace, label, and resource type. For stateful applications, it can back up the storage volume data mounted by the business at the same time.
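
The selection by namespace, label, and resource type can be pictured with a small sketch. This is only an illustration of the filtering idea, not the backup center's implementation; the function name `select_for_backup` and the sample resources are assumptions made for this example.

```python
def select_for_backup(resources, namespaces=None, labels=None, kinds=None):
    """Return the resources that match every filter that was supplied.

    Hypothetical helper: a filter left as None is simply not applied.
    """
    selected = []
    for res in resources:
        meta = res.get("metadata", {})
        if namespaces and meta.get("namespace") not in namespaces:
            continue
        if kinds and res.get("kind") not in kinds:
            continue
        res_labels = meta.get("labels", {})
        if labels and any(res_labels.get(k) != v for k, v in labels.items()):
            continue
        selected.append(res)
    return selected


# Sample inventory (made-up names, for illustration only).
resources = [
    {"kind": "StatefulSet", "metadata": {"name": "mysql", "namespace": "prod", "labels": {"app": "mysql"}}},
    {"kind": "Secret", "metadata": {"name": "mysql-auth", "namespace": "prod", "labels": {"app": "mysql"}}},
    {"kind": "Deployment", "metadata": {"name": "web", "namespace": "prod", "labels": {"app": "web"}}},
    {"kind": "ConfigMap", "metadata": {"name": "mysql-conf", "namespace": "dev", "labels": {"app": "mysql"}}},
]

picked = select_for_backup(resources, namespaces={"prod"}, labels={"app": "mysql"})
picked_names = [r["metadata"]["name"] for r in picked]
```

Here only the MySQL resources in the `prod` namespace are selected, which is the kind of scoping a whole-cluster etcd backup cannot express.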

For enterprises with established GitOps processes, the data protection feature of the backup center can be used to perform disaster recovery exclusively for storage volume data.

Before recovery, simple adjustments to resources such as namespaces and image registry mappings can be made through configuration.
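
As a rough sketch of what namespace and image registry mappings do to a resource before it is restored, consider the following. The function `apply_restore_mappings`, the registry host names, and the namespaces are all hypothetical; the backup center's own configuration format is not shown here.

```python
import copy


def apply_restore_mappings(resource, namespace_map, registry_map):
    """Return a copy of the resource with namespace and image registry remapped."""
    restored = copy.deepcopy(resource)
    meta = restored.setdefault("metadata", {})
    if meta.get("namespace") in namespace_map:
        meta["namespace"] = namespace_map[meta["namespace"]]

    # Workloads such as Deployments keep their containers under spec.template.spec.
    pod_spec = restored.get("spec", {}).get("template", {}).get("spec", {})
    for container in pod_spec.get("containers", []):
        image = container["image"]
        for old, new in registry_map.items():
            if image.startswith(old + "/"):
                container["image"] = new + image[len(old):]
                break
    return restored


deployment = {
    "kind": "Deployment",
    "metadata": {"name": "mysql", "namespace": "prod"},
    "spec": {"template": {"spec": {"containers": [
        {"name": "mysql", "image": "registry.example.com/team/mysql:8.0"},
    ]}}},
}

restored = apply_restore_mappings(
    deployment,
    namespace_map={"prod": "prod-restore"},
    registry_map={"registry.example.com": "registry.example.internal"},
)
```

The original object is left untouched, so the same backup can be restored again with different mappings.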

For more complex and advanced adjustment needs, flexible configurations such as traffic redirection, replica count adjustments, and configuration file modifications can be achieved through ConfigMap settings.

When recovery is required, a backup can be restored in full or in part, covering both applications and storage volumes. In addition to applying the pre-configured adjustment strategies, the backup center automatically adjusts the resource recovery order and certain configurations during recovery, such as switching storage drivers in cross-cloud scenarios, to ensure compatibility with ACK system components and the Alibaba Cloud ecosystem.

Backup Center Console Demonstration

The application backup console of the cluster provides the relevant usage process guidelines.

After creating an OSS bucket for storing backups and associating it with a backup vault, you can create a backup plan or initiate an immediate backup in the console.

The status and details of backups are displayed in the backup record list.

To restore a backup, simply click "Immediate Restore" for the target backup. If there are no advanced configuration requirements, the backup and restoration of application data can be accomplished within a single console.

Next, using a MySQL stateful application as an example, we will discuss the challenges faced in achieving the goal of coherent business recovery.

Challenges and Solutions for Container Native Cluster Resource Disaster Recovery with the Backup Center

As Kubernetes continues to evolve, the number of official resource types it provides is increasing, along with various versions of these resources and user-defined resource types. This means the number of resources involved in backup and recovery operations is also growing.

Taking MySQL as an example, a deployment might use a Secret to store accounts and passwords, a ConfigMap to store startup configurations, a PVC and PV to record underlying storage information, and a Service and Ingress to store port and network resource information.

With so many resource types, the first challenge is how to ensure the completeness of backups. Any missing resource can lead to the failure of the application during recovery.

  • The backup center, based on the definitions of Kubernetes resources, automatically appends dependent resources during backup according to their dependencies. For example, if MySQL uses a custom resource, the relevant CRD will be automatically appended during backup, ensuring successful deployment of the custom resource in a new cluster.
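
The CRD example can be sketched as follows. This is a simplified illustration of dependency appending, not the backup center's logic; the helper `append_required_crds` and the `MySQLCluster` custom resource are hypothetical. The only Kubernetes facts relied on are that a CRD declares its API group in `spec.group` and its kind in `spec.names.kind`.

```python
def append_required_crds(selected, cluster_crds):
    """Append the CRD that defines each selected custom resource, once each."""
    result = list(selected)
    included = {
        r["metadata"]["name"]
        for r in selected
        if r["kind"] == "CustomResourceDefinition"
    }
    for res in selected:
        # "apps/v1" -> group "apps"; core resources ("v1") have an empty group.
        group = res["apiVersion"].rpartition("/")[0]
        for crd in cluster_crds:
            spec = crd["spec"]
            matches = spec["group"] == group and spec["names"]["kind"] == res["kind"]
            if matches and crd["metadata"]["name"] not in included:
                result.append(crd)
                included.add(crd["metadata"]["name"])
    return result


# Hypothetical CRD and custom resource, for illustration only.
mysql_crd = {
    "kind": "CustomResourceDefinition",
    "apiVersion": "apiextensions.k8s.io/v1",
    "metadata": {"name": "mysqlclusters.mysql.example.com"},
    "spec": {"group": "mysql.example.com", "names": {"kind": "MySQLCluster"}},
}
selected = [
    {"kind": "MySQLCluster", "apiVersion": "mysql.example.com/v1", "metadata": {"name": "orders-db"}},
    {"kind": "Secret", "apiVersion": "v1", "metadata": {"name": "orders-db-auth"}},
]

backup_set = append_required_crds(selected, [mysql_crd])
backup_kinds = [r["kind"] for r in backup_set]
```

Without the appended CRD, a new cluster would reject the `MySQLCluster` object because its type would be unknown.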

Does a complete backup guarantee a complete recovery?

Actually, improper recovery orders can also result in the loss of business operational states. For instance, if MySQL’s StatefulSet (STS) is recovered first, the Kubernetes STS Controller will automatically create replicas (Pods) based on the STS’s replica count and template. As a result, the configuration information patched onto the backed-up Pod resources would be overwritten and lost.

  • By default, the backup center maintains a recovery order for official Kubernetes resources in which resources that controllers would otherwise create dynamically (such as Pods) are restored before the controllers that own them.
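
A minimal sketch of ordered recovery, assuming a fixed priority list of kinds: the list `RESTORE_PRIORITY` below is illustrative and is not the backup center's actual order. It only demonstrates the point made above, that backed-up Pods are restored before the StatefulSet that would otherwise recreate them.

```python
# Illustrative priority list (an assumption, not the product's real ordering).
RESTORE_PRIORITY = [
    "CustomResourceDefinition",
    "Namespace",
    "PersistentVolume",
    "PersistentVolumeClaim",
    "Secret",
    "ConfigMap",
    "Pod",
    "StatefulSet",
    "Service",
    "Ingress",
]


def restore_order(resources):
    """Sort resources by kind priority; unknown kinds go last, in original order."""
    rank = {kind: i for i, kind in enumerate(RESTORE_PRIORITY)}
    return sorted(resources, key=lambda r: rank.get(r["kind"], len(RESTORE_PRIORITY)))


backup = [
    {"kind": "StatefulSet", "name": "mysql"},
    {"kind": "Service", "name": "mysql"},
    {"kind": "Pod", "name": "mysql-0"},
    {"kind": "Secret", "name": "mysql-auth"},
    {"kind": "PersistentVolumeClaim", "name": "data-mysql-0"},
    {"kind": "CronJob", "name": "mysql-dump"},
]

ordered_kinds = [r["kind"] for r in restore_order(backup)]
```

Because Python's `sorted` is stable, resources of the same kind keep the order they had in the backup.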

Finally, if you attempt to use kubectl get -oyaml to obtain runtime resource configurations, you will find that before applying this configuration file, you need to manually clean up or adjust a large number of fields. For example, if the nodeName field appended by the controller after scheduling is retained during recovery, the scheduling phase will be skipped, leading to the inability to launch the application in a new cluster.

  • Similarly, the backup center is compatible with the Kubernetes ecosystem and performs default adjustments to resources. If users have specific adjustment needs, they can customize correction strategies via configuration.
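
To make the manual cleanup concrete, here is a sketch of stripping runtime fields from an exported Pod before it is re-applied. The field list is a common subset chosen for this example and is an assumption; it is not the full set of adjustments the backup center performs.

```python
import copy

# Server-populated metadata that should not be carried into a new cluster.
RUNTIME_METADATA_FIELDS = (
    "uid",
    "resourceVersion",
    "creationTimestamp",
    "generation",
    "managedFields",
    "selfLink",
)


def sanitize_for_restore(resource):
    """Return a copy of the resource without runtime-only fields."""
    cleaned = copy.deepcopy(resource)
    cleaned.pop("status", None)
    meta = cleaned.get("metadata", {})
    for field in RUNTIME_METADATA_FIELDS:
        meta.pop(field, None)
    if cleaned.get("kind") == "Pod":
        # Dropping nodeName lets the scheduler place the Pod in the new cluster.
        cleaned.get("spec", {}).pop("nodeName", None)
    return cleaned


exported_pod = {
    "kind": "Pod",
    "metadata": {"name": "mysql-0", "uid": "1234", "resourceVersion": "987"},
    "spec": {"nodeName": "node-a", "containers": [{"name": "mysql", "image": "mysql:8.0"}]},
    "status": {"phase": "Running"},
}

clean_pod = sanitize_for_restore(exported_pod)
```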

All automatic adjustments for backup and recovery are strongly dependent on the version of Kubernetes. The backup center, while remaining compatible with older Kubernetes cluster versions, will continue to iterate alongside the community.

Challenges and Solutions for Container Native Storage Volume Data Disaster Recovery with the Backup Center

When deploying MySQL, you might need cloud disks to store the actual database data, and NAS and OSS to store runtime information such as logs.

Different storage systems have varying native disaster recovery methods, such as snapshots for cloud disks and recycle bins for NAS.

  • The backup center abstracts different types of storage volumes into block storage and file system storage, shielding users from the implementation of underlying storage backup. When backing up storage volume data, users do not need to concern themselves with the type of underlying storage or configure each one individually. All storage volume backup strategies are consistent.

Even if storage sources are backed up and recovered, applications cannot directly use the recovered storage sources like cloud disks. This is because applications require PVC and PV resources as a bridge to connect with the underlying storage layer. In Kubernetes clusters, applications specify mounting information using PVC, which has a one-to-one binding relationship with PV, and PV records the actual storage source information, such as the cloud disk instance ID.

  • The backup center, when backing up storage volume data, targets PVCs for both backup and recovery. This means it backs up both PVC and PV configurations simultaneously. During recovery, it adjusts the storage information of PV and restores PV and PVC together, enabling applications to reuse them seamlessly.

For ACK users who previously used the deprecated FlexVolume storage driver, or for those migrating from IDCs or third-party clouds to ACK, there is a challenge of inconsistent storage drivers within Kubernetes clusters.

  • For ACK clusters, the backup center supports backing up data from ACK FlexVolume, NFS, Ceph, and other off-tree Kubernetes drivers, as well as mainstream cloud vendor CSI, and restoring it to ACK CSI. During the recovery process, it translates fields on PV’s volumeSource to ACK-maintained CSI.
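
The field translation can be sketched roughly as below. `flexVolume` and `csi` (with `driver`, `volumeHandle`, and `fsType`) are standard PersistentVolume spec fields; the driver names in `DRIVER_MAP`, the `volumeId` option key, and the disk ID are assumptions made for this example and should be checked against the ACK documentation. The function `flexvolume_pv_to_csi` is hypothetical.

```python
import copy

# Assumed mapping from a FlexVolume driver name to the corresponding CSI driver.
DRIVER_MAP = {"alicloud/disk": "diskplugin.csi.alibabacloud.com"}


def flexvolume_pv_to_csi(pv):
    """Return a copy of the PV with spec.flexVolume rewritten as spec.csi."""
    converted = copy.deepcopy(pv)
    spec = converted["spec"]
    flex = spec.pop("flexVolume", None)
    if flex is None:
        return converted  # nothing to translate
    csi_driver = DRIVER_MAP.get(flex["driver"])
    if csi_driver is None:
        raise ValueError("no CSI mapping for driver " + flex["driver"])
    csi = {"driver": csi_driver, "volumeHandle": flex["options"]["volumeId"]}
    if "fsType" in flex:
        csi["fsType"] = flex["fsType"]
    spec["csi"] = csi
    return converted


flex_pv = {
    "kind": "PersistentVolume",
    "metadata": {"name": "mysql-data"},
    "spec": {
        "capacity": {"storage": "20Gi"},
        "flexVolume": {
            "driver": "alicloud/disk",
            "fsType": "ext4",
            "options": {"volumeId": "d-example123"},  # made-up disk ID
        },
    },
}

csi_pv = flexvolume_pv_to_csi(flex_pv)
```

The PVC that binds to this PV does not change, which is why the application can keep mounting the same claim after the driver switch.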

Overview of the Backup Center Principles and Components

The above sections introduce the characteristics, challenges, and solutions provided by the backup center for containerized business disaster recovery. However, you might still have some questions:

Where is my backup data stored? Which cloud product ensures the SLA for backup data?

If I use scheduled backups, such as once daily at midnight, won’t the data volume and overhead be significant?

These are concerns that users often have.

In fact, each backup (and recovery) task can be decomposed into up to three sub-tasks: backup sub-tasks for cluster resources, block storage data, and file system storage data. The backup strategies for these three sub-tasks, such as TTL, frequency, and target applications, are consistent.
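
The decomposition can be illustrated with a short sketch in which one policy object is shared by up to three sub-tasks. The `BackupPolicy` fields, the `plan_subtasks` function, and the volume names are hypothetical and only mirror the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupPolicy:
    ttl_hours: int      # how long the backup is retained
    schedule: str       # cron expression for the backup frequency
    namespaces: tuple   # target applications, selected by namespace


def plan_subtasks(policy, volumes):
    """Split one backup task into up to three sub-tasks that share one policy."""
    subtasks = [{"type": "cluster-resources", "policy": policy, "volumes": []}]
    for storage in ("block", "filesystem"):
        names = [v["name"] for v in volumes if v["storage"] == storage]
        if names:
            subtasks.append({"type": storage + "-storage", "policy": policy, "volumes": names})
    return subtasks


# Daily at midnight, as in the question above; all values are made up.
policy = BackupPolicy(ttl_hours=720, schedule="0 0 * * *", namespaces=("prod",))
volumes = [
    {"name": "mysql-data", "storage": "block"},
    {"name": "mysql-logs", "storage": "filesystem"},
]

subtasks = plan_subtasks(policy, volumes)
subtask_types = [s["type"] for s in subtasks]
```

A stateless application with no volumes produces only the cluster resource sub-task.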

Cluster resource backup: It is built on the open-source Velero project and integrated with the Alibaba Cloud ecosystem and ACK system components via internal plugins. When users perform a backup, they only need to focus on their business. All cluster resources (in multiple API versions) are backed up and stored in the backup vault, which is linked to the OSS bucket provided by the user.

Block storage data protection: Based on the cloud disk snapshot feature of Alibaba Cloud, it offers quick backup and guarantees data consistency for individual disks (support for multiple disks through snapshot-consistent groups is pending). For the widely used ESSD cloud disks, instant access (IA) snapshots are enabled by default, so a backup becomes available for recovery within seconds. Additionally, snapshots are incremental.

File system data protection: Based on the Alibaba Cloud backup service, it provides incremental backup, compression, deduplication, and encryption to speed up backup and recovery. Through the storage class conversion feature of the backup center, data backed up from a file system can be restored to a different storage type, such as an ext4-mounted cloud disk or NAS.

Application of the Backup Center in Hybrid Cloud Scenarios

In-place Disaster Recovery

The disaster recovery capabilities on the cloud can also assist on-premises Kubernetes clusters.

By connecting on-premises IDC clusters or Kubernetes clusters from other cloud vendors to ACK One’s registered clusters and deploying the backup center components, cloud-based disaster recovery can be easily achieved. Similar to ACK clusters, cluster resources will be backed up to Alibaba Cloud OSS, and storage volume data will be backed up to Alibaba Cloud backup service.

Cloud Migration

By backing up in the registered cluster and restoring in the ACK cluster, easy migration to the cloud can be achieved.

For storage volumes of stateful applications, data migration to the cloud can also be realized through the storage class conversion feature.

Overview of registered clusters in ACK One: https://www.alibabacloud.com/help/en/ack/overview-9

More Scenarios and Capabilities of Interest to Users

Since its release, an increasing number of users have been using the backup center to achieve seamless migration across major Kubernetes versions, migration across VPCs, and migration of IDCs to the cloud. Here is a summary of some capabilities that users are particularly interested in:

Storage class conversion:

After backing up file system data, the storage media can be changed through storage class conversion. This has already been mentioned, so it won't be elaborated further here.

API version switching with cluster version:

Kubernetes itself can convert resources between API versions. Building on this capability, Velero supports, for example, backing up a Deployment in a 1.16 cluster and recovering it in a 1.30 cluster:

During backup, all API versions of the resources are backed up, such as Deployment being backed up with extensions/v1beta1, apps/v1beta1, apps/v1beta2, and apps/v1.

During recovery, the preferred version of the recovery cluster is used first, that is, apps/v1.

If the recommended version does not exist in the backup, then a compatible version supported by the recovery cluster is restored.
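
The version selection described in these steps can be sketched as a simple lookup. The function `choose_restore_version` is hypothetical and is not Velero's code; it assumes the recovery cluster's served versions are listed with the preferred version first. The Deployment versions are the ones named in the text, and the Ingress versions reflect the removal of the v1beta1 Ingress APIs in Kubernetes 1.22.

```python
def choose_restore_version(backed_up_versions, cluster_versions):
    """Pick the API version to restore.

    cluster_versions lists what the recovery cluster serves, preferred first.
    Returns None when the backup holds no version the cluster can serve.
    """
    for version in cluster_versions:
        if version in backed_up_versions:
            return version
    return None


deployment_backup = ["extensions/v1beta1", "apps/v1beta1", "apps/v1beta2", "apps/v1"]
deployment_choice = choose_restore_version(deployment_backup, ["apps/v1"])

# An Ingress backed up only in beta versions has no match in a 1.22+ cluster.
ingress_backup = ["extensions/v1beta1", "networking.k8s.io/v1beta1"]
ingress_choice = choose_restore_version(ingress_backup, ["networking.k8s.io/v1"])
```

A result of `None` corresponds to the case where plain version matching is not enough and the resource needs extra conversion or manual recreation.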

In version 1.16, most important groups such as core, apps, networking.k8s.io, and storage.k8s.io already have resources at v1, but some resources, such as Ingress, cannot be restored directly in clusters of version 1.22 or later. On top of the community efforts, the backup center further enhances compatibility for such resources, enabling automatic switching (or manual creation by the customer).

Traffic switchover for external services:

Different recovery scenarios may have varying requirements for Service restoration, such as whether to reuse ports, load balancer instances, or enable forced listening. These requirements can be addressed through configurable adjustment strategies. For cluster business migration scenarios, this feature ensures that the application is launched normally and traffic can be manually switched after business checks, without affecting external services.

Solutions for handling existing resources during recovery:

By default, the safer logic of skipping resources with the same name is applied. For scenarios requiring upgrades or changes, the backup center can instead attempt to update existing resources using a Kubernetes JSON merge patch.
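
The difference between the two behaviors can be sketched as follows. `json_merge_patch` follows the JSON Merge Patch rules of RFC 7386; the policy names `"skip"` and `"update"`, the `restore_resource` function, and the in-memory `cluster` dictionary are assumptions for illustration. Applying the backed-up object directly as the patch is a simplification of however the real patch is computed.

```python
def json_merge_patch(target, patch):
    """Apply a JSON merge patch (RFC 7386) and return the merged value."""
    if not isinstance(patch, dict):
        return patch  # scalars and lists replace the target wholesale
    result = dict(target) if isinstance(target, dict) else {}
    for key, value in patch.items():
        if value is None:
            result.pop(key, None)  # null removes the key
        else:
            result[key] = json_merge_patch(result.get(key), value)
    return result


def restore_resource(cluster, backup, policy="skip"):
    """Restore one resource into a dict keyed by name; return what happened."""
    name = backup["metadata"]["name"]
    existing = cluster.get(name)
    if existing is None:
        cluster[name] = backup
        return "created"
    if policy == "update":
        cluster[name] = json_merge_patch(existing, backup)
        return "updated"
    return "skipped"


cluster = {
    "mysql-conf": {
        "metadata": {"name": "mysql-conf"},
        "data": {"max_connections": "100", "slow_query_log": "ON"},
    }
}
backup = {"metadata": {"name": "mysql-conf"}, "data": {"max_connections": "200"}}

first = restore_resource(cluster, backup)            # default: leave existing as-is
unchanged = cluster["mysql-conf"]["data"]["max_connections"]
second = restore_resource(cluster, backup, policy="update")
merged_data = cluster["mysql-conf"]["data"]
```

With a merge patch, keys present only in the existing resource (here `slow_query_log`) survive the update, while keys present in the backup overwrite the current values.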

Demo: Business Backup and Recovery from a 1.16 Cluster to a 1.30 Cluster

The video demo showcases the functions mentioned above. The configurations and business requirements for the backup and recovery clusters are as follows:

(Figure: configurations and business requirements of the backup and recovery clusters.)

In the demo:

First, the installation status of components in the 1.16 cluster is shown, along with the deployed MySQL application. Focus is placed on checking the load balancer ID of the LoadBalancer type Service, the volume type, and simulated DB data. (The process of excluding OSS storage volume data backup by labeling the MySQL application is omitted.)

The backup is performed in the backup center, verifying the status of the backed-up storage volumes and cluster resources.

Then, the installation status of components in the 1.30 cluster is shown. In the backup center, it is confirmed that the backup has been synchronized to the current cluster and restored (in some scenarios, related ConfigMaps need to be pre-created for service traffic switchover).

The storage class conversion of the volume is checked, along with the restored application, emphasizing that the load balancer ID of the LoadBalancer type Service remains unchanged and the simulated DB data is recovered.

One Figure to Summarize the Disaster Recovery Features and Solutions of Kubernetes Cluster Services

(Figure: summary of the disaster recovery features and solutions for Kubernetes cluster services.)

Alibaba Container Service