This document provides best practices for online service high availability. These practices help minimize service interruptions and improve system stability and reliability.
Quick start guide
Before you learn about the best practices for high availability, read the following documents to get started with ApsaraDB for SelectDB and understand its features.
(Required) Quick Start
Covers the basic concepts of ApsaraDB for SelectDB, the process of purchasing and using an instance, and key considerations for database table design, so that you can get started quickly.
(Optional) Data migration
Migrate data from various data sources, such as MySQL, PostgreSQL, and Doris, to ApsaraDB for SelectDB.
(Optional) Performance Testing
Run performance tests using Star Schema Benchmark, TPC-H Benchmark, and TPC-DS Benchmark.
(Optional) Solutions
Release and change specifications
(Required) Test in advance
Before you release a new feature, thoroughly test it on a test instance. Before you launch a high-load service, run additional performance stress tests to evaluate its performance in the production environment.
(Required) Grayscale release
Release changes during off-peak hours. Use a phased grayscale release, such as 10% → 50% → 100%, and observe the service for 10 to 30 minutes after each phase before you proceed to the next. Off-peak hours may not reveal every potential issue, so also monitor the service closely during the first business peak after the release.
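The phased percentages above can be applied on the application side by assigning each client to a stable bucket. The following is a minimal sketch, not an ApsaraDB for SelectDB feature: the function names `bucket_for` and `in_rollout` and the use of a client ID as the routing key are assumptions made for illustration.

```python
import hashlib

# Phase percentages taken from the example in the text: 10% -> 50% -> 100%.
ROLLOUT_PHASES = (10, 50, 100)


def bucket_for(client_id: str) -> int:
    """Map a client ID to a stable bucket in the range 0..99 (hypothetical helper)."""
    digest = hashlib.sha256(client_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def in_rollout(client_id: str, percent: int) -> bool:
    """Return True if this client should receive the new release at `percent`."""
    return bucket_for(client_id) < percent


if __name__ == "__main__":
    for percent in ROLLOUT_PHASES:
        enabled = sum(in_rollout(f"client-{i}", percent) for i in range(1000))
        print(f"phase {percent}%: {enabled} of 1000 sample clients enabled")
```

Because the bucket depends only on the client ID, a client that is enabled at 10% stays enabled at 50% and 100%, which keeps behavior consistent for each client across phases.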
O&M recommendations
(Required) Business monitoring
ApsaraDB for SelectDB provides extensive monitoring and alert features. To monitor your services from a business perspective, combine multiple key metrics. Pay special attention to average query response time, 99th percentile query response time, query success rate, data import speed, CPU utilization, and memory usage. For more information, see Set alert rules.
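To illustrate why the average and the 99th percentile response time are worth watching together, the sketch below computes both, plus the query success rate, from a batch of samples. It assumes that you collect per-query latencies on the application side; the function names and the nearest-rank percentile method are illustrative choices, not how the built-in monitoring calculates its metrics.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_ms, failed_queries):
    """Summarize successful-query latencies and the overall success rate."""
    total = len(latencies_ms) + failed_queries
    return {
        "avg_ms": sum(latencies_ms) / len(latencies_ms),
        "p99_ms": percentile(latencies_ms, 99),
        "success_rate": len(latencies_ms) / total,
    }


if __name__ == "__main__":
    # 98 fast queries and 2 slow ones: the average hides the slow tail.
    samples = [10] * 98 + [1000] * 2
    print(summarize(samples, failed_queries=0))
```

In this sample the average stays below 30 ms while the 99th percentile is 1000 ms, so an alert on the average alone would miss the slow queries.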
Capacity management
For high-load services, perform optimization and stress testing in advance. This helps you evaluate the maximum queries per second (QPS) that your resources can support. Evaluate and scale out your resources based on business growth or before promotional events.
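Once stress testing gives you a maximum QPS figure, a rough sizing calculation can be sketched as follows. This is only an illustration: the notion of a "capacity unit", the assumption that throughput scales linearly with the number of units, and the 70% default target utilization are all assumptions, so validate any result with a stress test at the new size.

```python
import math


def required_capacity_units(expected_peak_qps, max_qps_per_unit, target_utilization=0.7):
    """Estimate how many capacity units keep the peak load at or below the target utilization.

    Assumes linear scaling, which real workloads only approximate.
    """
    if expected_peak_qps < 0 or max_qps_per_unit <= 0:
        raise ValueError("QPS figures must be positive")
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    usable_qps_per_unit = max_qps_per_unit * target_utilization
    return max(1, math.ceil(expected_peak_qps / usable_qps_per_unit))


if __name__ == "__main__":
    # Hypothetical figures: 3000 QPS expected at peak, 1000 QPS measured per unit.
    print(required_capacity_units(3000, 1000, target_utilization=0.75))
```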
Version updates
ApsaraDB for SelectDB continuously fixes bugs through minor version updates, whose version numbers consist of three or four parts. For example, version 4.0.4.2 was released on February 6, 2025, and over the next six months it was updated 12 times, reaching version 4.0.6.1. We recommend that you promptly upgrade to the latest patch version for your current minor version so that you avoid known issues. Before you upgrade, test the new version in a test environment. In an emergency, contact technical support for a rollback.
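If you track instance versions in your own tooling, compare them numerically rather than as strings. The sketch below parses three-part and four-part version numbers; treating a missing fourth part as 0 is an assumption made here for comparison purposes, and the function names are hypothetical.

```python
def parse_version(text):
    """Parse a version such as '4.0.4' or '4.0.4.2' into a 4-tuple of integers."""
    parts = [int(p) for p in text.strip().split(".")]
    if len(parts) not in (3, 4):
        raise ValueError(f"expected three or four parts, got {text!r}")
    # Assumption: a missing fourth part is treated as 0.
    while len(parts) < 4:
        parts.append(0)
    return tuple(parts)


def is_newer(candidate, current):
    """Return True if `candidate` is a later version than `current`."""
    return parse_version(candidate) > parse_version(current)


if __name__ == "__main__":
    print(is_newer("4.0.6.1", "4.0.4.2"))
```

Numeric comparison matters because a plain string comparison would rank "4.0.10" below "4.0.9".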
Service isolation
For completely independent business scenarios, use separate instances. For scenarios where different services use the same data, use a multi-compute group architecture. This architecture provides multiple physical compute queues within a single instance, and these queues share data from the read-write instance.
Operational drills
Instance changes
Common O&M operations include upgrades and scale-outs. Rehearse these operations in advance to verify the impact of instance changes on your services. During changes to an ApsaraDB for SelectDB instance, transient connection interruptions can occur. Your application must be able to handle and retry failed connections.
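One way to handle transient connection interruptions is to wrap database calls in a retry loop with exponential backoff and jitter. The following is a minimal sketch: `call_with_retry` and its parameters are hypothetical, and the default `retry_on` tuple uses Python's built-in exception types, so replace it with the connection error classes that your database driver actually raises.

```python
import random
import time


def call_with_retry(operation, attempts=5, base_delay=0.5, max_delay=8.0,
                    retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call `operation()` and retry it on connection-type errors.

    Waits a random time up to base_delay * 2**attempt (capped at max_delay)
    between tries, and re-raises the last error when all attempts fail.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))


if __name__ == "__main__":
    calls = {"count": 0}

    def flaky_query():
        # Simulates a query that fails twice during an instance change.
        calls["count"] += 1
        if calls["count"] < 3:
            raise ConnectionError("connection reset")
        return "ok"

    print(call_with_retry(flaky_query, base_delay=0.01))
```

Retry only operations that are safe to repeat, such as read queries or idempotent writes, because a failed connection does not tell you whether the server already applied the request.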
Fault recovery
ApsaraDB for SelectDB provides several fault recovery solutions. For example, you can quickly switch to a new compute group if one fails, restart a failed instance, or restart a stalled compute group. We recommend that you rehearse these solutions in advance to familiarize yourself with the procedures.
High availability architecture recommendations
(Optional) Service throttling
ApsaraDB for SelectDB supports Workload Groups, which are logical task queues. You can use Workload Groups to control the resources used by different types of requests or services and limit their maximum resource usage. This provides service throttling capabilities during traffic bursts.
(Optional) Multi-zone disaster recovery
ApsaraDB for SelectDB supports multi-zone disaster recovery deployments. If a zone fails, the system performs an automatic switchover. The recovery time objective (RTO) for the switchover is approximately 10 seconds.
(Optional) Data backup and recovery
For highly sensitive online services, we recommend that you enable regular daily backups. You can also perform manual backups before you make important changes. In the event of a critical failure, you can use backup data to quickly recover your services.