ApsaraDB for SelectDB: Best practices for online service high availability

Last Updated: Oct 13, 2025

This document provides best practices for online service high availability. These practices help minimize service interruptions and improve system stability and reliability.

Quick start guide

Before you learn about the best practices for high availability, read the following documents to get started with ApsaraDB for SelectDB and understand its features.

  • (Required) Quick Start

    This document introduces the basic concepts of ApsaraDB for SelectDB, walks through the process of purchasing and using an instance, and highlights key considerations for database table design, so that you can get started quickly.

  • (Optional) Data migration

    Migrate data from various data sources, such as MySQL, PostgreSQL, and Doris, to ApsaraDB for SelectDB.

  • (Optional) Performance Testing

    Run performance tests using Star Schema Benchmark, TPC-H Benchmark, and TPC-DS Benchmark.

  • (Optional) Solutions

    Review solutions for observability and data lakehouse scenarios.

Release and change specifications

  • (Required) Test in advance

    Before you release a new feature, thoroughly test it on a test instance. Before you launch a high-load service, run additional performance stress tests to evaluate its performance in the production environment.

  • (Required) Grayscale release

    Release changes during off-peak hours. Use a phased grayscale release, such as 10%→50%→100%, and observe the service for 10 to 30 minutes between phases. Also monitor the service closely during the first business peak after the release, because off-peak traffic may not reveal potential issues.
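The phased release described above can be sketched on the application side as follows. This is a minimal illustration, not a feature of ApsaraDB for SelectDB: the names `bucket` and `use_new_release` are hypothetical, and the sketch assumes that each request carries a stable identifier (for example, a user ID) that can be hashed into one of 100 buckets. Raising the rollout percentage from 10 to 50 to 100 then only ever adds traffic to the new release and never moves an identifier back.

```python
import hashlib

# Rollout percentages taken from the example in the text.
PHASES = [10, 50, 100]


def bucket(request_id: str) -> int:
    """Map an identifier to a stable bucket in the range [0, 100)."""
    digest = hashlib.sha256(request_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def use_new_release(request_id: str, rollout_percent: int) -> bool:
    """Return True if this identifier falls inside the current rollout phase.

    Because the bucket is stable, an identifier that is included at 10%
    is still included at 50% and at 100%.
    """
    return bucket(request_id) < rollout_percent


if __name__ == "__main__":
    for percent in PHASES:
        routed = sum(use_new_release(f"user-{i}", percent) for i in range(1000))
        print(f"phase {percent}%: {routed} of 1000 sample IDs use the new release")
```

Between phases, the observation window of 10 to 30 minutes still applies. The sketch only decides which traffic sees the change; it does not replace monitoring.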

O&M recommendations

  • (Required) Business monitoring

    ApsaraDB for SelectDB provides extensive monitoring and alert features. To monitor your services from a business perspective, combine multiple key metrics. Pay special attention to average query response time, 99th percentile query response time, query success rate, data import speed, CPU utilization, and memory usage. For more information, see Set alert rules.

  • Capacity management

    For high-load services, perform optimization and stress testing in advance. This helps you evaluate the maximum queries per second (QPS) that your resources can support. Evaluate and scale out your resources based on business growth or before promotional events.

  • Version updates

    ApsaraDB for SelectDB continuously fixes bugs through minor version updates, whose version numbers consist of three or four parts. For example, version 4.0.4.2 was released on February 6, 2025, and over the next six months it was updated 12 times, reaching version 4.0.6.1. We recommend that you promptly upgrade to the latest patch version of your current minor version to avoid known issues. Before you upgrade, test the new version in a test environment. In an emergency, contact technical support for a rollback.

  • Service isolation

    For completely independent business scenarios, use separate instances. For scenarios where different services use the same data, use a multi-compute group architecture. This architecture provides multiple physical compute queues within a single instance, and these queues share data from the read-write instance.
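To illustrate the business monitoring item above, the following sketch computes three of the listed metrics (average query response time, 99th percentile query response time, and query success rate) from query records collected by the application. It is an assumption-laden example: `QuerySample` and `summarize` are hypothetical names, the samples are assumed to come from your own client-side logging, and the percentile uses the nearest-rank method, which may differ from how the console computes it.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class QuerySample:
    """One observed query (hypothetical client-side record)."""
    latency_ms: float
    succeeded: bool


def summarize(samples: List[QuerySample]) -> Dict[str, float]:
    """Compute average latency, p99 latency, and success rate."""
    if not samples:
        raise ValueError("at least one sample is required")

    latencies = sorted(s.latency_ms for s in samples)
    n = len(latencies)

    # Nearest-rank percentile: rank = ceil(0.99 * n), using integer math
    # to avoid floating-point rounding surprises.
    rank = (99 * n + 99) // 100

    return {
        "avg_ms": sum(latencies) / n,
        "p99_ms": latencies[rank - 1],
        "success_rate": sum(1 for s in samples if s.succeeded) / n,
    }


if __name__ == "__main__":
    demo = [QuerySample(latency_ms=float(i), succeeded=(i != 100))
            for i in range(1, 101)]
    print(summarize(demo))
```

Tracking the average and the 99th percentile together is useful because a small share of slow queries can leave the average almost unchanged while clearly raising the tail latency.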

Operational drills

  • Instance changes

    Common O&M operations include upgrades and scale-outs. Rehearse these operations in advance to verify the impact of instance changes on your services. During changes to an ApsaraDB for SelectDB instance, transient connection interruptions can occur. Your application must be able to handle and retry failed connections.

  • Fault recovery

    ApsaraDB for SelectDB provides several fault recovery solutions. For example, you can quickly switch to a new compute group if one fails, restart a failed instance, or restart a stalled compute group. We recommend that you rehearse these solutions in advance to familiarize yourself with the procedures.
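The requirement that applications handle and retry failed connections during instance changes can be sketched as a small retry helper with exponential backoff and jitter. This is a generic illustration rather than an official client: `call_with_retry` and its parameters are hypothetical, and the exception types worth retrying depend on the database driver you use, so `retry_on` defaults to Python's built-in `ConnectionError` only as a placeholder.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retry(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.2,
    max_delay: float = 5.0,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `operation`, retrying on transient connection errors.

    The wait doubles after each failed attempt (capped at `max_delay`)
    and is randomized so that many clients do not reconnect at once.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay * random.uniform(0.5, 1.0))
    raise RuntimeError("max_attempts must be at least 1")


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky_query() -> str:
        # Simulates two transient failures followed by success.
        state["calls"] += 1
        if state["calls"] < 3:
            raise ConnectionError("connection reset during instance change")
        return "ok"

    print(call_with_retry(flaky_query))
```

Automatic retries are safest for idempotent operations such as read queries. For writes, confirm that repeating the statement cannot apply the same change twice before you wrap it in a helper like this.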

High availability architecture recommendations

  • (Optional) Service throttling

    ApsaraDB for SelectDB supports Workload Groups, which are logical task queues. You can use Workload Groups to control the resources used by different types of requests or services and limit their maximum resource usage. This provides service throttling capabilities during traffic bursts.

  • (Optional) Multi-zone disaster recovery

    ApsaraDB for SelectDB supports multi-zone disaster recovery deployments. If a zone fails, the system performs an automatic switchover. The recovery time objective (RTO) for the switchover is approximately 10 seconds.

  • (Optional) Data backup and recovery

    For highly sensitive online services, we recommend that you enable regular daily backups. You can also perform manual backups before you make important changes. In the event of a critical failure, you can use backup data to quickly recover your services.