Restarting nodes directly may cause exceptions in a cluster. Based on Alibaba Cloud use cases, this document introduces best practices for restarting nodes in situations such as performing active operation and maintenance (O&M) on Container Service.

Check the high availability configurations of your business

Before restarting Container Service nodes, we recommend that you check or modify the following business configurations, so that restarting a single node does not cause exceptions or impair business availability.

  • Configure a data persistence policy

    We recommend that you persist important data, such as logs and business configurations, to external data volumes. Then, when a container is re-created, deleting the former container does not cause data loss.

    For more information about how to use Container Service data volumes, see Manage data volumes.

  • Configure a restart policy

    We recommend that you configure the restart: always restart policy for the corresponding business services so that their containers are started again automatically after the nodes restart.

  • Configure a high availability policy

    We recommend that you configure affinity and mutual exclusion policies for the corresponding business in line with your product architecture. Examples are high availability scheduling (availability:az property), specified node scheduling (affinity and constraint properties), and specified nodes scheduling (constraint property). This way, restarting a single node does not cause a business exception. For example, for a database business, we recommend an active-standby or multi-instance deployment, combined with the preceding properties, to make sure that different instances run on different nodes and that related nodes are not restarted at the same time.
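
The three checks above can be sketched as a small pre-restart audit. This is only an illustration under stated assumptions: the dictionary layout (a restart key and a volumes key per service), the service names, and both function names are hypothetical and are not part of any Container Service API.

```python
from typing import Dict, List


def audit_service(name: str, service: Dict) -> List[str]:
    """Return warnings for one service definition (hypothetical dict layout)."""
    warnings = []
    if service.get("restart") != "always":
        warnings.append(f"{name}: restart policy is not 'always'")
    if not service.get("volumes"):
        warnings.append(f"{name}: no data volumes configured")
    return warnings


def nodes_shared_by_instances(placement: Dict[str, str]) -> List[str]:
    """Return nodes that host more than one instance of the same service.

    `placement` maps an instance name to the node it runs on.
    """
    counts: Dict[str, int] = {}
    for node in placement.values():
        counts[node] = counts.get(node, 0) + 1
    return sorted(node for node, n in counts.items() if n > 1)


# Example data, made up for illustration.
services = {
    "db": {"restart": "always", "volumes": ["db-data:/var/lib/mysql"]},
    "web": {"image": "nginx"},
}
for svc_name, definition in services.items():
    for warning in audit_service(svc_name, definition):
        print(warning)

# Two database instances on the same node defeat the high availability policy.
print(nodes_shared_by_instances({"db-1": "node-a", "db-2": "node-a"}))
```

An empty result from both functions suggests that the service is ready for a node restart. Any warning points to a configuration to fix first.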

Best practices

We recommend that you first check the high availability configurations of your business as described in the preceding section. Then, perform the following steps in sequence on each node. Do not perform operations on multiple nodes at the same time.

  1. Back up snapshots

    We recommend that you create up-to-date snapshots for all the related disks of the node and then back up the snapshots. A server that has not been restarted for a long time may fail to start after it is shut down, which impairs business availability. Backing up the snapshots protects you against this situation.

  2. Verify that the business container configurations work

    For a swarm cluster, restart the corresponding business containers on the node and make sure that the containers start again normally.

  3. Verify that Docker Engine runs normally

    Restart the Docker daemon and make sure that Docker Engine restarts normally.

  4. Perform related O&M

    Perform the planned O&M, such as updating business code, installing system patches, and adjusting system configurations.

  5. Restart nodes

    Restart the node normally from the console or from within the operating system.

  6. Check the status after the restart

    After the node restarts, check the health status of the node and the running status of the business containers in the Container Service console.
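
The one-node-at-a-time rule can be sketched as a simple loop. This is a minimal illustration, not an official tool: rolling_restart, make_step, and the is_healthy callback are hypothetical names, and the steps here only record what would happen rather than calling any real console or API.

```python
from typing import Callable, Iterable, List


def rolling_restart(
    nodes: Iterable[str],
    steps: List[Callable[[str], None]],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Run every step on one node at a time; stop at the first unhealthy node.

    Returns the nodes that completed all steps and passed the health check.
    """
    done = []
    for node in nodes:
        for step in steps:
            step(node)  # a step should raise on failure, which halts the rollout
        if not is_healthy(node):
            raise RuntimeError(f"{node} is unhealthy after restart; stopping")
        done.append(node)
    return done


# Placeholder steps that only record the action, in the order of the list above.
log: List[str] = []


def make_step(label: str) -> Callable[[str], None]:
    return lambda node: log.append(f"{node}: {label}")


steps = [
    make_step(label)
    for label in (
        "back up snapshots",
        "restart business containers",
        "restart Docker daemon",
        "perform planned O&M",
        "restart node",
    )
]

finished = rolling_restart(["node-a", "node-b"], steps, is_healthy=lambda node: True)
print(finished)
```

Because the loop stops as soon as a node fails its health check, the remaining nodes stay untouched and keep serving the business while the failed node is investigated.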