This topic describes the best practices for performing different operations on an application that is deployed in a Kubernetes cluster, and how to obtain the release notes of Enterprise Distributed Application Service (EDAS) in a timely manner.
Develop an application
You can develop applications by using different developer tools to improve development and deployment efficiency.
Application deployment and joint debugging
Cloud Toolkit
Alibaba Cloud Toolkit is an integrated development environment (IDE) plug-in that Alibaba Cloud provides for developers to develop and deploy cloud applications to EDAS in an efficient way. For more information, see Alibaba Cloud Toolkit overview.
toolkit-maven-plugin
The toolkit-maven-plugin plug-in facilitates the automatic deployment of applications. For more information, see toolkit-maven-plugin overview.
Continuous integration and deployment
Jenkins
Jenkins is an open source tool that helps developers continuously and automatically build and test software projects and monitor the running status of external tasks. For more information, see Jenkins overview.
Apsara DevOps
Apsara DevOps can be deployed in public clouds, private clouds, and hybrid clouds. It provides flexible and easy-to-use continuous integration, continuous verification, and continuous release features to help developers deploy applications to EDAS.
Build a cloud infrastructure environment
Terraform
Terraform is an open source tool that helps developers safely and efficiently provision and manage cloud infrastructure on Alibaba Cloud. For more information, see Terraform overview.
Deploy an application
This section describes only the best practices for deploying an application. For more information about how to deploy an application, see Overview.
Basic application information
If you use images to deploy applications, we recommend that you use Alibaba Cloud Container Registry Enterprise Edition (ACR EE) to manage images.
Compared with Personal Edition, ACR EE has the following benefits:
Improves the efficiency of distributing application images.
Supports image scanning to detect vulnerabilities and generates vulnerability reports from multiple perspectives to ensure storage and content security.
Provides network access control for application images to ensure secure access to the images.
For more information about Container Registry, see What is Container Registry?.
For more information about how to create an application image and upload it to Container Registry, see Create an application image.
Application configuration
We recommend that you configure two or more pod instances for an application.
Multiple pod instances prevent the failure of a single pod instance from bringing down the entire application.
We recommend that you properly configure the reserved CPU (Cores), the CPU limit (Cores), and the memory limit (MB) for the application.
Configure reasonable resource quotas and a quality of service (QoS) class for the pods of the application.
Note: For Java applications, the memory limit cannot be smaller than the maximum memory usage configured for the Java virtual machine (JVM), such as the maximum heap size. Otherwise, an out-of-memory (OOM) error occurs on the pod, and the pod restarts.
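For reference, the following sketch shows how these recommendations map to a native Kubernetes Deployment. The application name, image, and concrete values are placeholders, and the JAVA_OPTS environment variable takes effect only if your startup command uses it; in EDAS you typically set the pod count, reserved resources, and limits in the console instead of writing this YAML yourself.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-app            # placeholder application name
spec:
  replicas: 2               # two or more pod instances to avoid a single point of failure
  selector:
    matchLabels:
      app: demo-app
  template:
    metadata:
      labels:
        app: demo-app
    spec:
      containers:
      - name: demo-app
        image: registry.example.com/demo/demo-app:1.0.0   # placeholder image
        resources:
          requests:         # reserved CPU and memory
            cpu: "1"
            memory: 2Gi
          limits:           # equal requests and limits give the pod the Guaranteed QoS class
            cpu: "1"
            memory: 2Gi
        env:
        - name: JAVA_OPTS   # keep the JVM maximum heap below the memory limit to avoid OOM kills
          value: "-Xms1g -Xmx1536m"
```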
Advanced application configuration
Configure scheduling rules.
We recommend that you configure scheduling rules in Deploy in As Many Zones As Possible or Deploy to as Many Nodes as Possible mode for an application. When you deploy an application, use pod anti-affinity to spread pod instances across as many zones and nodes as possible. This prevents the failure of a single node or zone from bringing down the entire application. For more information, see Configure scheduling rules.
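If you manage scheduling rules yourself rather than through the EDAS console, pod anti-affinity in the pod template expresses the same intent. The following fragment is a minimal sketch; the application label is a placeholder.

```yaml
# Pod template fragment: prefer spreading replicas across zones and nodes (soft rules)
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app: demo-app                          # placeholder application label
        topologyKey: topology.kubernetes.io/zone   # prefer different zones
    - weight: 50
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app: demo-app
        topologyKey: kubernetes.io/hostname        # prefer different nodes
```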
We recommend that you set a reasonable log rolling policy for an application to clear logs in a timely manner.
If a pod instance prints excessive logs, the logs consume the disk space of the node. The node then enters the DiskPressure state, and pods on the node are evicted.
Configure proper application lifecycle probes.
For self-healing and graceful startup and shutdown of an application, the liveness probe settings must give the application enough time to start up normally. If the application takes a long time to start, configure a larger initial delay (initialDelaySeconds). For more information, see Configure application lifecycle hooks and probes.
If you deploy a non-microservice application and use a Service to expose the application, the readiness probe settings must accurately reflect the health of the application. This way, unhealthy pods are removed from the Service endpoints and stop receiving traffic, but the pods themselves are not deleted. For more information, see Configure application lifecycle hooks and probes and Add a Service.
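The following container fragment is a minimal sketch of such probes. The port, paths, and timing values are assumptions; tune initialDelaySeconds to the actual startup time of your application.

```yaml
# Container fragment: liveness and readiness probes
livenessProbe:
  httpGet:
    path: /healthz          # placeholder health-check endpoint
    port: 8080
  initialDelaySeconds: 60   # give a slow-starting application enough time before the first check
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /ready            # placeholder readiness endpoint
    port: 8080
  initialDelaySeconds: 20
  periodSeconds: 5
  failureThreshold: 3       # failing pods are removed from Service endpoints, not deleted
```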
Configure log collection rules.
Configure log collection rules to store the logs of pod instances in Simple Log Service (SLS) for troubleshooting. Pod instances are ephemeral and may be deleted and recreated during scheduling. Therefore, collect their logs in a centralized manner and store them persistently. For more information, see Configure log collection.
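If you configure collection outside the EDAS console, for example on a cluster where the Logtail component is installed, a custom resource of the following shape is one common way to ship container stdout to SLS. The API version, fields, and names below are assumptions based on the ACK Logtail (AliyunLogConfig) component; follow the linked documentation for the exact configuration.

```yaml
# Assumed sketch: collect container stdout into an SLS Logstore via the Logtail CRD
apiVersion: log.alibabacloud.com/v1alpha1   # assumption: the AliyunLogConfig CRD is installed
kind: AliyunLogConfig
metadata:
  name: demo-app-stdout                     # placeholder configuration name
spec:
  logstore: demo-app-stdout                 # placeholder Logstore name
  logtailConfig:
    inputType: plugin
    configName: demo-app-stdout
    inputDetail:
      plugin:
        inputs:
        - type: service_docker_stdout       # collect stdout and stderr of containers
          detail:
            Stdout: true
            Stderr: true
```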
Run an application
This section describes the best practices for running an application. If you plan to run a promotion or another high-traffic event, perform stress testing and configure auto scaling for the application in advance.
General running
Exercise caution when you edit the YAML file of an application.
For configuration items that can be modified during EDAS deployment, configure them in EDAS instead of editing the YAML file. If you edit the resources directly, the application restarts. If you still need to edit the YAML file, make sure that you are familiar with how Deployment resources are configured.
For example, if you edit the YAML file to modify the native update strategy (the strategy field) of a Deployment, the modification takes effect only once. When you deploy the application again, the release policy of EDAS (single-batch release, phased release, or canary release) is applied instead. Therefore, configure such items in EDAS whenever possible.
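For reference, this is the native Deployment field in question; any values that you set here by hand are overwritten by the release policy that EDAS applies on the next deployment. The values shown are illustrative defaults.

```yaml
# Deployment fragment: the native update strategy that EDAS overwrites on the next deployment
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%          # illustrative value
      maxUnavailable: 25%    # illustrative value
```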
Do not directly modify the listener configuration of a Server Load Balancer (SLB) instance in the SLB console.
To bind an SLB instance to an application, you must modify the configurations of listeners for the SLB instance in the EDAS console. For more information, see Bind CLB instances.
Important: Do not modify the listener and certificate configurations in the SLB console. Otherwise, the modification may fail, and the application may become inaccessible.
Do not configure internal-facing SLB instances for access to services inside a cluster.
Pod instances cannot directly access internal-facing SLB instances, which are used only for access between different Kubernetes clusters in a Virtual Private Cloud (VPC). To access services within a cluster, create a Service and use the Service address instead. For more information, see Add a Service.
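A plain ClusterIP Service is sufficient for in-cluster access. The following sketch uses placeholder names and ports.

```yaml
# Minimal in-cluster Service; pods can reach it at demo-app.default.svc.cluster.local
apiVersion: v1
kind: Service
metadata:
  name: demo-app             # placeholder Service name
  namespace: default
spec:
  type: ClusterIP
  selector:
    app: demo-app            # placeholder application label
  ports:
  - port: 80                 # port exposed by the Service
    targetPort: 8080         # placeholder container port
```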
We recommend that you configure reasonable monitoring and alerting for a cluster.
Generally, you need to configure monitoring and alerting for a cluster, such as Prometheus monitoring. This helps you detect O&M issues caused by unstable underlying resources early. For more information about cluster monitoring, see Monitor a Kubernetes cluster.
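If your cluster runs the Prometheus Operator (for example, through a managed Prometheus service), a ServiceMonitor such as the following sketch scrapes application metrics. The labels, named port, and interval are assumptions that must match your environment.

```yaml
# Assumed sketch: scrape application metrics with the Prometheus Operator
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: demo-app
  labels:
    release: prometheus      # placeholder; must match the Prometheus instance's serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: demo-app          # placeholder Service label
  endpoints:
  - port: metrics            # placeholder named port on the Service
    interval: 30s
```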
High-concurrency running
Before high-concurrency running, perform thorough application stress testing.
We recommend that you use PTS to perform end-to-end stress testing for the application. Then, estimate the required numbers of pods and nodes based on the stress testing results and scale out the application before high-concurrency running. Also evaluate the storage and network bandwidth, and increase the storage space, SLB specifications, or network bandwidth as needed. If application routing is required, check the monitoring metrics and load of the Ingress controller, and then set a reasonable number of pod replicas for it.
Distribute the pods that run the application in the same zone if possible to avoid latency across zones.
If pods must be distributed in different zones, you can configure the Intra-zone Provider First feature for the service provider to solve the problem of network latency caused by cross-zone calls. For more information, see Enable the Intra-zone Provider First feature.
Use the same specifications for cluster nodes that run the application to ensure consistent processing performance of pods.
If the specifications of Elastic Compute Service (ECS) instances vary greatly, the performance of pods may be inconsistent, and the workloads may be uneven.
Prepare nodes of the required specifications in advance. If you cannot purchase nodes of the required specifications, contact a solution architect for coordination.
We recommend that you enable auto scaling for your application.
We recommend that you directly use an auto scaling rule that is provided by EDAS and triggered based on application metrics or at a specified time. We do not recommend that you separately enable the Horizontal Pod Autoscaler (HPA) feature for a Deployment in Kubernetes, because this may affect the normal deployment of EDAS applications. For more information, see Auto scaling.
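For clarity, the object you should avoid creating separately is the native HorizontalPodAutoscaler sketched below; for EDAS applications, define the equivalent rule in EDAS auto scaling instead. The target name and thresholds are placeholders.

```yaml
# Native Kubernetes HPA; do not create this separately for an EDAS application,
# because it can conflict with the replica count that EDAS manages
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: demo-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: demo-app           # placeholder Deployment name
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
```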
We recommend that you bind multiple SLB instances to an application.
We recommend that you configure the resolution records for multiple SLB instances in the Domain Name System (DNS) to share the load and eliminate the performance bottleneck of a single SLB instance. For more information about how to bind SLB instances to an application, see Bind CLB instances.
Note: The maximum bandwidth of an SLB instance of the highest specification (Super I, slb.s3.large) is 5 Gbit/s.
During high-concurrency running, monitor the application and lock down code changes.
Pay attention to relevant application monitoring and cluster monitoring to identify and resolve possible exceptions early. For more information, see Overview.
During high-concurrency running, strictly control code commits and releases to avoid deploying application updates while the application is handling traffic.
Change an application
This section describes only the best practices for changing an application. You can use application monitoring and application alerting to determine whether an application change succeeds. If the change fails, you can use events and failure analysis to diagnose the application.
Before an application change, reserve resources for the cluster.
Be sure to reserve resources for the cluster. Otherwise, the deployment may be slow or even fail due to insufficient resources. For more information about the causes of application deployment failures, see View the result of a deployment failure.
During an application change, view the change details and application-related metrics.
View application change records in a timely manner, and view change events and failure analysis if necessary. For more information, see View application changes.
View the changes in application metrics and release diagnosis reports, and verify your business to check whether the application version meets your expectations.
If the version does not meet your expectations, roll back the application in a timely manner to restore the application to a stable state. Handle the exception before the next release. Do not continue to release a version if it does not meet your expectations. Otherwise, you cannot roll back to the previous stable state. For more information about how to roll back an application, see Roll back an application in a Kubernetes cluster in the EDAS console.
Troubleshoot an application
This section describes the best practices for troubleshooting an application.
If a pod instance for a Java application fails, you can use Arthas for troubleshooting. For more information, see Diagnostics by Arthas.
If an application fails to start, change the startup command to sleep, run kubectl exec to enter the container and start the process manually, and then view the output to analyze the cause. For more information, see Configure a startup command.
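A minimal sketch of this technique: override the container command so that the container stays running, then exec into it and start the process by hand. The pod name and start command below are placeholders.

```yaml
# Container fragment: keep the container running so that you can debug the startup failure
command: ["sleep", "3600"]   # keeps the pod Running for one hour without starting the application
# Then enter the container and start the process manually to inspect its output, for example:
#   kubectl exec -it <pod-name> -- sh
#   java -jar /app/app.jar    # placeholder start command
```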
Obtain the release notes of EDAS
You can add a release note chatbot of EDAS in a relevant DingTalk group to obtain the release notes of EDAS at the earliest opportunity.