15 Minutes to Achieve Lossless Online and Offline of Enterprise-level Applications

Introduction

Many application systems with a large user base and high concurrency choose to release in the middle of the night, when traffic is low, to avoid losing traffic during the release. This works, but the timing is hard to control and it drives up R&D and operations costs, which is no small burden for an enterprise. To address this, the Alibaba Cloud microservice engine MSE gives microservice applications lossless online and offline capabilities during release. When an application goes offline, MSE uses adaptive waiting plus active notification. When an application comes online, MSE uses readiness checks, alignment with the microservice life cycle, service warm-up, and other technical means. Together these capabilities help enterprises avoid the traffic loss caused by online releases.


Lossless online and offline function design


Common causes of traffic loss include but are not limited to the following:

Service cannot go offline in time: Service consumers learn of changes to the registry's service list with a delay. For a period after an application goes offline, consumers therefore keep calling it, resulting in request errors.

Slow initialization: A newly started application receives online traffic while it is still initializing and loading resources. Under heavy traffic the initialization slows down, large numbers of requests time out or block, and resources are exhausted, which can crash the newly started application.

Premature registration: The service loads some resources asynchronously. If it registers with the registry before initialization is complete, requests that arrive before the resources are loaded respond slowly or fail with call timeouts.

Release state and running state are not aligned: When an application is released with the Kubernetes rolling release function, the readiness check generally associated with the rolling release decides that an application is ready by checking whether a specific application port is open, and that triggers the release of the next batch of instances. In a microservice application, however, an instance can serve external calls only after it has completed service registration. In some cases, therefore, the new instances have not yet registered with the registry while the old instances have already gone offline, leaving no service available.

Lossless offline


One of these causes is that a service cannot go offline in time, as shown in Figure 1 below:

Figure 1. Spring Cloud application consumers cannot sense provider service offline in time

Consider a Spring Cloud application with two instances, A and A', and suppose instance A goes offline. Because the Spring Cloud framework balances availability against performance, consumers by default pull the latest service list from the registry only every 30 seconds, so they cannot sense in real time that A is offline. If a consumer keeps calling A from its local cache during this window, the calls reach an offline instance and traffic is lost.
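To get a feel for the size of this window, the following is a minimal sketch rather than anything taken from Spring Cloud or MSE. It assumes a fixed refresh interval (30 seconds, as stated above), round-robin load balancing across the instances the consumer still believes are online, and no retries. The function names are hypothetical.

```python
def stale_window_seconds(offline_at, last_refresh_at, refresh_interval=30.0):
    """Seconds during which a consumer still holds the offline instance in its cache.

    An instance that goes offline exactly at a refresh instant is treated as
    missed by that refresh, which gives the worst case of one full interval.
    """
    if offline_at < last_refresh_at:
        raise ValueError("offline_at must not be earlier than last_refresh_at")
    elapsed = (offline_at - last_refresh_at) % refresh_interval
    return refresh_interval - elapsed


def expected_failed_requests(qps, instances_before, window_seconds):
    """Requests routed to the offline instance, assuming round-robin and no retries."""
    return qps * window_seconds / instances_before


# Instance A goes offline 10 seconds after the consumer's last refresh.
window = stale_window_seconds(offline_at=40, last_refresh_at=30)
lost = expected_failed_requests(qps=100, instances_before=2, window_seconds=window)
```

With these illustrative numbers the consumer keeps the stale entry for 20 seconds, and half of the 100 QPS sent in that time goes to the offline instance.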

To solve this problem, Alibaba Cloud microservice engine MSE designed and implemented a lossless offline function based on Java Agent bytecode technology, as shown in Figure 2 below:

In this lossless offline solution, the service provider application only needs to be connected to MSE. Compared with an ordinary lossy offline, the application goes through an adaptive waiting period before it goes offline. During this period, the application that is about to go offline actively notifies the service consumers that have sent it requests by sending them an offline event. On receiving the event, a consumer immediately pulls the service instance list from the registry. The consumer thus senses the offline in real time and stops calling the offline instance, which avoids the traffic loss.
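The notification flow described above can be illustrated with a small in-memory simulation. This is a sketch under stated assumptions, not MSE's implementation: the Registry, Consumer, and Provider classes and their methods are hypothetical, and the adaptive waiting period is only marked by a comment.

```python
class Registry:
    """Toy service registry holding the addresses of online providers."""

    def __init__(self):
        self._instances = set()

    def register(self, addr):
        self._instances.add(addr)

    def deregister(self, addr):
        self._instances.discard(addr)

    def list_instances(self):
        return set(self._instances)


class Consumer:
    """Caches the instance list and refreshes it when a provider announces it is leaving."""

    def __init__(self, registry):
        self.registry = registry
        self.cache = registry.list_instances()

    def on_offline_event(self, addr):
        # Pull the list immediately instead of waiting for the periodic refresh.
        self.cache = self.registry.list_instances()

    def call(self, provider):
        if provider.addr not in self.cache:
            raise LookupError("instance not in local cache: " + provider.addr)
        return provider.handle(self)


class Provider:
    """Remembers which consumers have called it so it can notify them on shutdown."""

    def __init__(self, addr, registry):
        self.addr = addr
        self.registry = registry
        self.recent_callers = []
        registry.register(addr)

    def handle(self, consumer):
        if consumer not in self.recent_callers:
            self.recent_callers.append(consumer)
        return "ok"

    def go_offline(self):
        self.registry.deregister(self.addr)
        for consumer in self.recent_callers:
            consumer.on_offline_event(self.addr)
        # A real implementation would now wait (adaptively) for in-flight
        # requests to drain before the process stops.


registry = Registry()
a = Provider("10.0.0.1:8080", registry)
a_prime = Provider("10.0.0.2:8080", registry)
consumer = Consumer(registry)
consumer.call(a)   # the consumer becomes a known caller of A
a.go_offline()     # A deregisters and notifies the consumer
```

After go_offline returns, the consumer's cache contains only the address of A', so no further calls are routed to A.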

Lossless online

Lazy loading is a very common strategy in software framework design. In the Spring Cloud framework, for example, the Ribbon component by default does not pull the service list until the first call to the service. Figure 3 below shows the time taken by the first and second calls to a remote service through RestTemplate in a Spring Cloud application:

The test results show that the first call takes several times the normal time because of resource initialization. When a newly released application handles heavy traffic as soon as it comes online, many requests are therefore likely to respond slowly, resources are likely to block, and the instance may go down. To handle slow resource initialization under heavy traffic, MSE provides a small-traffic warm-up function. It limits the traffic allocated to a newly launched instance, so that the instance handles normal traffic only after it has warmed up sufficiently. Figure 4 shows the small-traffic warm-up process.
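The idea behind small-traffic warm-up can be sketched as a load-balancing weight that grows with the instance's uptime. The article does not specify the curve MSE uses, so the linear ramp, the weight range, and the function names below are assumptions made for illustration. The 120-second default matches the warm-up time used later in the demo.

```python
def warmup_weight(uptime_s, warmup_s=120.0, full_weight=100, min_weight=1):
    """Load-balancing weight of an instance that has been up for uptime_s seconds.

    Linear ramp (an assumption): the weight grows from min_weight to
    full_weight over warmup_s seconds and stays at full_weight afterwards.
    """
    if uptime_s >= warmup_s:
        return full_weight
    if uptime_s <= 0:
        return min_weight
    return max(min_weight, int(full_weight * uptime_s / warmup_s))


def traffic_share(new_weight, old_weights):
    """Fraction of requests a weighted load balancer sends to the new instance."""
    return new_weight / (new_weight + sum(old_weights))


# One fully warmed instance (weight 100) plus one instance halfway through warm-up.
halfway = warmup_weight(60)
share = traffic_share(halfway, [100])
```

Halfway through the warm-up the new instance carries a weight of 50 and receives about one third of the traffic rather than half, which gives lazy initialization time to finish under a lighter load.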

Beyond the slow first call described above, MSE also provides a complete set of lossless online measures to meet the requirements of various applications. These include pre-establishing resource connections, delayed registration, ensuring that service registration is completed before the Kubernetes readiness check passes, and ensuring that the Kubernetes readiness check is completed before the service is warmed up. Figure 5 shows the complete solution.
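The difference between a port-based readiness check and one that also waits for service registration can be sketched as follows. This is a simplified model with hypothetical names. It does not show how MSE or Kubernetes implement the check, only why the ordering matters during a rolling release.

```python
from dataclasses import dataclass


@dataclass
class InstanceState:
    port_open: bool    # the application port accepts connections
    registered: bool   # the instance has registered with the service registry


def port_only_ready(state):
    """Readiness as a plain port check sees it."""
    return state.port_open


def registration_aware_ready(state):
    """Readiness that also requires service registration to be complete."""
    return state.port_open and state.registered


def safe_to_retire_old_batch(new_instances, ready_check):
    """A rolling release moves on only when every new instance reports ready."""
    return all(ready_check(instance) for instance in new_instances)


# The new instance has opened its port but has not registered yet.
new_batch = [InstanceState(port_open=True, registered=False)]
port_check_result = safe_to_retire_old_batch(new_batch, port_only_ready)
aware_check_result = safe_to_retire_old_batch(new_batch, registration_aware_ready)
```

With the port-only check the release would already retire the old instances, although no new instance is discoverable yet. The registration-aware check holds the release back until registration has completed.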

How to use MSE's lossless online and offline


This section walks through best practices for the lossless online and offline and service warm-up capabilities that Alibaba Cloud microservice engine MSE provides during application release. Assume that the application architecture consists of a Zuul gateway and back-end microservice application instances (Spring Cloud). The back-end call chain includes shopping cart application A, transaction center application B, and inventory center application C. The services in these applications are registered and discovered through the Nacos registry.

Prerequisites

Enable MSE Microservice Governance

• A Kubernetes cluster has been created. See Creating a Kubernetes Managed Cluster [1].
• MSE Microservice Governance Professional Edition has been activated. For details, see Activating MSE Microservice Governance [2].

Preparation

Note that the Agent used in this practice is still in grayscale release, so the application Agent must be upgraded through the grayscale process. See the upgrade document:
https://help.aliyun.com/document_detail/392373.html


If the application is deployed in a different region (for now, only regions in mainland China are supported), use the Agent download address for that region:
http://arms-apm-cn-[regionId].oss-cn-[regionId].aliyuncs.com/2.7.1.3-mse-beta/
Replace [regionId] in the address with your Alibaba Cloud region ID.

For example, the Region Beijing Agent address is:
http://arms-apm-cn-beijing.oss-cn-beijing.aliyuncs.com/2.7.1.3-mse-beta/
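The substitution can be written as a small helper. This is a sketch based only on the address template and the Beijing example above. The function name is hypothetical, and accepting a region ID with or without the "cn-" prefix is an assumption made for convenience. Only the Beijing address is confirmed by this article.

```python
AGENT_VERSION_PATH = "2.7.1.3-mse-beta"  # version path taken from the address above


def agent_download_url(region_id):
    """Build the Agent download address for a region such as "beijing" or "cn-beijing"."""
    region = region_id.strip().lower()
    if region.startswith("cn-"):
        # The template already contains the "cn-" prefix, so drop it from the input.
        region = region[len("cn-"):]
    if not region:
        raise ValueError("region_id must not be empty")
    return "http://arms-apm-cn-{0}.oss-cn-{0}.aliyuncs.com/{1}/".format(
        region, AGENT_VERSION_PATH
    )


beijing_url = agent_download_url("beijing")
```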

Application Deployment Traffic Architecture Diagram

Traffic load source


As shown in Figure 6, the spring-cloud-zuul application calls the gray version and the normal version of spring-cloud-a at the same time, at a rate of 100 QPS.


Deploy the demo application


Save the following content to a file, for example mse-demo.yaml, and run kubectl apply -f mse-demo.yaml to deploy the applications to the Kubernetes cluster created earlier. Note that the demo contains CronHPA tasks, so install the ack-kubernetes-cronhpa-controller component in the cluster first. To do so, search for the component under Container Service-Kubernetes -> Market -> Application Directory and install it in the test cluster. The file deploys Zuul and the three applications A, B, and C. Applications A and B each have a baseline version and a gray version. The baseline version of application B has the lossless offline capability turned off, and the gray version has it turned on. Application C has the service warm-up capability enabled, with a warm-up time of 120 seconds.

# Nacos Server
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: nacos-server
  name: nacos-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nacos-server
  template:
    metadata:
      labels:
        app: nacos-server
    spec:
      containers:
        - env:
            - name: MODE
              value: standalone
          image: registry.cn-shanghai.aliyuncs.com/yizhan/nacos-server:latest
          imagePullPolicy: Always
          name: nacos-server
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
      dnsPolicy: ClusterFirst
      restartPolicy: Always

# Nacos Server Service configuration
---
apiVersion: v1
kind: Service
metadata:
  name: nacos-server
spec:
  ports:
    - port: 8848
      protocol: TCP
      targetPort: 8848
  selector:
    app: nacos-server
  type: ClusterIP

# Entry Zuul application
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: spring-cloud-zuul
spec:
  replicas: 1
  selector:
    matchLabels:
      app: spring-cloud-zuul
  template:
    metadata:
      annotations:
        msePilotAutoEnable: "on"
        msePilotCreateAppName: spring-cloud-zuul
      labels:
        app: spring-cloud-zuul
    spec:
      containers:
        - env:
            - name: JAVA_HOME
              value: /usr/lib/jvm/java-1.8-openjdk/jre
            - name: LANG
              value: C.UTF-8
          image: registry.cn-shanghai.aliyuncs.com/yizhan/spring-cloud-zuul:1.0.1
          imagePullPolicy: Always
          name: spring-cloud-zuul
          ports:
            - containerPort: 20000

# Application A, base version, with full-link pass-through enabled at the machine level
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: spring-cloud-a
  name: spring-cloud-a
spec:
  replicas: 2
  selector:
    matchLabels:
      app: spring-cloud-a
  template:
    metadata:
      annotations:
        msePilotCreateAppName: spring-cloud-a
        msePilotAutoEnable: "on"
      labels:
        app: spring-cloud-a
    spec:
      containers:
        - env:
            - name: LANG
              value: C.UTF-8
            - name: JAVA_HOME
              value: /usr/lib/jvm/java-1.8-openjdk/jre
            - name: profiler.micro.service.tag.trace.enable
              value: "true"
          image: registry.cn-shanghai.aliyuncs.com/yizhan/spring-cloud-a:0.1-SNAPSHOT
          imagePullPolicy: Always
          name: spring-cloud-a
          ports:
            - containerPort: 20001
              protocol: TCP
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
          livenessProbe:
            tcpSocket:
              port: 20001
            initialDelaySeconds: 10
            periodSeconds: 30

# Application A, gray version, with full-link pass-through enabled at the machine level
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: spring-cloud-a-gray
  name: spring-cloud-a-gray
spec:
  replicas: 2
  selector:
    matchLabels:
      app: spring-cloud-a-gray
  strategy:
  template:
    metadata:
      annotations:
        alicloud.service.tag: gray
        msePilotCreateAppName: spring-cloud-a
        msePilotAutoEnable: "on"
      labels:
        app: spring-cloud-a-gray
    spec:
      containers:
        - env:
            - name: LANG
              value: C.UTF-8
            - name: JAVA_HOME
              value: /usr/lib/jvm/java-1.8-openjdk/jre
            - name: profiler.micro.service.tag.trace.enable
              value: "true"
          image: registry.cn-shanghai.aliyuncs.com/yizhan/spring-cloud-a:0.1-SNAPSHOT
          imagePullPolicy: Always
          name: spring-cloud-a-gray
          ports:
            - containerPort: 20001
              protocol: TCP
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
          livenessProbe:
            tcpSocket:
              port: 20001
            initialDelaySeconds: 10
            periodSeconds: 30

# Application B, base version, with the lossless offline capability turned off
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: spring-cloud-b
  name: spring-cloud-b
spec:
  replicas: 2
  selector:
    matchLabels:
      app: spring-cloud-b
  strategy:
  template:
    metadata:
      annotations:
        msePilotCreateAppName: spring-cloud-b
        msePilotAutoEnable: "on"
      labels:
        app: spring-cloud-b
    spec:
      containers:
        - env:
            - name: LANG
              value: C.UTF-8
            - name: JAVA_HOME
              value: /usr/lib/jvm/java-1.8-openjdk/jre
            - name: micro.service.shutdown.server.enable
              value: "false"
            - name: profiler.micro.service.http.server.enable
              value: "false"
          image: registry.cn-shanghai.aliyuncs.com/yizhan/spring-cloud-b:0.1-SNAPSHOT
          imagePullPolicy: Always
          name: spring-cloud-b
          ports:
            - containerPort: 8080
              protocol: TCP
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
          livenessProbe:
            tcpSocket:
              port: 20002
            initialDelaySeconds: 10
            periodSeconds: 30

# Application B, gray version, with the lossless offline capability enabled by default
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: spring-cloud-b-gray
  name: spring-cloud-b-gray
spec:
  replicas: 2
  selector:
    matchLabels:
      app: spring-cloud-b-gray
  template:
    metadata:
      annotations:
        alicloud.service.tag: gray
        msePilotCreateAppName: spring-cloud-b
        msePilotAutoEnable: "on"
      labels:
        app: spring-cloud-b-gray
    spec:
      containers:
        - env:
            - name: LANG
              value: C.UTF-8
            - name: JAVA_HOME
              value: /usr/lib/jvm/java-1.8-openjdk/jre
          image: registry.cn-shanghai.aliyuncs.com/yizhan/spring-cloud-b:0.1-SNAPSHOT
          imagePullPolicy: Always
          name: spring-cloud-b-gray
          ports:
            - containerPort: 8080
              protocol: TCP
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
          lifecycle:
            preStop:
              exec:
                command:
                  - /bin/sh
                  - '-c'
                  - >-
                    wget http://127.0.0.1:54199/offline 2>/tmp/null;sleep
                    30;exit 0
          livenessProbe:
            tcpSocket:
              port: 20002
            initialDelaySeconds: 10
            periodSeconds: 30

# Application C, base version
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: spring-cloud-c
  name: spring-cloud-c
spec:
  replicas: 2
  selector:
    matchLabels:
      app: spring-cloud-c
  template:
    metadata:
      annotations:
        msePilotCreateAppName: spring-cloud-c
        msePilotAutoEnable: "on"
      labels:
        app: spring-cloud-c
    spec:
      containers:
        - env:
            - name: LANG
              value: C.UTF-8
            - name: JAVA_HOME
              value: /usr/lib/jvm/java-1.8-openjdk/jre
          image: registry.cn-shanghai.aliyuncs.com/yizhan/spring-cloud-c:0.1-SNAPSHOT
          imagePullPolicy: Always
          name: spring-cloud-c
          ports:
            - containerPort: 8080
              protocol: TCP
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
          livenessProbe:
            tcpSocket:
              port: 20003
            initialDelaySeconds: 10
            periodSeconds: 30

# HPA configuration
---
apiVersion: autoscaling.alibabacloud.com/v1beta1
kind: CronHorizontalPodAutoscaler
metadata:
  labels:
    controller-tools.k8s.io: "1.0"
  name: spring-cloud-b
spec:
  scaleTargetRef:
    apiVersion: apps/v1beta2
    kind: Deployment
    name: spring-cloud-b
  jobs:
    - name: "scale-down"
      schedule: "0 0/5 * * * *"
      targetSize: 1
    - name: "scale-up"
      schedule: "10 0/5 * * * *"
      targetSize: 2
---
apiVersion: autoscaling.alibabacloud.com/v1beta1
kind: CronHorizontalPodAutoscaler
metadata:
  labels:
    controller-tools.k8s.io: "1.0"
  name: spring-cloud-b-gray
spec:
  scaleTargetRef:
    apiVersion: apps/v1beta2
    kind: Deployment
    name: spring-cloud-b-gray
  jobs:
    - name: "scale-down"
      schedule: "0 0/5 * * * *"
      targetSize: 1
    - name: "scale-up"
      schedule: "10 0/5 * * * *"
      targetSize: 2
---
apiVersion: autoscaling.alibabacloud.com/v1beta1
kind: CronHorizontalPodAutoscaler
metadata:
  labels:
    controller-tools.k8s.io: "1.0"
  name: spring-cloud-c
spec:
  scaleTargetRef:
    apiVersion: apps/v1beta2
    kind: Deployment
    name: spring-cloud-c
  jobs:
    - name: "scale-down"
      schedule: "0 2/5 * * * *"
      targetSize: 1
    - name: "scale-up"
      schedule: "10 2/5 * * * *"
      targetSize: 2


# The zuul gateway opens SLB to expose the display page
---
apiVersion: v1
kind: Service
metadata:
  name: zuul-slb
spec:
  ports:
    - port: 80
      protocol: TCP
      targetPort: 20000
  selector:
    app: spring-cloud-zuul
  type: ClusterIP

# Expose application A as Kubernetes Services
---
apiVersion: v1
kind: Service
metadata:
  name: spring-cloud-a-base
spec:
  ports:
    - name: http
      port: 20001
      protocol: TCP
      targetPort: 20001
  selector:
    app: spring-cloud-a

---
apiVersion: v1
kind: Service
metadata:
  name: spring-cloud-a-gray
spec:
  ports:
    - name: http
      port: 20001
      protocol: TCP
      targetPort: 20001
  selector:
    app: spring-cloud-a-gray

# Nacos Server SLB Service configuration
---
apiVersion: v1
kind: Service
metadata:
  name: nacos-slb
spec:
  ports:
    - port: 8848
      protocol: TCP
      targetPort: 8848
  selector:
    app: nacos-server
  type: LoadBalancer

Result verification 1: lossless offline function

Because scheduled HPA (CronHPA) is enabled for both the spring-cloud-b and spring-cloud-b-gray applications, a scale-in followed by a scale-out is simulated every 5 minutes.
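The CronHPA schedules in the demo file use six fields with the seconds field first, for example "0 0/5 * * * *". The following minimal sketch expands such a schedule into the (minute, second) offsets at which it fires within one hour. It handles only the "start/step" and plain-number forms of the minute field that appear in this demo. It is not a general cron parser, and the function name is hypothetical.

```python
from typing import List, Tuple


def fire_offsets_in_hour(schedule: str) -> List[Tuple[int, int]]:
    """Return the (minute, second) pairs at which the schedule fires in each hour."""
    fields = schedule.split()
    if len(fields) != 6:
        raise ValueError("expected 6 fields: second minute hour day month weekday")
    second = int(fields[0])
    start, _, step = fields[1].partition("/")
    if step:
        minutes = list(range(int(start), 60, int(step)))
    else:
        minutes = [int(start)]
    return [(minute, second) for minute in minutes]


scale_down_b = fire_offsets_in_hour("0 0/5 * * * *")   # spring-cloud-b scale-down
scale_up_b = fire_offsets_in_hour("10 0/5 * * * *")    # spring-cloud-b scale-up
scale_down_c = fire_offsets_in_hour("0 2/5 * * * *")   # spring-cloud-c scale-down
```

For spring-cloud-b the scale-down fires at minutes 0, 5, 10, and so on, at second 0, and the scale-up follows 10 seconds later. For spring-cloud-c the same pattern starts at minute 2.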

Log in to the MSE console and go to Microservice Governance Center -> Application List -> spring-cloud-a -> Application Details. The application monitoring curve shows the traffic data of the spring-cloud-a application:

During pod scale-out and scale-in, the traffic of the gray version has 0 request errors and no traffic loss. In the untagged version, where the lossless offline function is disabled, 20 requests sent from spring-cloud-a to spring-cloud-b fail during pod scale-out and scale-in, resulting in traffic loss.


Result verification 2: service warm-up function

The spring-cloud-c application scales in to 1 node at minute 2, second 0 of each scaling cycle and scales out to 2 nodes at minute 2, second 10.

On the consumer side of the warmed-up application, enable the service warm-up function on spring-cloud-b.

On the provider side of the warmed-up application, enable the service warm-up function on spring-cloud-c and set the warm-up time to 120 seconds.

Observe the traffic of the new node: it increases gradually. You can also see the start and end time of the node's warm-up, as well as related events.
