Assistant Engineer
How to improve availability of microservice architecture

Posted time: Oct 10, 2016 13:32
System availability is commonly measured in "nines". For instance, 99.99% availability means the system may be unavailable for only about 53 minutes over an entire year. No service achieves 100% availability, which means failures may occur while a service is running. If a monolithic architecture, in which all functions are concentrated and run in the same application, is divided into multiple separate microservices, the risk of a global failure is reduced. However, as microservices are added, the dependencies among them become increasingly complex, and each microservice is itself prone to failure. Therefore, a microservice architecture may fare worse than the monolithic one if dependencies are not properly isolated to prevent a chain reaction of failures. Assume there are 100 microservices and each can fail in only one way; then 2^100 different failure scenarios are possible in total, and in reality each microservice can fail in more than one way. If one microservice fails, how can we ensure that the other microservices depending on it remain available, that the system's automatic degradation removes the failed microservice, and that the failure does not spread to the whole system? Ensuring the availability of a microservice architecture is therefore full of challenges.
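As a quick sanity check on the "nines" arithmetic above, a few lines of Python (illustrative only) compute the yearly downtime each level allows:

```python
# Allowed yearly downtime for a given number of "nines" of availability.
def allowed_downtime_minutes(nines: int) -> float:
    availability = 1 - 10 ** -nines            # e.g. 4 nines -> 0.9999
    return (1 - availability) * 365 * 24 * 60  # unavailable minutes per year

for n in (2, 3, 4, 5):
    print(f"{n} nines: {allowed_downtime_minutes(n):.1f} minutes/year")
# 4 nines (99.99%) allows roughly 52.6 minutes of downtime per year
```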
The figure below shows a simplified user request, which is fulfilled through the cooperation of five microservices (a pod, as defined in the Kubernetes (K8S) container framework, is a set of containers with the same function).

Assume one of the dependent services, previously normal, now becomes abnormal. Three scenarios may occur:
1. The request succeeds. Assume a node of Service C is unavailable due to a network fault or host failure, and a high-availability node replaces the failed one; then Service C is unaffected and remains available, as shown in the figure below:

2. The request succeeds. Assume a failure occurs in the non-critical Service D; the operation can still proceed. For instance, suppose Service D sends a mail confirming successful registration to a newly registered user. If Service D is unavailable, registration still completes, and the mail can be resent after the service recovers. Meanwhile, Service A is unaffected and remains available, as shown in the figure below:

3. The request fails. For instance, if the abnormality in Service E is a code-level logic error, all of its high-availability nodes become unavailable. It is therefore necessary to isolate the dependency on Service E; otherwise Service A may be affected and become unavailable too. Several measures are required to ensure Service A stays available, as shown in the figure below:

Availability of microservice architecture can be improved by:
1) Failover
To improve service availability, it is essential to eliminate single points of failure and, using a server load balancer, create clusters whose nodes are all stateless and completely equivalent, as in Scenario 1 above. If a node becomes abnormal, the load balancer sends user requests to the available nodes instead. The failure of a single node is thus transparent to users, whose requests are transferred to healthy nodes.
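A cluster of stateless, equivalent nodes behind a balancer can be sketched with a toy round-robin picker; the class and node names here are hypothetical, not part of any real load-balancer API:

```python
import itertools

class RoundRobinBalancer:
    """Round-robin over stateless, equivalent nodes, skipping nodes
    that have been marked unhealthy (failover is transparent to callers)."""
    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.healthy = set(self.nodes)
        self._cycle = itertools.cycle(self.nodes)

    def mark_down(self, node):
        self.healthy.discard(node)

    def pick(self):
        # Try each node at most once per call; fail only if none is healthy.
        for _ in range(len(self.nodes)):
            node = next(self._cycle)
            if node in self.healthy:
                return node
        raise RuntimeError("no healthy nodes")

lb = RoundRobinBalancer(["c1", "c2", "c3"])
lb.mark_down("c2")                      # node failure is invisible to callers
picks = [lb.pick() for _ in range(4)]   # requests go only to c1 and c3
```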

2) Asynchronous calls
It is necessary to use asynchronous calls to prevent the failure of one service from failing all application requests, as in Scenario 2 above. With synchronous calls, an abnormal mail service could cause other services to fail, ultimately resulting in user registration failures. With asynchronous calls, Service A sends the user registration information to a message queue and immediately returns a successful registration response. Writing to the database and other operations proceed normally even though the mail service is unavailable; failing to send the mail has no effect on the other services, and user registration completes successfully.
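The registration example can be sketched with Python's standard `queue` and `threading` modules; `register_user` and `mail_worker` are hypothetical names used only for illustration:

```python
import queue
import threading

mail_queue = queue.Queue()
sent = []

def mail_worker():
    # Drains the queue in the background; if the mail service were down,
    # messages would simply wait here and be retried after recovery.
    while True:
        name = mail_queue.get()
        if name is None:
            break
        sent.append(f"welcome mail to {name}")

def register_user(name):
    # ... write the user to the database (omitted) ...
    mail_queue.put(name)   # enqueue the mail instead of calling the mail service
    return "registered"    # respond immediately; mail is sent asynchronously

worker = threading.Thread(target=mail_worker)
worker.start()
result = register_user("alice")
mail_queue.put(None)       # stop signal, for this demo only
worker.join()
```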

3) Dependency isolation
User requests are sent to Service A, which allocates thread resources to call other services remotely over the network. If an abnormality occurs when calling Service E, the threads in Service A making those calls may respond slowly or become blocked. These threads hold system resources; if they cannot be released within a short period, the resources may be exhausted under high concurrency, making Service A unavailable even though the other services it depends on are still healthy.

Service A has limited resources. For example, suppose Service A allocates 400 threads at startup. If those threads cannot be released promptly because of an abnormality when calling Service E (such as deadlock or slow responses), all 400 threads may end up blocked on calls to Service E. Service A then has no idle threads left to receive new user requests, so it hangs or locks up. It is therefore essential to ensure that the thread resources of Service A cannot be exhausted by any dependent service it calls, preventing Service A from being dragged down by its dependencies. Release It! summarizes two important methods: setting timeouts and using circuit breakers.

Set timeout.
Once a thread's execution time exceeds the timeout set for the service call, an exception is thrown and the connection is closed automatically. This way, threads are no longer blocked on a call for an overly long period, leaving idle threads free to receive new user requests, and Service A avoids becoming unavailable because of an abnormality in calling Service E. Therefore, a timeout must always be set when calling external dependent services over the network.
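A minimal sketch of such a timeout, using Python's `concurrent.futures`; the hung remote call is simulated with `time.sleep`, and the service name is hypothetical:

```python
import concurrent.futures
import time

def call_service_e():
    time.sleep(1)          # simulate a remote call that hangs
    return "ok"

pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
future = pool.submit(call_service_e)
try:
    # Give up after 100 ms instead of blocking the caller's thread.
    result = future.result(timeout=0.1)
except concurrent.futures.TimeoutError:
    result = "fallback"    # exception raised; the caller's thread is freed
pool.shutdown(wait=False)
```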

Use circuit breaker.
Circuit breakers are familiar to everyone. A household breaker trips under current overload or a short circuit; if it did not, the circuit would stay closed and the wires would heat up, possibly causing a fire. The breaker trips automatically and breaks the circuit on overload, averting a worse disaster. The same holds in applications. If calls to a dependent service repeatedly time out, sending new requests will likely also time out, wasting resources, increasing load, and leaving the service unavailable without producing useful results. Under such circumstances, a circuit breaker avoids this waste. A circuit breaker is installed between a service and its dependency to monitor the state of access. If timeouts or failures reach some threshold (for example, 50% of requests time out, or 20 requests fail consecutively), the circuit breaker opens and further requests fail immediately instead of waiting for a long period. After a time interval (such as 30 s), or once the timeout rate falls back (for example, to 0%), the breaker is tentatively closed again (like replacing a fuse) to check whether the dependent service has recovered.
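The behavior described above can be sketched as a minimal circuit breaker; the thresholds and names are illustrative and not taken from any specific library such as Hystrix:

```python
import time

class CircuitBreaker:
    """Minimal sketch: opens after N consecutive failures, fails fast
    while open, and half-opens after a cool-down to probe for recovery."""
    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None            # None means the circuit is closed

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                return fallback()        # open: fail fast, no waiting
            self.opened_at = None        # half-open: allow one probe through
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()   # trip the breaker
            return fallback()
        self.failures = 0                # success closes the circuit again
        return result

breaker = CircuitBreaker(failure_threshold=2, reset_timeout=30.0)

def broken_service():
    raise ConnectionError("Service E is down")

# After two failures the breaker opens; later calls fail fast via fallback.
results = [breaker.call(broken_service, lambda: "fallback") for _ in range(5)]
```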

If a service depends on several services, one of which is non-critical and unavailable, setting timeouts and using circuit breakers ensures that Service A remains normal while calling the abnormal Service E and can continue to operate properly, thus isolating the dependency effectively, as shown in the figure below:

4) Rate limiting
During peak access periods, concurrency surges, which degrades performance; requests may even pile up in queues and bring the service down. To keep the application available, it is possible to reject low-priority calls so that high-priority requests can still complete, instead of letting every call fail. A small thread pool is provided for each dependent service; if the pool is full, new calls are rejected immediately rather than queued, which makes failures easy to identify. Consequently, some users are served while others are rejected, and a rejected user may succeed on a later attempt. Such measures keep the service partially available instead of completely unavailable.
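The "small pool per dependency, reject instead of queue" idea can be sketched with a semaphore-based bulkhead; all names here are hypothetical:

```python
import threading

class Bulkhead:
    """Small per-dependency pool: reject immediately when full instead of
    queuing, so overload on one dependency is detected fast and cannot
    exhaust the caller's threads."""
    def __init__(self, max_concurrent):
        self._slots = threading.Semaphore(max_concurrent)

    def call(self, fn):
        if not self._slots.acquire(blocking=False):  # full -> reject, no queue
            raise RuntimeError("rejected: dependency pool exhausted")
        try:
            return fn()
        finally:
            self._slots.release()

bulkhead = Bulkhead(max_concurrent=2)

# Simulate two in-flight calls holding both slots:
bulkhead._slots.acquire()
bulkhead._slots.acquire()
try:
    bulkhead.call(lambda: "ok")
    outcome = "accepted"
except RuntimeError:
    outcome = "rejected"      # the third concurrent call fails fast
bulkhead._slots.release()
bulkhead._slots.release()
```

A rejected caller can retry later; once a slot frees up, calls are accepted again.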
Even with the above measures in place, the system remains complex, and even a simple fix may have unforeseeable consequences. Moreover, the system is dynamic: some systems release several times, or even dozens of times, a day. Failure therefore remains unavoidable. To minimize firefighting in the middle of the night or during holidays, further measures must be developed to improve availability. For instance, some enterprises hold regular fault emergency drills in the production environment. In the past, man-made fault tests were conducted during off-peak business hours to check whether high-availability solutions were effective, covering drills on hosts, networks, applications, and storage. Now, fault emergency drills are gradually being conducted during normal production periods as well. The problem is that some faults which are recovered immediately during a drill still take a long time to recover from in real incidents. The reasons include: 1) a drill follows a solution prepared for a known scenario; 2) a drill generally covers only switchover to high-availability nodes or disaster-recovery systems; 3) a drill is a manual operation requiring the participation of many personnel, so it cannot happen frequently. Yet the system is dynamic: being highly available today does not guarantee that it will still be highly available next week or next month.

After a monolithic architecture is divided into microservices, drills at the application layer become more complex. As stated above, if each of 100 services is assumed to fail in only one way, 2^100 different scenarios may occur, so an automated fault-testing method is required: manual drills are no longer practical after microservitization. Netflix proposed an automated fault-testing solution to improve the availability of its microservice architecture, implemented in the production environment. The ultimate purpose of the test is to ensure that services do not stop in production when a real fault occurs, and that the whole system degrades gracefully and removes the faulty components without human intervention. Netflix believes that a solution tested only in a test environment may not be effective against real production faults, because business stress, business scenarios, environment configuration, network performance, and hardware performance are not reproduced there. Tests are conducted only during working hours, so that engineers can receive alarms and respond quickly.

In the paper "Lineage-driven Fault Injection", Peter Alvaro proposed an algorithm named "Molly", which Netflix combined with its failure injection testing (FIT) framework to realize safe, automated fault injection. Starting from a trouble-free state of the system, Molly asks: "How did the system reach this trouble-free state?" A simple example illustrates the principle. Using the tracing system, draw a tree of all the microservices handled by each request, as assumed in the figure below:

(A or R or P or B)

Initially, all four nodes in the figure are necessary and normal. Then, reasoning backward from the correct output, select a node at random for fault injection, and build a logical chain diagram supporting the output's correctness. After a fault is injected into a node, one of three scenarios occurs:

1. The request fails. We have found a node with a potential fault, which can be removed from future tests.
2. The request succeeds; the faulty node turns out not to be critical.
3. The request succeeds; a high-availability node replaces the failed one.
In this instance, a fault is injected into Ratings, yet the request succeeds, indicating that the failure of Ratings does not affect the service. So we can remove this node temporarily and redraw the request tree:

(A or P or B) and (A or P or B or R)

Now we can see that the request can be satisfied by (A or P or B) and (A or P or B or R). Next, inject a fault into Playlist. The request still succeeds, because it is forwarded to the backup node, so a new failover node has been discovered.

(A or PF or B) and (A or P or B) and (A or P or B or R)

Now the algorithm updates to indicate that the request can be satisfied by (A or PF or B), (A or P or B) and (A or P or B or R). The test is then repeated until all correct outputs have been exercised and no new fault nodes are found.

Molly does not prescribe how to search the space. In practice, all candidate solutions are evaluated and the smallest sets are explored first. For instance, the final solution set might be [{A}, {PF}, {B}, {P,PF}, {R,A}, {R,B} …]: all single nodes are selected for fault injection first, then all pairs of nodes, and so forth.
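The search order described above — single nodes first, then pairs, skipping supersets of failure combinations already found — can be sketched as follows; the request tree and the `request_succeeds` predicate are hypothetical stand-ins for the traced system:

```python
import itertools

def search_failure_space(nodes, request_succeeds):
    """Inject faults into all single nodes first, then all pairs, and so
    on, recording the minimal combinations whose failure breaks the request."""
    critical_sets = []
    for size in range(1, len(nodes) + 1):
        for combo in itertools.combinations(nodes, size):
            # Skip supersets of an already-found minimal failing set.
            if any(set(found) <= set(combo) for found in critical_sets):
                continue
            if not request_succeeds(set(combo)):
                critical_sets.append(combo)
    return critical_sets

# Hypothetical request tree: A and B are single points of failure,
# P has a backup PF, and R is non-critical.
def request_succeeds(failed):
    return ("A" not in failed and "B" not in failed
            and not {"P", "PF"} <= failed)

found = search_failure_space(["A", "P", "PF", "B", "R"], request_succeeds)
# found lists the minimal failure sets: A alone, B alone, and P with PF
```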

The purpose of this testing is to find and fix faults before they affect numerous members, and affecting numerous members is unacceptable when fault tests run in the production environment. To avoid that risk, the test is confined to a specified scope, built on two concepts: the failure scope and the injection points. The failure scope limits the potential impact of a fault test to a controllable range, from a single user or device up to 1% of all users. The injection points are the components of the system where faults are expected to be injected, such as the RPC layer, the cache layer, or the persistence layer. The process is shown in the figure below:

The fault simulation test injects fault simulation metadata into Zuul from FIT. If a request is identified as falling within the failure scope, it is decorated with the fault metadata. Such faults may include delaying a service call or failing to reach the persistence layer. At each injection point the request reaches, the system checks whether that component is the one to be injected with a fault in the context of this request; if so, the simulated fault is injected at that point.