Distributed systems are complex, and that complexity puts the stability of infrastructure, application logic, and O&M at risk, which can cause business systems to fail. It is therefore important to build fault tolerance into distributed systems. This topic describes how to use Alibaba Cloud Service Mesh (ASM) to configure the timeout processing, retry, bulkhead, and circuit breaking mechanisms to build fault tolerance into distributed systems.
Fault tolerance refers to the ability of a system to continue to run during partial failures. To create a reliable and resilient system, you must make sure that all services in the system are fault-tolerant. The dynamic nature of the cloud environment requires services to proactively anticipate failures and gracefully respond to unexpected incidents.
Any service may receive requests that fail, so appropriate measures must be in place to handle failed requests. The interruption of a single service may cause knock-on effects and lead to serious consequences for your business. It is therefore necessary to build resiliency into the system, test it, and rely on it in production. ASM provides a fault tolerance solution that supports the timeout processing, retry, bulkhead, and circuit breaking mechanisms. The solution brings fault tolerance to applications without requiring changes to application code.
When a client sends a request to an upstream service, the upstream service may not respond. You can set a timeout period. If the upstream service does not respond to the request within the timeout period, the client considers the request a failure and no longer waits for a response from the upstream service.
After a timeout period is set, an application receives an error if the backend service does not respond within the timeout period. The application can then take appropriate fallback actions. The timeout setting specifies only how long the requesting client waits for a service to respond. It does not affect the processing behavior of the service. Therefore, a timeout does not necessarily mean that the requested operation failed: the service may still complete it after the client has stopped waiting.
ASM allows you to configure a timeout policy for a route in a virtual service to set a timeout period. If a sidecar proxy does not receive a response within the timeout period, the request fails. After you set the timeout period for a route, the timeout setting applies to all requests that use the route.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: httpbin
spec:
  hosts:
  - 'httpbin'
  http:
  - route:
    - destination:
        host: httpbin
    timeout: 5s
- timeout: specifies the timeout period. If the requested service does not respond within the specified timeout period, an error is returned, and the requesting client no longer waits for a response.
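Because a timeout is set per route, different paths of the same host can have different timeout periods. The following is a minimal sketch under stated assumptions: it assumes that slow requests arrive under the /delay path prefix and gives them a longer timeout than all other requests. The path prefix and both timeout values are illustrative choices, not values from this topic.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: httpbin
spec:
  hosts:
  - httpbin
  http:
  # Assumed slow path: requests whose URI starts with /delay may wait up to 10s.
  - match:
    - uri:
        prefix: /delay
    route:
    - destination:
        host: httpbin
    timeout: 10s
  # Catch-all route: all other requests time out after 2s.
  - route:
    - destination:
        host: httpbin
    timeout: 2s
```

Routes are evaluated in order, so the more specific match must be listed before the catch-all route.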
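ASM also allows you to configure a retry policy for a route in a virtual service. In the following example, a failed request to the httpbin application is retried up to three times, and each attempt times out after 5 seconds.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: httpbin
spec:
  hosts:
  - 'httpbin'
  http:
  - route:
    - destination:
        host: httpbin
    retries:
      attempts: 3
      perTryTimeout: 5s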
- attempts: specifies the maximum number of retries.
- perTryTimeout: specifies the timeout period for each attempt, including the initial request and every retry. Unit: milliseconds (ms), seconds (s), minutes (m), or hours (h).
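A retry policy is often combined with an overall route timeout so that retries cannot extend the total wait time indefinitely. The sketch below illustrates this under assumptions: the retryOn conditions and the 10s and 3s values are example choices, not values taken from this topic.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: httpbin
spec:
  hosts:
  - httpbin
  http:
  - route:
    - destination:
        host: httpbin
    # Upper bound for the whole request, including all retries.
    timeout: 10s
    retries:
      # Retry a failed request at most 3 times.
      attempts: 3
      # Each attempt (initial request or retry) may take at most 3s.
      perTryTimeout: 3s
      # Assumed retry conditions; adjust to the failures you want to retry.
      retryOn: gateway-error,connect-failure,refused-stream
```

With these values, four attempts of 3 seconds each would exceed 10 seconds, so the route timeout can end the request before the last retry completes.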
A bulkhead pattern limits the maximum number of connections and requests that a client can initiate to a service, which protects the service from being overloaded. If a specified threshold is exceeded, new requests are rejected. A bulkhead pattern helps isolate the resources that services use and prevents cascading system failures. The maximum number of concurrent connections and the connection timeout period are common connection settings that are valid for both TCP and HTTP. The maximum number of requests per connection and the maximum number of request retries are valid only for HTTP/1.1, HTTP/2, and Google Remote Procedure Call (gRPC) connections.
ASM allows you to create a destination rule to configure a bulkhead pattern. In this example, a destination rule is created to define the following bulkhead pattern: When a service requests the httpbin application, the maximum number of concurrent connections is 1, and the maximum number of requests per connection is 1. In addition, the service receives a 503 error if no connection to the httpbin application is established within 10 seconds.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: httpbin
spec:
  host: httpbin
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 1
        maxRequestsPerConnection: 1
      tcp:
        connectTimeout: 10s
        maxConnections: 1
- http1MaxPendingRequests: specifies the maximum number of requests that can be queued while waiting for an available connection from the connection pool.
- maxRequestsPerConnection: specifies the maximum number of requests per connection.
- connectTimeout: specifies the timeout period for establishing a TCP connection.
- maxConnections: specifies the maximum number of concurrent connections.
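The example above limits connections and pending requests. For HTTP/2 or gRPC traffic, where many requests share one connection, a limit on concurrent requests is often more relevant. The following sketch assumes that the httpbin application is called over HTTP/2. The fields http2MaxRequests and maxRetries come from the Istio connection pool settings rather than from this topic, and all limit values are illustrative.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: httpbin
spec:
  host: httpbin
  trafficPolicy:
    connectionPool:
      http:
        # Maximum number of active requests to the destination.
        http2MaxRequests: 100
        # Maximum number of retries that can be outstanding at a given time.
        maxRetries: 3
      tcp:
        # Fail if a TCP connection cannot be established within 10s.
        connectTimeout: 10s
        # Assumed value: allow up to 10 concurrent connections.
        maxConnections: 10
```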
The circuit breaking mechanism works in the following way: Service A sends requests to Service B and tracks the number of consecutive errors that occur within a specified period of time. If the number of consecutive errors exceeds the specified threshold, the circuit breaker opens: Service A stops sending requests to Service B, and all subsequent requests fail immediately until the circuit breaker closes again.
ASM allows you to create a destination rule to configure the circuit breaking mechanism. In this example, a destination rule is created to define the following circuit breaking mechanism: If requests to a host of the httpbin application fail three consecutive times, the host is ejected from the load balancing pool for at least 5 minutes. Hosts are checked for ejection every 5 seconds.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: httpbin
spec:
  host: httpbin
  trafficPolicy:
    outlierDetection:
      consecutiveErrors: 3
      interval: 5s
      baseEjectionTime: 5m
      maxEjectionPercent: 100
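- consecutiveErrors: specifies the number of consecutive errors after which a host is ejected.
- interval: specifies the time interval between ejection analysis sweeps.
- baseEjectionTime: specifies the minimum ejection duration. A host stays ejected for this duration multiplied by the number of times it has been ejected.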
- maxEjectionPercent: specifies the maximum percentage of hosts that can be ejected from a load balancing pool.
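In recent Istio releases, consecutiveErrors is deprecated in favor of consecutive5xxErrors and consecutiveGatewayErrors. The following sketch shows a comparable rule that uses consecutive5xxErrors. Whether your ASM instance version supports this field is an assumption that you should verify. maxEjectionPercent is lowered to 50 as an illustrative choice so that at least half of the hosts always remain in the load balancing pool.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: httpbin
spec:
  host: httpbin
  trafficPolicy:
    outlierDetection:
      # Eject a host after 3 consecutive 5xx responses.
      consecutive5xxErrors: 3
      # Run ejection analysis every 5s.
      interval: 5s
      # Keep an ejected host out of the pool for at least 5 minutes.
      baseEjectionTime: 5m
      # Assumed value: never eject more than half of the hosts.
      maxEjectionPercent: 50
```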