Distributed systems are complex, and this complexity puts the stability of infrastructure, application logic, and O&M at risk, which may lead to failures in business systems. Therefore, it is important to build fault tolerance into distributed systems. This topic describes how to use Alibaba Cloud Service Mesh (ASM) to configure the timeout processing, retry, bulkhead, and circuit breaking mechanisms to build fault tolerance into distributed systems.

Background information

Fault tolerance refers to the ability of a system to continue to run during partial failures. To create a reliable and resilient system, you must make sure that all services in the system are fault-tolerant. The dynamic nature of the cloud environment requires services to proactively anticipate failures and gracefully respond to unexpected incidents.

Requests to any service may fail, and appropriate measures must be in place to handle the failed requests. The interruption of a single service may cause knock-on effects and lead to serious consequences for your business. Therefore, it is necessary to build resiliency into the system, test it, and rely on it in production. ASM provides a fault tolerance solution that supports the timeout processing, retry, bulkhead, and circuit breaking mechanisms. The solution brings fault tolerance to applications without requiring changes to the application code.

Timeout processing

When a client sends a request to an upstream service, the upstream service may not respond. You can set a timeout period. If the upstream service does not respond to the request within the timeout period, the client considers the request a failure and no longer waits for a response from the upstream service.

After a timeout period is set, an application receives an error if the backend service does not respond within the timeout period. Then, the application can take appropriate fallback actions. The timeout setting specifies only how long the requesting client waits for a service to respond to the request. It does not affect the processing behavior of the service. Therefore, a timeout does not necessarily mean that the requested operation failed: the service may still complete the operation after the client stops waiting.

ASM allows you to configure a timeout policy for a route in a virtual service to set a timeout period. If a sidecar proxy does not receive a response within the timeout period, the request fails. After you set the timeout period for a route, the timeout setting applies to all requests that use the route.

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: httpbin
spec:
  hosts:
  - 'httpbin'
  http:
  - route:
    - destination:
        host: httpbin
    timeout: 5s

timeout: specifies the timeout period. If the requested service does not respond within the specified timeout period, an error is returned, and the requesting client no longer waits for a response.
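
Because a timeout is set per route, different request paths can be given different limits. The following is a minimal sketch and not part of the original example: it assumes that the httpbin application serves slow responses under the /delay path, and both the path prefix and the 2s value are illustrative. Applying it would replace the httpbin virtual service shown above.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: httpbin
spec:
  hosts:
  - 'httpbin'
  http:
  # Assumed slow endpoint: fail fast after 2 seconds.
  - match:
    - uri:
        prefix: /delay
    route:
    - destination:
        host: httpbin
    timeout: 2s
  # All other requests fall through to this route and keep the 5-second timeout.
  - route:
    - destination:
        host: httpbin
    timeout: 5s
```

Routes are evaluated in order, so the more specific /delay route is listed before the catch-all route.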

Retry

If a request to a service fails, for example, because of a request timeout, a connection timeout, or a service breakdown, you can configure the retry mechanism to send the request again.

Notice: Do not make frequent retries or specify a long timeout period for each retry. Otherwise, cascading system failures may occur. If many connection attempts are made while the service is recovering, the service is placed under additional pressure. Repeated connection attempts may overwhelm the service and make existing problems more severe.

ASM allows you to create a virtual service to define a retry policy. In this example, a virtual service is created to define the following retry policy: When a service requests the httpbin application, if the httpbin application does not respond or the request fails, the request is retried up to three times. The timeout period for each attempt is 5 seconds.

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: httpbin
spec:
  hosts:
  - 'httpbin'
  http:
  - route:
    - destination:
        host: httpbin
    retries:
      attempts: 3
      perTryTimeout: 5s
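
  • attempts: specifies the maximum number of retries for a request.
  • perTryTimeout: specifies the timeout period for each attempt, including the initial request and each retry. Specify the value with a unit of milliseconds (ms), seconds (s), minutes (m), or hours (h), for example, 500ms or 5s.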
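
A retry policy is often combined with an overall route timeout and a list of retry conditions, so that only failures that are likely to be transient are retried and the total wait stays bounded. The sketch below is an assumption-based variation of the preceding example: the retryOn conditions and the 2s and 10s durations are illustrative values that do not come from this topic.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: httpbin
spec:
  hosts:
  - 'httpbin'
  http:
  - route:
    - destination:
        host: httpbin
    # Overall limit for the request, including all retries (illustrative value).
    timeout: 10s
    retries:
      attempts: 3
      # Short per-attempt limit, in line with the notice above (illustrative value).
      perTryTimeout: 2s
      # Retry only on gateway errors, connection failures, and refused streams.
      retryOn: gateway-error,connect-failure,refused-stream
```

With these values, a request is tried at most four times (the initial attempt plus three retries), each attempt is limited to 2 seconds, and the 10-second route timeout caps the total time that the client waits.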

Bulkhead pattern

A bulkhead pattern limits the maximum number of connections and requests that a client can initiate to a service, which prevents excessive access to the service. If a specified threshold is exceeded, new requests are rejected. A bulkhead pattern helps isolate the resources that are used by services and prevents cascading system failures. The maximum number of concurrent connections and the timeout period for each connection are common connection settings that are valid for both TCP and HTTP. The maximum number of requests per connection and the maximum number of request retries are valid only for HTTP/1.1, HTTP/2, and Google Remote Procedure Call (gRPC) connections.

ASM allows you to create a destination rule to configure a bulkhead pattern. In this example, a destination rule is created to define the following bulkhead pattern: When a service requests the httpbin application, the maximum number of concurrent connections is 1, and the maximum number of requests per connection is 1. In addition, the service receives a 503 error if no connection to the httpbin application is established within 10 seconds.

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: httpbin
spec:
  host: httpbin
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 1
        maxRequestsPerConnection: 1
      tcp:
        connectTimeout: 10s
        maxConnections: 1

  • http1MaxPendingRequests: specifies the maximum number of requests that can be queued while they wait for an available connection.
  • maxRequestsPerConnection: specifies the maximum number of requests per connection.
  • connectTimeout: specifies the timeout period for establishing a connection.
  • maxConnections: specifies the maximum number of concurrent connections.
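
The limits of 1 in the preceding example are useful for demonstrating rejections, but they are very strict. The following sketch shows a looser connection pool that also sets the HTTP request and retry limits. It is an illustration under assumptions: the resource name httpbin-bulkhead-sketch and all numeric values are hypothetical, and it is meant as an alternative to the preceding destination rule rather than an addition to it, because two destination rules for the same host can conflict.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: httpbin-bulkhead-sketch   # hypothetical name
spec:
  host: httpbin
  trafficPolicy:
    connectionPool:
      tcp:
        # At most 10 concurrent connections to httpbin (illustrative value).
        maxConnections: 10
        connectTimeout: 3s
      http:
        # Requests allowed to queue while waiting for a connection.
        http1MaxPendingRequests: 5
        # Maximum number of active requests to the destination.
        http2MaxRequests: 20
        maxRequestsPerConnection: 10
        # Maximum number of retries that can be outstanding at the same time.
        maxRetries: 2
```

Suitable values depend on the capacity of the service, so the limits are best derived from load tests rather than copied from this sketch.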

Circuit breaking

The circuit breaking mechanism works in the following way: If Service B fails to respond to a request from Service A, Service A does not immediately stop sending requests. Instead, the number of consecutive errors that occur within a specified period of time is tracked. If the number of consecutive errors reaches the specified threshold, the circuit breaker opens and requests are no longer forwarded to the failing destination. Subsequent requests to that destination fail immediately until the circuit breaker closes again.

ASM allows you to create a destination rule to configure the circuit breaking mechanism. In this example, a destination rule is created to define the following circuit breaking mechanism: If requests from a service to the httpbin application fail three consecutive times within 5 seconds, the failing httpbin instance is ejected from the load balancing pool and receives no requests for at least 5 minutes.

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: httpbin
spec:
  host: httpbin
  trafficPolicy:
    outlierDetection:
      consecutiveErrors: 3
      interval: 5s
      baseEjectionTime: 5m
      maxEjectionPercent: 100

  • consecutiveErrors: specifies the number of consecutive errors after which a host is ejected.
  • interval: specifies the time interval between ejection analysis sweeps.
  • baseEjectionTime: specifies the minimum ejection duration. A host stays ejected for this duration multiplied by the number of times that the host has been ejected.
  • maxEjectionPercent: specifies the maximum percentage of hosts that can be ejected from a load balancing pool.
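
In recent Istio releases, the consecutiveErrors field is deprecated in favor of consecutive5xxErrors and consecutiveGatewayErrors. The following sketch expresses the same policy with consecutive5xxErrors. It assumes that your ASM instance supports this field, which you should verify for your version. The resource name and the 50 percent ejection cap are illustrative choices that are not part of the original example.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: httpbin-circuit-breaker-sketch   # hypothetical name
spec:
  host: httpbin
  trafficPolicy:
    outlierDetection:
      # Eject a host after three consecutive 5xx responses.
      consecutive5xxErrors: 3
      # Run the ejection analysis every 5 seconds.
      interval: 5s
      # Keep an ejected host out of the pool for at least 5 minutes.
      baseEjectionTime: 5m
      # Never eject more than half of the hosts, so some capacity always remains.
      maxEjectionPercent: 50
```

Capping maxEjectionPercent below 100 trades stricter isolation for availability: even if every instance is returning errors, at least half of them continue to receive traffic.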