Configure fault injection

Fault injection introduces deliberate failures into a service mesh to improve fault tolerance, discover client-side bugs, and identify potential faults. Unlike network-layer chaos testing (dropping packets or killing pods), fault injection works at the application layer through Envoy proxies, targeting specific failures like HTTP delays or error codes.

Service Mesh (ASM) supports fault injection through VirtualService resources. The following example injects a delay fault into the HTTPBin service and verifies the result.

Fault injection types

ASM supports two fault types, both configured in a VirtualService:

| Type | What it simulates | Use case |
| --- | --- | --- |
| Delay | Increased network latency or an overloaded upstream service | Test timeout handling and retry logic |
| Abort | Upstream service failure (HTTP error codes) | Test error handling and fallback behavior |

The rest of this topic walks through delay fault injection. For the full VirtualService fault injection API, see Manage virtual services.
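
For comparison with the delay walkthrough, an abort fault could look like the following fragment. This is a minimal sketch based on the Istio VirtualService API that ASM uses; the `503` status code and the `100` percentage are illustrative assumptions, not values taken from this topic.

```yaml
# Sketch only: the http section of a VirtualService that aborts requests.
# httpStatus 503 and percentage 100 are assumed example values.
http:
  - fault:
      abort:
        httpStatus: 503      # HTTP status code returned to the caller
        percentage:
          value: 100         # share of requests that are aborted
    route:
      - destination:
          host: httpbin      # matched requests that are not aborted go here
```

With a rule like this, the caller should receive the error code from the Envoy proxy without the request reaching HTTPBin.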

Prerequisites

Complete the preparations and deploy the HTTPBin and sleep services. For more information, see Preparations.

Step 1: Verify that the services are running

Before injecting faults, confirm that the HTTPBin service responds normally.

Use kubectl and the cluster's kubeconfig file to connect to your Container Service for Kubernetes (ACK) cluster, then open a shell in the sleep pod:
```
   kubectl exec -it deploy/sleep -- sh
```

Send a request to the HTTPBin service:

   curl -I httpbin:8000

Expected output:

   HTTP/1.1 200 OK
   server: envoy
   date: Fri, 11 Aug 2023 09:50:24 GMT
   content-type: text/html; charset=utf-8
   content-length: 9593
   access-control-allow-origin: *
   access-control-allow-credentials: true
   x-envoy-upstream-service-time: 3

The 200 OK response confirms that the service is reachable. The x-envoy-upstream-service-time: 3 header shows that HTTPBin took about 3 milliseconds to respond, with no artificial delay.

Step 2: Inject a delay fault

Create a VirtualService that adds a 5-second delay to all requests to the HTTPBin service.

Apply the following YAML through the ASM console or kubectl. For detailed steps, see Manage virtual services.

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: httpbin-vs
  namespace: default
spec:
  hosts:
    - httpbin
  http:
    - fault:
        delay:
          fixedDelay: 5s
          percentage:
            value: 100
      route:
        - destination:
            host: httpbin

Key fields:

| Field | Description |
| --- | --- |
| `fault.delay.fixedDelay` | Duration of the injected delay. Set to `5s` in this example. |
| `fault.delay.percentage.value` | Percentage of requests affected. `100` means all requests are delayed. |
| `hosts` | Target service. Requests to `httpbin` are matched. |
| `route.destination.host` | Upstream service that receives the request after the delay. |

Step 3: Verify the delay fault

After applying the VirtualService, confirm that requests to HTTPBin are now delayed by 5 seconds.

Open a shell in the sleep pod:
```
   kubectl exec -it deploy/sleep -- sh
```

Send a request and measure the total response time:

   curl -w "Total time: %{time_total} seconds\n" -I httpbin:8000

Expected output:

   HTTP/1.1 200 OK
   server: istio-envoy
   date: Sun, 27 Aug 2023 12:41:05 GMT
   content-type: text/html; charset=utf-8
   content-length: 9593
   access-control-allow-origin: *
   access-control-allow-credentials: true
   x-envoy-upstream-service-time: 3

   Total time: 5.008333 seconds

Interpret the results

Compare the output from Step 1 and Step 3:

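| Metric | Before fault injection | After fault injection |
| --- | --- | --- |
| Total response time | Milliseconds (upstream service time: 3 ms) | ~5 seconds |
| `x-envoy-upstream-service-time` | `3` | `3` |
| Server header | `envoy` | `istio-envoy` |
| HTTP status | `200 OK` | `200 OK` |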

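The response still succeeds with 200 OK, but the total time increased from milliseconds to about 5 seconds, matching the fixedDelay: 5s configuration. The x-envoy-upstream-service-time header still reports 3, which shows that HTTPBin itself responded in about 3 milliseconds and that the extra 5 seconds were added by the Envoy proxy before it forwarded the request. The server header also reads istio-envoy in this capture, which indicates that an ASM sidecar proxy is handling the request.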

Delay fault injection is especially useful for discovering timeout mismatches across a microservice call chain. If a calling service has a hard-coded timeout shorter than the injected delay, its requests fail during the test, which exposes the mismatch before it causes production incidents.

What's next

Adjust the percentage.value field to inject delays on a subset of traffic (for example, set it to 50 to delay 50% of requests).
Replace the delay block with an abort block to return HTTP error codes instead of delays. For the API specification, see Manage virtual services.
Add header-based match conditions to scope fault injection to specific users or traffic subsets, keeping other traffic unaffected.
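
The first and third suggestions above can be combined in one rule. The following is a sketch under stated assumptions, based on the Istio VirtualService API: the `x-fault-test` header name and its `"true"` value are hypothetical, chosen only for illustration, and the `50` percentage is the example value mentioned above.

```yaml
# Sketch only: delay half of the requests that carry a test header.
# The header name x-fault-test and its value are hypothetical.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: httpbin-vs
  namespace: default
spec:
  hosts:
    - httpbin
  http:
    # Rule 1: applies only to requests with the test header.
    - match:
        - headers:
            x-fault-test:
              exact: "true"
      fault:
        delay:
          fixedDelay: 5s
          percentage:
            value: 50        # about half of the matched requests are delayed
      route:
        - destination:
            host: httpbin
    # Rule 2: default route for all other traffic, with no fault injected.
    - route:
        - destination:
            host: httpbin
```

Under these assumptions, a request sent from the sleep pod with `-H "x-fault-test: true"` would be delayed about half of the time, while requests without the header fall through to the second rule and are not delayed. Rules are evaluated in order, so the matched rule must come before the default route.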