ChaosBlade x SkyWalking: High Availability Microservices Practices

By Ye Fei (GitHub Account: @tiny-x), an open-source community enthusiast and ChaosBlade committer, participating in promoting the ecosystem construction of chaos engineering in ChaosBlade.

Preface

In a distributed system architecture, it is difficult to assess the impact of a single fault on the entire system. This is because there are a wide variety of service components and intricate dependencies between services, and the request procedure is long. Imperfect basic services, such as monitoring alarms and log records, may also cause difficulties in fault response and troubleshooting. Therefore, it is a great challenge to build a highly available distributed system.

Chaos Engineering (CE) was developed specifically to overcome these challenges. Through injecting faults into the system in a controllable range or environment, developers can observe the system behavior and locate the system defects. By doing so, the ability and confidence to deal with chaos in a distributed system due to unexpected conditions are developed. Thus, the stability and high availability of the system are continuously improved.

The implementation process of Chaos Engineering is to formulate a chaos experiment plan, define steady metrics, make assumptions of fault-tolerant behavior in the system, execute chaos experiment, check steady metrics in system, and so on. For this reason, the entire chaos experiment process requires reliable, easy-to-use, and scenario-rich chaos experiment tools to inject faults. Complete distributed procedure tracking and system monitoring tools are also required. Tracking and monitoring tools aim to trigger the emergency response and early warning scheme, then to quickly locate the fault, and observe various system data metrics during the whole process.

In this article, we will introduce the chaos experiment tool ChaosBlade and the distributed system monitoring tool SkyWalking. We will also discuss about the high-availability microservices practices of ChaosBlade and SkyWalking through a microservices case.

Tool Introduction

ChaosBlade

ChaosBlade is a chaos engineering tool that follows the principles of chaos engineering experiments. It provides extensive fault scenarios to help distributed systems improve fault tolerance and recoverability. It can also implement the injection of underlying-layer faults, and ensure business continuity during migration to the cloud or Cloud Native system. ChaosBlade is easy-to-use, non-invasive, and highly scalable. It can continuously improve system stability and availability through fault injection within a controllable range or environment.

ChaosBlade is easy to use and supports a wide range of experiment scenarios, including the following:

Basic resources: Experimental scenarios such as CPU, memory, network, disk, and process.
Java applications: Databases, caches, messages, JVMs, and microservices. Any methods can be specified to be injected with fault.
C++ application: Scenarios like injecting delay, variables, and tampered returned values to specified methods or rows of code.
Docker container: Experimental scenarios such as disabling of containers, or CPU, memory, network, disk, and process in containers.
Cloud Native platform: Experimental scenarios on Kubernetes such as the CPU, memory, network, disk, and process. Pod network and Pod disabling. The experiment scenario of container as shown above.

ChaosBlade encapsulates scenarios into individual projects according to the domains. This not only realizes the scenario standardization in the domain, but also facilitates horizontal and vertical scaling of scenarios. By following the chaos experiment model, chaosblade CLI is called in a unified way.

SkyWalking

SkyWalking is an open source APM system that includes monitoring, tracing, and diagnosis for distributed systems in Cloud Native architecture.

The core features are as follows:

Analysis of services, service instances, and endpoint metrics
Root cause analysis
Service topology analysis
Analysis of services, service instances, and endpoint dependencies
Slow service and endpoint detection
Performance optimization
Distributed tracing and context propagation
Detection of database access metrics and slow database access statements (including SQL statements)
Alerts

Tool Installation and Usage

It is very easy and convenient to install and use ChaosBlade. Various ChaosBlade scenarios are uniformly called through chaosblade CLI. After downloading and decompressing the corresponding tar package, the blade executable file is then used to perform chaos experiments. For downloading tar package, see: https://github.com/chaosblade-io/chaosblade/releases

ChaosBlade Installation

The actual environment in this article is Linux-AMD64. Download the latest chaosblade-linux-amd64.tar.gz package and perform the following steps:

## Download
wget https://chaosblade.oss-cn-hangzhou.aliyuncs.com/agent/github/0.9.0/chaosblade-0.9.0-linux-amd64.tar.gz
## Decompress 
tar -zxf chaosblade-0.9.0-linux-amd64.tar.gz
## Set environment variables 
export PATH=$PATH:chaosblade-0.9.0/
## Test 
blade –h

ChaosBlade Usage

After the installation, the blade executable file is only required to create chaos experiments for all supported scenarios. First, use the blade -h command to check the service. After selecting the sub-command, the -h can be used downward to see the complete usage case and the detailed description of each parameter:

1) How to Use the Blade

The blade -h command can be executed to check which commands are supported:

An easy to use and powerful chaos engineering experiment toolkit

Usage:
  blade [command]

Available Commands:
  create      Create a chaos engineering experiment
  destroy     Destroy a chaos experiment
...

2) Create Experiment Scenarios

For example, to create a full-load CPU scenario, the blade create cpu fullload –h command can be executed to check specific scenario parameters and select relevant parameters to run the command:

Create chaos engineering experiments with CPU load

Usage:
  blade create cpu fullload

Aliases:
  fullload, fl, load

Examples:

# Create a CPU full load experiment
blade create cpu load

#Specifies two random kernel's full load
blade create cpu load --cpu-percent 60 --cpu-count 2
...

Flags:
      --blade-release string     Blade release package，use this flag when the channel is ssh
      --channel string           Select the channel for execution, and you can now select SSH
      --climb-time string        durations(s) to climb
      --cpu-count string         Cpu count
      --cpu-list string          CPUs in which to allow burning (0-3 or 1,3)
      --cpu-percent string       percent of burn CPU (0-100)
...

3) Resume the Experiment

ChaosBlade supports three methods to resume an experiment as shown in the followings:

After an experiment is successfully created, ChaosBlade returns a UID. The blade destroy uid command can be executed to resume the experiment.
Execute the blade destroy target action (such as "blade destroy cpu fullload") if no corresponding UID is available.
Add the "--timeout 10" parameter when creating an experiment. The experiment automatically resumes after being executed for ten seconds. Besides, the parameter can act as an expression, such as "--timeout 30m" for 3 minutes.

SkyWalking Installation and Usage

For more information about SkyWalking installation and usage, see: https://github.com/apache/skywalking/tree/v8.1.0/docs

After the tool is deployed, it's time to introduce the building of a highly available microservices system by taking cases as references. Through fault injection and system behavior observation, ChaosBlade and SkyWalking can locate problems and discover system defects to build a highly available microservices system.

Case on Application Fault Tolerance

In this case, a microservice application is deployed for experiments, and A/B testing is used to simulate system requests. The microservice application services include the frontend, shopping cart, recommendation service, product, order, and so on. The components include Springboot, Nacos, MySQL, Redis, Lettuce, Dubbo are used. ChaosBlade supports most of the application components. ChaosBlade is used to inject chaos experiments to verify the application fault tolerance capability. SkyWalking is used for application monitoring and problem locating.

Case Environment

Linux-AMD64, version CentOS-7.x
JDK 1.8
chaosblade-0.9.0. Download address: https://chaosblade.oss-cn-hangzhou.aliyuncs.com/agent/github/0.9.0/chaosblade-0.9.0-linux-amd64.tar.gz
skywalking-apm-8.1.0. Download address: https://www.apache.org/dyn/closer.cgi/skywalking/8.1.0/apache-skywalking-apm-8.1.0.tar.gz

Application Topology

The overall architecture of the application is as follows. The frontend calls cart and products through the powerful dependency of Dubbo.

Chaos Experiment Steps

Develop a chaos experiment plan
Define system steady metrics
Make assumptions about system fault tolerance behavior
Run chaos experiment
Check steady metrics
Record and resume chaos experiment
Fix the problems
Automated Continuous verification

Next, a chaos experiment will be performed using ChaosBlade according to the steps.

Case 1

1) Scenario

Create a chaos experiment plan, and then call downstream services to frequently delay data. Use A/B testing to simulate normal interface access to the cart. Start two threads, and perform interface access by 10,000 times.

ab -n 10000 -c 2 http://127.0.0.1:8083/cart

2) Monitoring metrics

Define system steady metrics, and select /cart endpoint in the SkyWalking console. The steady metrics are as follows:

The average response time (RT) was around 15 ms.
P99 metric is within 20 ms.

3) Expectation Assumption

Set the timeout period for calls to avoid client request blocking for a long time.
Configure the service blow policy/service degradation.

4) Chaos Experiment

The ChaosBlade installation and its simple usage are described in the previous section. Now, a latency fault (30 seconds of delay) is injected into the downstream Dubbo cart service by ChaosBlade. Then, blade create dubbo delay –h command is executed to check how Dubbo calls the delay:

Dubbo interface to do delay experiments, support provider and consumer

Usage:
  blade create dubbo delay

Examples:
# Invoke com.alibaba.demo.HelloService.hello() service, do delay 3 seconds experiment
blade create dubbo delay --time 3000 --service com.alibaba.demo.HelloService --methodname hello --consumer

Flags:
      --appname string          The consumer or provider application name
      --consumer                To tag consumer role experiment.
      --effect-count string     The count of chaos experiment in effect
      --effect-percent string   The percent of chaos experiment in effect
      --group string            The service group
  -h, --help                    help for delay
      --methodname string       The method name
      --offset string           delay offset for the time
      --override                only for java now, uninstall java agent
      --pid string              The process id
      --process string          Application process name
      --provider                To tag provider experiment
      --service string          The service interface
      --time string             delay time (required)
      --timeout string          set timeout for experiment in seconds
      --version string          the service version

Global Flags:
  -d, --debug        Set client to DEBUG mode
      --uid string   Set Uid for the experiment, adapt to docker

According to the case and parameter explanation, the upstream service client needs to inject a latency fault (30 seconds of latency). With SkyWalking, the Dubbo service information on the procedure can be easily found. Search for the procedure with the endpoint /cart first, and find the Dubbo service, as shown in the following figure:

Procedure search

Receive detailed information about the protocol.

Click in to check the detailed span information of the Dubbo service. After obtaining the URL of the Dubbo service, the parameters required to use ChaosBlade to inject the upstream service delay are obtained. Therefore, our final parameter structure is:

--time 30000: 30s of delay
--service com.alibabacloud.hipstershop.cartserviceapi.service.CartService: Service
--methodname viewCart: Service method
--process frontend: Java process
--consumer: Currently a Dubbo service client

Issue command and inject fault:

blade create dubbo delay --time 30000 --service com.alibabacloud.hipstershop.cartserviceapi.service.CartService --methodname viewCart --process frontend --consumer

5) Monitoring Metrics

Check the system metrics after fault injection, and check the metrics on SkyWalking:

The average RT is about 2,000 ms, and the P99 metric is about 2,000 ms.

An error is reported on /cart interface calling that the "com.alibabacloud.hipstershop.cartserviceapi.service.CartService" service is abnormal.

A timeout error occurs. The timeout period is 2,000 ms.

Conclusion: The upstream service is configured with the call timeout period, but without service blow policy, which is actually not as expected.

Policy	Problem Description	Expectation
Call timeout period	The client can receive a response in two seconds at most.	Expected
Service blow policy/service degradation	The response time of each client request is more than two seconds, and the page returns the unreadable error message to the users.	Not as expected

6) Problem Fix

Configure the service blow policy/service degradation.

5. Case 2

1) Scenario

During running, Dubbo service provider fails to access the registry. The fault of network packet loss (100%) is injected in the registry.

2) Monitoring Metrics

Define system steady metrics and select service endpoints in the SkyWalking console. The steady metrics are as follows:

The "com.alibabacloud.hipstershop.cartserviceapi.service.CartService.viewCart" service is normal.

3) Expectation Assumption

Upstream and Downstream services will not be affected.

4) Chaos Experiment

Inject packet loss fault (100%) in registry port. In this case, Nacos is used as the registry for Dubbo. The default port is 8848, and the network interface card (NIC) is eth0. The command parameters are as follows:

--interface eth0: NIC
--percent 100: 100% of packet loss rate
--local-port: Local port 8848

Issue command and inject fault:

blade create network loss --interface eth0 --percent 100 --local-port 8848

5) Monitoring Metrics

Select the service endpoint in the SkyWalking console after fault injection. Steady metrics are as follows:

The "com.alibabacloud.hipstershop.cartserviceapi.service.CartService.viewCart" service is normal.

Conclusion: The service is weakly dependent on the registry and the service itself has a local cache, which is in line with the expected assumption.

Policy	Problem description	Expectation
Local cache for the service	Upstream and downstream services are not affected.	Expected

Assume that the application is now deployed in a Kubernetes cluster. The horizontal scaling capability of the registry can be verified. ChaosBlade also supports Kubernetes cluster scenarios. .

Simple Practice

In the above cases, it has been tested whether the service is configured with timeout and blow policies. The weak dependency of Dubbo on the registry and the local cache for the service are also verified. Can't wait to experience it in your own system, right? ChaosBlade provides a wide range of experiment scenarios for everyone. It not only supports basic resources and applications, but also is a powerful tool for Cloud Native platforms. ChaosBlade is user-friendly and provides detailed parameters to control the minimum explosion radius of the fault. ChaosBlade will make it very easy for everyone to get started.

It is not enough to only talk on paper. Here an additional simple case is provided for everyone to practice. We often deal with relational databases in application development. When the application traffic increases rapidly, bottlenecks often occur on the database side, resulting in a lot of slow SQLs. When there is no slow SQL alert, it is difficult to find the original SQL for optimization. Therefore, slow SQL alert is very important. To verify whether an application supports this capability, ChaosBlade injects slow SQL fault of MySQL. Then, it executes "blade create mysql delay –h" command to see how MySQL calls the delay:

Mysql delay experiment

Usage:
  blade create mysql delay

Examples:
# Do a delay 2s experiment for mysql client connection port=3306 INSERT statement
blade create mysql delay --time 2000 --sqltype select --port 3306

Flags:
      --database string         The database name which used
      --effect-count string     The count of chaos experiment in effect
      --effect-percent string   The percent of chaos experiment in effect
  -h, --help                    help for 
      --host string             The database host
      --offset string           delay offset for the time
      --override                only for java now, uninstall java agent
      --pid string              The process id
      --port string             The database port which used
      --process string          Application process name
      --sqltype string          The sql type, for example, select, update and so on.
      --table string            The first table name in sql.
      --time string             delay time (required)
      --timeout string          set timeout for experiment in seconds

Global Flags:
  -d, --debug        Set client to DEBUG mode
      --uid string   Set Uid for the experiment, adapt to docker

As shown, ChaosBlade provides a complete example and supports parameters with a smaller granularity, such as SQL types and table names. Try to perform a 10s of delay for the "select" operation when port 3306 is connected. Is there an alert in your application when the traffic hits?

blade create mysql delay --time 10000 --sqltype select --port 3306

Command parameter explanation:

--time 10000: 10s of delay
--sqltype select: Only select type of SQL statements is supported.
--port 3306: Only connections to port 3306 are supported.

Summary

This article describes the application of Chaos Engineering in complex distributed architectures. It also introduces chaos experiments with ChaosBlade and SkyWalking to analyze and optimize the system, based on the fault performance. Thus, the stability and high availability of the system is continuously improved. ChaosBlade not only supports basic resources and applications, but also serves as a useful tool on the Cloud Native platform. You are welcomed to use it.

ChaosBlade project is available at this address: https://github.com/chaosblade-io/chaosblade. You are welcome to join us and work together! Visit the following page for the Contribution Guide.

Community

ChaosBlade x SkyWalking: High Availability Microservices Practices

Preface

Tool Introduction

ChaosBlade

SkyWalking

Tool Installation and Usage

ChaosBlade Installation

ChaosBlade Usage

1) How to Use the Blade

2) Create Experiment Scenarios

3) Resume the Experiment

SkyWalking Installation and Usage

Case on Application Fault Tolerance

Case Environment

Application Topology

Chaos Experiment Steps

Case 1

1) Scenario

2) Monitoring metrics

3) Expectation Assumption

4) Chaos Experiment

5) Monitoring Metrics

6) Problem Fix

5. Case 2

1) Scenario

2) Monitoring Metrics

3) Expectation Assumption

4) Chaos Experiment

5) Monitoring Metrics

Simple Practice

Summary

Read previous post:

Read next post:

Alibaba Cloud Native Community

You may also like

Comments

Dikky Ryan Pratama May 9, 2023 at 5:51 am

Alibaba Cloud Native Community

Related Products

Container Service for Kubernetes

Microservices Engine (MSE)

Managed Service for Prometheus

ACK One