Kubernetes Stability Assurance Handbook – Part 3: Observability

Part 3 of this 3-part series discusses the concept of observability, problem domains, and requirements at different levels based on the Kubernetes Stability Assurance Handbook.

By Wupeng

Series about Kubernetes Stability Assurance Handbook:

Kubernetes Stability Assurance Handbook - Highlights
Kubernetes Stability Assurance Handbook - Logs
Kubernetes Stability Assurance Handbook - Observability (this article)

With the increasing emphasis on stability and the popularity of community observability projects, observability has become a hot topic. People have different understandings from different perspectives.

This article builds a high-level understanding of observability starting from the software development lifecycle. It then looks at how observability can be understood and practiced from the perspectives of SRE and Serverless.

Objectives

Improve understanding and competitiveness by seeing the whole picture
Create room for future growth through sound design and practice

Goals

Reach a shared understanding of observability
Agree on the direction in which observability should develop

What Is Observability?

Wikipedia defines observability as, "In control theory, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs."

Consider a physical system modeled in a state-space representation. A system is said to be observable if, for any possible evolution of state and control vectors, the current state can be estimated only using the information from outputs. Physically, this generally corresponds to information obtained by sensors. In other words, one can determine the behavior of the entire system from the system's outputs. On the other hand, if the system is not observable, there are state trajectories that are not distinguishable by only measuring the outputs.

In short, observability is the ability to infer the internal state of a system from its external outputs.
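
For the linear time-invariant case, the definition above has a well-known concrete test, the Kalman rank condition: a system with state matrix A (n by n) and output matrix C is observable when the stacked matrix [C; CA; ...; CA^(n-1)] has rank n. The following is a minimal sketch of that test using only the Python standard library. The matrices are made-up values chosen for illustration, and the helper names (`mat_mul`, `rank`, `is_observable`) are hypothetical.

```python
from fractions import Fraction


def mat_mul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def rank(matrix):
    """Compute the rank by Gaussian elimination over exact fractions."""
    rows = [[Fraction(v) for v in row] for row in matrix]
    rank_count = 0
    for col in range(len(rows[0])):
        # Find a row at or below the current pivot position with a non-zero entry.
        pivot = next((r for r in range(rank_count, len(rows)) if rows[r][col] != 0), None)
        if pivot is None:
            continue
        rows[rank_count], rows[pivot] = rows[pivot], rows[rank_count]
        for r in range(rank_count + 1, len(rows)):
            factor = rows[r][col] / rows[rank_count][col]
            rows[r] = [x - factor * y for x, y in zip(rows[r], rows[rank_count])]
        rank_count += 1
    return rank_count


def is_observable(a, c):
    """Kalman rank test: stack C, CA, ..., CA^(n-1) and check for rank n."""
    n = len(a)
    blocks, current = [], c
    for _ in range(n):
        blocks.extend(current)
        current = mat_mul(current, a)
    return rank(blocks) == n


# Illustrative two-state system (values are assumptions, not from the article).
A = [[0, 1], [-2, -3]]
print(is_observable(A, [[1, 0]]))  # sensor reads the first state: prints True
print(is_observable(A, [[1, 1]]))  # sensor reads the sum of both states: prints False
```

In the second case one internal mode never shows up in the output, so two different state trajectories produce the same measurements. This is the "not distinguishable" situation described above.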

The following figure simplifies the system composition and interaction between systems:

From the interaction diagram above, the interaction behavior of the system has the following forms:

Within a system: a closed loop inside a single component, which does not interact with other components or systems
Within a system: interaction between components
Between systems: interaction between systems

The internal state of a system can therefore be inferred from its external outputs through two kinds of information:

Information of the component closed loop
Information flowing between components or systems
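
The two kinds of information can be sketched as follows. This is an illustration only: the `Component` class, the component names, and the trace identifier are hypothetical. Closed-loop information stays inside a component, while flow information is recorded each time a call crosses a boundary and is tied together by a shared identifier.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    name: str
    # Closed-loop information: events that never leave the component.
    internal_events: list = field(default_factory=list)

    def housekeeping(self):
        """Work that stays inside the component, e.g. a cache cleanup."""
        self.internal_events.append(f"{self.name}: cache compacted")

    def call(self, target, trace_id, flow_log):
        """Work that crosses a boundary; recorded with a shared trace id."""
        flow_log.append({"trace_id": trace_id, "from": self.name, "to": target.name})


flow_log = []  # information flowing between components or systems
gateway, orders, billing = Component("gateway"), Component("orders"), Component("billing")

trace_id = "trace-001"  # hypothetical identifier carried along the request
gateway.call(orders, trace_id, flow_log)
orders.call(billing, trace_id, flow_log)
orders.housekeeping()

# Reconstruct the request path from the flow information alone.
path = [(hop["from"], hop["to"]) for hop in flow_log if hop["trace_id"] == trace_id]
print(path)                    # [('gateway', 'orders'), ('orders', 'billing')]
print(orders.internal_events)  # ['orders: cache compacted']
```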

What Is the Problem Domain of Observability?

The core of observability is to meet the needs of different people to understand the state of the system through observational data. The lifecycle of observational data is abstracted in the following diagram:

Observational data is generated by applications, stored after intermediate processing, and queried for consumers.

Observational data serves different types of consumers, such as product users, businesses, R&D personnel, and site reliability engineers (SREs). Different consumers use the data in different forms, including service level agreements (SLAs), service level objectives (SLOs), service level indicators (SLIs), and alerts.
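
As a rough illustration of how these forms relate, the sketch below treats the SLI as a measured success ratio, the SLO as a target for that ratio, and an alert as the signal raised when the SLI falls below the SLO. An SLA would be the external commitment built on top of such targets and is not modeled here. The function names, the 99% target, and the request counts are assumptions made for the example.

```python
def availability_sli(outcomes):
    """SLI: the measured fraction of successful requests."""
    if not outcomes:
        return 1.0  # no traffic: treated as meeting the objective (an assumption)
    return sum(outcomes) / len(outcomes)


def evaluate(outcomes, slo_target):
    """Compare the SLI with the SLO and decide whether to alert."""
    sli = availability_sli(outcomes)
    return {"sli": sli, "slo_target": slo_target, "alert": sli < slo_target}


# Hypothetical observation window: 98 successful requests and 2 failures.
outcomes = [True] * 98 + [False] * 2
result = evaluate(outcomes, slo_target=0.99)
print(result)  # {'sli': 0.98, 'slo_target': 0.99, 'alert': True}
```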

Based on the lifecycle of observational data, the problem domains of observability are roughly summarized below:

Generation

Data Model of Observational Data
Generation of Observational Data
Export of Observational Data

Processing

Collection of Observational Data
Processing of Observational Data
Export of Observational Data

Storage

Storage of Observational Data
Query of Observational Data
Use of Observational Data

Use

Consumption of Observational Data
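
The four stages above can be sketched as a tiny in-memory pipeline. This is a simplified illustration, not a description of any particular product. The record fields, the `Store` class, and the service and cluster names are all hypothetical.

```python
import time


def generate(service, message, level="INFO"):
    """Generation: the application emits a record in a fixed data model."""
    return {"ts": time.time(), "service": service, "level": level, "message": message}


def process(records, cluster):
    """Processing: drop noisy records and enrich the rest before export."""
    return [dict(r, cluster=cluster) for r in records if r["level"] != "DEBUG"]


class Store:
    """Storage: keep records and answer simple equality queries."""

    def __init__(self):
        self._records = []

    def write(self, records):
        self._records.extend(records)

    def query(self, **filters):
        return [r for r in self._records
                if all(r.get(key) == value for key, value in filters.items())]


raw = [
    generate("checkout", "order created"),
    generate("checkout", "cache probe", level="DEBUG"),
    generate("checkout", "payment timeout", level="ERROR"),
]
store = Store()
store.write(process(raw, cluster="demo-cluster"))

# Use: a consumer asks a question of the stored data.
errors = store.query(level="ERROR")
print(len(errors), errors[0]["message"])  # 1 payment timeout
```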

What Is the Service Goal of Observability in the Software Development Lifecycle?

From the project perspective, the software development lifecycle involves the following steps:

Refine the Steps:

There are four types of roles in the software development lifecycle. The observability objectives of the four roles are different:

Note:

Reliability is not the same as stability. Reliability covers both stability and the timely fulfillment of functional requirements.

Directions for SRE Investment

Basic Services:

OpenTelemetry can be used as a basis to implement the items above. For more details, please see: A Brief Look at OpenTelemetry.

Additionally, visual stability assurance services can be explored, which can help discover, locate, and solve problems quickly from a global perspective. The diagram below shows the health status of components themselves and interactions between them:

On this basis, an overall view of the cluster status can be maintained, and exception information can be associated with the affected components, which makes it possible to solve problems in a targeted manner.
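
One way to picture this association is a small dependency graph. Starting from an unhealthy component, the sketch walks the graph to find every component that depends on it, then gathers the exceptions reported in that area. The health values, call edges, and exception messages are invented for illustration, and `impacted_by` is a hypothetical helper.

```python
from collections import defaultdict, deque


def impacted_by(unhealthy, calls):
    """Return every component that depends, directly or transitively, on `unhealthy`."""
    callers = defaultdict(set)
    for caller, callee in calls:
        callers[callee].add(caller)
    seen, queue = set(), deque([unhealthy])
    while queue:
        node = queue.popleft()
        for caller in callers[node]:
            if caller not in seen:
                seen.add(caller)
                queue.append(caller)
    return seen


# Illustrative snapshot (assumed values): component health and call edges.
health = {"api-server": "healthy", "etcd": "degraded",
          "scheduler": "healthy", "kubelet": "healthy"}
calls = [("api-server", "etcd"), ("scheduler", "api-server"), ("kubelet", "api-server")]
exceptions = [
    {"component": "etcd", "message": "slow disk fsync"},
    {"component": "kubelet", "message": "request to api-server timed out"},
]

for name, status in health.items():
    if status != "healthy":
        blast_radius = impacted_by(name, calls)
        related = [e for e in exceptions
                   if e["component"] == name or e["component"] in blast_radius]
        print(name, sorted(blast_radius), len(related))
        # etcd ['api-server', 'kubelet', 'scheduler'] 2
```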

Observability in Serverless Scenarios

Serverless computing is a promising cloud computing execution model. Alibaba Cloud provides various related products:

One of the main differences between Serverless computing environments is how long the runtime environment lasts. Starting from this difference, the core of observability in a Serverless computing environment can be abstracted, and corresponding solutions can then be proposed:

Depending on how long the runtime environment persists, its duration can be divided into three types:

Within a few days
Within a few hours
Within a few minutes or seconds

All of these runtime environments can be implemented using technologies such as virtual machines, containers, and WebAssembly. The difference lies in the duration of the runtime environment as defined by the business layer.

The core concerns of the platform and of users change with the duration of the runtime environment:

Runtime environments that last a few days: platforms focus on providing reliable runtime environments in which users manage their applications freely. For observability, the reliability of the runtime environment is the core concern of platforms, while the stability of the application environment and the response performance of requests remain the focus of users.
Runtime environments that last a few hours: platforms focus on providing management services around applications, while users pay more attention to their business. For observability, the running stability of applications and the response performance of requests are the core concerns of platforms, while business features remain the focus of users.
Runtime environments that last a few minutes or seconds: platforms focus on managing fine-grained business logic for users, while users pay more attention to the sensitive features of the business. For observability, request response reliability and business features are the core concerns of platforms, while core business features remain the focus of users.
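
The mapping above can be written down as a small lookup. The article only names the three tiers, so the cut-offs used here (one hour and one day) are assumptions chosen to make the sketch runnable, and `observability_focus` is a hypothetical helper.

```python
from datetime import timedelta

# Each entry: (upper bound on lifetime, tier, platform focus, user focus).
# The bounds are assumptions; None means "no upper bound".
TIERS = [
    (timedelta(hours=1), "minutes or seconds",
     "request response reliability and business features",
     "core business features"),
    (timedelta(days=1), "hours",
     "application stability and request response performance",
     "business features"),
    (None, "days",
     "reliability of the runtime environment",
     "application stability and request response performance"),
]


def observability_focus(lifetime):
    """Return the tier and the platform/user focus for a runtime lifetime."""
    for upper_bound, tier, platform, user in TIERS:
        if upper_bound is None or lifetime < upper_bound:
            return {"tier": tier, "platform": platform, "user": user}


print(observability_focus(timedelta(seconds=30))["tier"])  # minutes or seconds
print(observability_focus(timedelta(hours=6))["tier"])     # hours
print(observability_focus(timedelta(days=3))["tier"])      # days
```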

For the FaaS scenario, the Thundra demo is a good reference. Three examples are excerpted below:

Function

Application

Architecture

Summary

An in-depth understanding of the concept of observability, its problem domains, and its requirements at different levels helps deepen your appreciation of observability. With that appreciation, observability can be integrated with the business to make the business more competitive, and understanding improves iteratively as technology and business reinforce each other.
