My Observations and Thoughts on Cloud-Native Software Architecture

In the article "Cloud Native Infrastructure", we discussed how cloud-native computing spans three dimensions: cloud-native infrastructure, software architecture, and the delivery and operations system. This article focuses on the software architecture level.

In my understanding, software architecture mainly aims to address the following challenges:

1. Control complexity. Because business logic is complex, we need better means to help R&D organizations overcome cognitive obstacles and divide work and collaborate more effectively. Techniques such as divide and conquer and separation of concerns serve exactly this purpose.

2. Deal with uncertainty. Business moves fast and requirements change constantly. Even with a perfect software architecture, adjustments are inevitable as time passes and teams change. Reading "Design Patterns", "Building Microservices", and similar books, the word "decoupling" appears between the lines everywhere. They remind us to separate certainty from uncertainty in the architecture, improving both its stability and its adaptability.

3. Manage systemic risks. Manage certain and uncertain risks in the system, avoid known pitfalls, and prepare for unknown risks.

The goal of cloud-native application architecture is to build loosely coupled, elastic, and resilient distributed applications that can respond better to changing business requirements while keeping the system stable. This article shares my observations and thoughts in this field.

Origins: The Twelve-Factor App
In 2012, Heroku co-founder Adam Wiggins published the Twelve-Factor App manifesto. It defines basic principles and methodologies for building elegant Internet applications and has widely influenced many microservice architectures. The twelve factors focus on: healthy growth of the application, effective collaboration among developers, and avoiding software architecture erosion. Its content is worth careful study by every developer today.

The Twelve-Factor App gives us excellent architectural guidance, helping us to:

1. Build a horizontally scalable elastic application architecture to better support Internet-scale applications.

2. Improve the standardization and automation level of the R&D process and improve R&D efficiency.

3. Reduce the difference between the development environment and the production environment, and use continuous delivery to implement agile development.

4. Improve the portability of applications, making them suitable for cloud deployment and reducing resource costs and management complexity.
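As a small illustration of factor III ("store config in the environment"), a service might read all deployment-specific settings from environment variables so the same build artifact runs unchanged in development, staging, and production. The variable names and defaults below are hypothetical:

```python
import os

def load_config():
    """Read deployment-specific settings from environment variables
    (12-factor, factor III) instead of hard-coding them. The variable
    names and defaults here are illustrative, not a fixed convention."""
    return {
        "database_url": os.environ.get("DATABASE_URL", "sqlite:///dev.db"),
        "port": int(os.environ.get("PORT", "8080")),
        "log_level": os.environ.get("LOG_LEVEL", "INFO"),
    }

# The same code picks up different values in each environment,
# with no rebuild and no environment-specific branches.
config = load_config()
```

Because configuration lives outside the artifact, promoting a release between environments changes only the environment, never the code.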

Loosely coupled architecture design

The core idea of microservices is that each service in the system can be developed, deployed, and upgraded independently, and services are loosely coupled with one another. Cloud-native application architecture pushes this further, emphasizing loose coupling and reducing interdependence between services.

API-first application architecture design
In object-oriented software architecture, the most important task is defining objects and their interface contracts. The SOLID principles are the best-known design principles:

Single Responsibility Principle

Open/Closed Principle

Liskov Substitution Principle

Interface Segregation Principle

Dependency Inversion Principle

The initials of these five principles spell SOLID; together they help us build a highly cohesive, loosely coupled, and flexible application architecture. In distributed microservice architectures, API-first is the natural extension of contract-first design.

APIs should be designed first: user needs are complex and changeable. For example, from desktop to mobile, an application's presentation and workflows may differ, but the conceptual model of the business logic and the service interactions are relatively stable. The API contract is therefore more stable than any particular implementation, which can be iterated on and changed continuously. A well-defined API better guarantees the quality of the application system.

APIs should be declarative and self-describing: with standardized descriptions, APIs are easy to communicate, understand, and verify, which simplifies development collaboration. This supports parallel development by service consumers and providers, shortening the development cycle, and allows different technology stacks to implement the same contract: for the same API, the service might be implemented in Java while front-end consumers use JavaScript and other back-end callers use Go. This gives development teams the flexibility to choose the right technology for their skills and system requirements.

APIs should have SLAs: as the integration points between services, APIs are closely tied to system stability, so SLAs should be part of API design rather than an afterthought of deployment. In a distributed system, stability risks are everywhere. With an API-first design pattern, we can carry out stability architecture design and capacity planning per independent service, and we can also run fault injection and stability drills against individual APIs to eliminate systemic stability risks.

In the world of APIs, the most important trend is the rise of standardized technologies. gRPC is a high-performance, general-purpose, platform-independent RPC framework open sourced by Google. It adopts a layered design; its data exchange format is based on Protobuf (Protocol Buffers), which offers excellent serialization/deserialization efficiency and supports many development languages. For the transport layer, gRPC chooses HTTP/2, which greatly improves transmission efficiency over HTTP/1.1. In addition, HTTP/2, as a mature open standard, brings rich security and flow control capabilities as well as good interoperability. gRPC can be used not only for server-to-server calls but also for interactions between browsers, mobile apps, IoT devices, and back-end services. gRPC already provides complete RPC capabilities and an extension mechanism to support new features.
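As a sketch of what such a contract-first definition looks like, here is a minimal, hypothetical Protobuf service definition; the package, service, and message names are invented for illustration:

```protobuf
syntax = "proto3";

package inventory.v1;

// Hypothetical contract: consumers and providers can be developed
// in parallel against this definition, each in its own language.
service InventoryService {
  rpc GetItem (GetItemRequest) returns (Item);
}

message GetItemRequest {
  string item_id = 1;
}

message Item {
  string item_id = 1;
  string name = 2;
  int64 quantity = 3;
}
```

From this single file, gRPC tooling can generate clients and servers for Java, Go, JavaScript, and other languages, so the contract, not the implementation, is the stable center of the design.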

Under the cloud-native trend, the demand for interoperability across platforms, vendors, and environments will inevitably favor RPC technologies based on open standards, and gRPC fits this historical trend and is being used more and more widely. In the microservice field, Dubbo 3.0 announced support for the gRPC protocol; in the future we will see more microservice architectures built on gRPC with good multi-language support. In the data services field, gRPC has also become an excellent choice; you can refer to Alluxio's article on the topic.

In addition, in the API field, Swagger (the OpenAPI Specification) and GraphQL are open standards worth your attention. You can choose flexibly according to your business requirements, so this article will not cover them further.

The Rise of Event Driven Architecture
Before discussing event-driven architecture (EDA), let's first define what an event is. An event is a record of something that has happened, such as a state change. Events are immutable (they cannot be changed or deleted) and are ordered by creation time. Interested parties can be notified of these state changes by subscribing to published events and can then apply business logic of their choice to act on the information.

Event-driven architecture is an architectural approach for building a loosely coupled microservice system. Microservices interact through asynchronous event communication.

The event-driven architecture realizes the complete decoupling of event producers and consumers. Producers don't need to pay attention to how events are consumed, and consumers don't need to pay attention to how events are produced; we can dynamically add more consumers without affecting producers, and add message middleware to dynamically route and convert events. This also means that event producers and consumers do not have timing dependencies. Even if messages cannot be processed in time due to application downtime, the program can continue to obtain these events from the message queue and continue to execute after recovery. Such a loosely coupled architecture provides higher agility, flexibility, and robustness for the software architecture.

Another important advantage of the event-driven architecture is to improve the scalability of the system. Event producers will not be blocked while waiting for event consumption, and can use Pub/Sub mode to allow multiple consumers to process events in parallel.
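The decoupled Pub/Sub flow described above can be sketched in-process. Here a simple event bus (an in-memory stand-in for real message middleware) fans each published event out to independent consumers running in parallel; the producer never knows who, or how many, the consumers are:

```python
import queue
import threading

class EventBus:
    """In-process stand-in for a message broker: each subscriber gets
    its own queue, so producers and consumers stay fully decoupled."""
    def __init__(self):
        self.subscribers = []

    def subscribe(self):
        q = queue.Queue()
        self.subscribers.append(q)
        return q

    def publish(self, event):
        # Fan the event out; the producer is never blocked on consumption.
        for q in self.subscribers:
            q.put(event)

bus = EventBus()
audit_q = bus.subscribe()    # hypothetical consumer 1
billing_q = bus.subscribe()  # hypothetical consumer 2

results = []
def consume(q, name):
    event = q.get()          # blocks until an event arrives
    results.append((name, event))

t1 = threading.Thread(target=consume, args=(audit_q, "audit"))
t2 = threading.Thread(target=consume, args=(billing_q, "billing"))
t1.start(); t2.start()

bus.publish({"type": "order.created", "order_id": 42})
t1.join(); t2.join()
```

Adding a third consumer requires only another `subscribe()` call; the producer's code is untouched, which is exactly the dynamic extensibility the text describes.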

Event-driven architecture also integrates naturally with Function as a Service (FaaS): events trigger functions that execute business logic, and the "glue code" integrating multiple services can also live in functions, making it simple and efficient to build event-driven applications.

But there are still many challenges in the EDA architecture.

1. A distributed, loosely coupled architecture greatly increases the complexity of the application infrastructure. Cloud-based delivery and managed cloud services (message queues, function compute, etc.) can improve the stability, performance, and cost-effectiveness of such an architecture.

2. Compared with traditional synchronous processing, asynchronous event handling raises additional concerns around event ordering, idempotency, callbacks, and exception handling, making the overall design harder.

3. In most cases, maintaining data consistency is very challenging because distributed transactions across multiple systems are unavailable. Developers may need to trade off availability against consistency, for example using event sourcing to achieve eventual consistency.

4. Interoperability. In the real world, events are ubiquitous, yet different producers describe events differently. Developers want to be able to build event-driven applications in a consistent manner regardless of where the events originate. CloudEvents is a specification that describes event data in a common and consistent manner. It was proposed by the CNCF Serverless working group to improve the portability of event-driven applications. Currently, event processing middleware such as Alibaba Cloud EventBridge and Azure Event Grid, as well as FaaS technologies such as Knative Eventing and Alibaba Cloud Function Compute, support CloudEvents.

Thanks to these architectural advantages, EDA has very broad prospects in Internet application architecture, business data and intelligence, IoT, and other scenarios. We will not discuss EDA architecture further here.

Delivery Oriented Application Architecture

In cloud-native software architecture, we pay attention not only to how the software is built during the design phase but also, beginning with the end in mind, to how to design and implement the software so that it can be delivered and operated well.

Decoupling applications and operating environments

The Twelve-Factor App already proposed decoupling applications from their operating environment, and the emergence of Docker containers strengthened this idea further. A container is a lightweight application virtualization technology: containers share the operating system kernel and support second-level startup. A Docker container image is a self-contained application packaging format that bundles the application with its dependencies (system libraries, configuration files, and so on) to keep deployments consistent across environments.

Containers can serve as the basis of immutable infrastructure to improve the stability of application delivery. Immutable infrastructure is a concept proposed by Chad Fowler in 2013: in this model, any infrastructure instance (servers, containers, and other software and hardware) becomes read-only once created; no changes are made to it afterward. If instances need to be modified or upgraded, a batch of new instances is created to replace them. This model reduces the burden of configuration management, makes configuration changes and upgrades reliably repeatable, and avoids troublesome configuration drift; it smooths out differences between deployment environments, making continuous integration and continuous deployment flow more easily; and it supports better version management, allowing quick rollback when a deployment goes wrong.

As a distributed orchestration and scheduling system for containers, Kubernetes further improves the portability of containerized applications. Through a series of abstractions such as LoadBalancer Services, Ingress, CNI, and CSI, Kubernetes shields business applications from differences in the underlying infrastructure so they can migrate flexibly. With these capabilities, workloads can move dynamically across data centers, edge computing, and cloud environments.

In the application architecture, we should avoid coupling static environment information, such as IP or MAC addresses, into application logic. In a microservice architecture, ZooKeeper or Nacos can handle service registration and discovery; in Kubernetes, Services and service meshes reduce dependence on specific service endpoint IPs. In addition, application state should be persisted through distributed storage or cloud services wherever possible, which greatly improves the scalability and self-healing capabilities of the application architecture.

Self-contained observability

One of the biggest challenges faced by distributed systems is observability. Observability can help us understand the current state of the system and serve as the basis for application self-healing, elastic scaling, and intelligent operation and maintenance.

In a cloud-native architecture, microservice applications are self-contained: they should expose their own observability so the platform can easily manage and probe them. First, an application should be able to report its own health status.

In Kubernetes, a business application can provide a liveness probe, which checks whether the application is alive via TCP, HTTP, or a command. For an HTTP probe, Kubernetes visits the address periodically; if the response code falls outside the 200–399 range, the container is considered unhealthy and is killed and recreated.

For applications that start slowly, to avoid routing traffic before startup is complete, Kubernetes also lets a business container provide a readiness probe. For an HTTP probe, Kubernetes visits the address periodically; if the response code falls outside the 200–399 range, the container is considered unable to serve external traffic, and requests are not scheduled to it.
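Declared in a Pod spec, the two probes described above might look like this; the image name, paths, ports, and timings are placeholders:

```yaml
# Hypothetical container spec: /healthz and /ready are placeholder paths.
containers:
  - name: my-app
    image: my-app:1.0
    livenessProbe:          # failing this probe restarts the container
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 10
      periodSeconds: 15
    readinessProbe:         # failing this probe removes it from Service endpoints
      httpGet:
        path: /ready
        port: 8080
      periodSeconds: 5
```

The key difference: a failing liveness probe triggers a restart, while a failing readiness probe only stops traffic from being routed to the container.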

Observability probes are also built into newer microservice frameworks. For example, Spring Boot 2.3 added two new actuator endpoints, /actuator/health/liveness and /actuator/health/readiness; the former serves as the liveness probe and the latter as the readiness probe. Business applications can read, subscribe to, and modify the liveness state and readiness state through Spring's event mechanism, which lets the Kubernetes platform perform more accurate self-healing and traffic management.

In addition, application observability includes three key capabilities: logging, metrics, and tracing.

1. Logging (event stream): records discrete events, with details about a particular point or stage of program execution. This includes not only application and OS logs but also logs produced during operations, such as audit logs.

2. Metrics: typically aggregatable time-series data of fixed types such as Counter, Gauge, and Histogram. Monitoring is multi-layered, covering indicators for compute, storage, network, and other infrastructure services as well as application performance and business metrics.

3. Tracing: records the complete processing flow of a single request, giving developers of distributed applications full call-chain reconstruction, call statistics, and dependency analysis, which helps them quickly analyze and diagnose performance and stability bottlenecks in a distributed architecture.
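The three metric types mentioned above can be illustrated with a hand-rolled sketch; a real application would use a client library such as prometheus_client, so this is only to show the semantics:

```python
class Counter:
    """Monotonically increasing value, e.g. total requests served."""
    def __init__(self):
        self.value = 0
    def inc(self, n=1):
        self.value += n

class Gauge:
    """Value that can go up or down, e.g. current in-flight requests."""
    def __init__(self):
        self.value = 0
    def set(self, v):
        self.value = v

class Histogram:
    """Observations counted into cumulative buckets by upper bound,
    e.g. request latency in seconds."""
    def __init__(self, buckets):
        self.buckets = sorted(buckets)
        self.counts = {b: 0 for b in self.buckets}
    def observe(self, v):
        for b in self.buckets:
            if v <= b:
                self.counts[b] += 1

requests = Counter()
inflight = Gauge()
latency = Histogram(buckets=[0.1, 0.5, 1.0])

requests.inc()
inflight.set(3)
latency.observe(0.3)  # counted in the 0.5 and 1.0 buckets (cumulative)
```

All three are aggregatable: a scraper can sum counters across instances or merge histogram buckets, which is what makes them suitable for time-series monitoring.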

In a distributed system, stability, performance, and security issues can occur anywhere, so full-chain observability is required, covering the infrastructure layer, the PaaS layer, and the application layer, with the ability to correlate, aggregate, query, and analyze observability data across different systems.

The observability field of software architecture has broad prospects and many technological innovations have emerged. In September 2020, CNCF published a technology radar for cloud-native observability.

Among them, Prometheus has become one of the open source monitoring tools preferred by enterprises for cloud native applications. Prometheus fosters an active developer and user community. In the Spring Boot application architecture, by introducing the dependency of micrometer-registry-prometheus, the monitoring indicators of the application can be collected by the Prometheus service.

In the field of distributed tracing, OpenTracing is an open source project under the CNCF: a vendor-neutral distributed tracing specification providing unified interfaces so developers can plug one or more tracing implementations into their services. Jaeger is Uber's open source distributed tracing system; it is compatible with the OpenTracing standard and has graduated from the CNCF. In addition, OpenTelemetry is an emerging standard that merges the OpenTracing and OpenCensus projects into a unified technical standard.

For many legacy business systems, existing applications lack complete observability capabilities. Emerging service mesh technology offers a new way to improve system observability: by intercepting requests in the data-plane proxy, the mesh can collect performance metrics for inter-service calls, and the calling application only needs to forward a few message headers for complete tracing information to be assembled in the mesh. This greatly simplifies building observability and lets existing applications join the cloud-native observability system at low cost.

Alibaba Cloud provides rich observability capabilities. XTrace distributed tracing supports the OpenTracing/OpenTelemetry standards, and ARMS provides a managed Prometheus service, freeing developers from the availability and capacity challenges of running it themselves. Observability is the foundation of AIOps and will play an even more important role in future enterprise IT architectures.

Design for Failure - Design For Failure
Murphy's Law tells us that "anything that can go wrong will go wrong." Distributed systems can fail because of hardware faults, software bugs, or human error from inside or outside the organization. Cloud computing provides higher SLAs and more secure infrastructure than self-built data centers, but when designing an application architecture we must still pay constant attention to system availability and potential "black swan" risks.

Systemic stability must be considered holistically across software architecture, the operations system, and organizational guarantees. At the architecture level, the Alibaba economy has rich experience with defensive design, rate limiting, degradation, fault isolation, and so on, and has contributed excellent open source projects such as Sentinel and ChaosBlade to the community.

In this article, I will highlight a few points worth further thought in the cloud-native era. My summary is: "Failures can and will happen, anytime, anywhere. Fail fast, fail small, fail often and recover quickly."

The first point is that "failures can and will happen," so we need to improve the replaceability of servers. There is a popular metaphor in the industry: "Pets vs. Cattle." We face an architectural choice: do we carefully tend the servers our applications run on, preventing downtime and rescuing them at any cost when problems occur (pets), or do we treat them as interchangeable and simply replace failed instances (cattle)? Cloud-native architecture suggests allowing failure to happen: every server and every component should be able to fail without affecting the system, with self-healing and replaceability built in. The basis of this design principle is decoupling application configuration and persistent state from the specific operating environment. Kubernetes' automated operations make server replaceability much easier.

Next is "fail fast, fail small, recover quickly." Fail fast is a counter-intuitive design principle. The philosophy behind it is that since failures cannot be avoided, the earlier a problem is exposed, the easier it is for the application to recover, and the fewer problems reach production. After adopting a fail-fast strategy, our focus shifts from exhaustively preventing problems to discovering and handling failures quickly and gracefully. (As long as I run fast enough, the failure can't catch up with me. :-) ) In the R&D process, integration testing should surface application problems as early as possible. At the application level, patterns such as Circuit Breaker can prevent a local failure in a dependent service from causing global problems. Furthermore, with Kubernetes health monitoring and observability, plus the circuit-breaking functions of a service mesh, fault detection, traffic switching, and fast self-healing can be externalized from the application implementation and guaranteed by the platform. The essence of fail small is to control a failure's blast radius; this principle requires continuous attention in both architecture design and service design.
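As a minimal sketch of the Circuit Breaker pattern mentioned above (the thresholds, timings, and error handling are illustrative, not a production implementation):

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: after `max_failures` consecutive
    failures the circuit opens and further calls fail fast for
    `reset_after` seconds, instead of piling requests onto a
    broken dependency."""
    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # any success closes the circuit again
        return result
```

Once the circuit is open, callers get an immediate error rather than a slow timeout, which is exactly the fail-fast behavior described above; after the cooldown, a single trial call decides whether to close the circuit.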

The last is "fail often." Chaos engineering is the practice of periodically injecting failure variables into the production environment to verify that the system withstands unexpected failures effectively. Netflix introduced chaos engineering to address the stability challenges of its microservice architecture, and many Internet companies have since adopted it widely. The cloud-native era offers new methods as well: Kubernetes lets us easily inject faults and kill pods to simulate application failure and self-healing, and with a service mesh we can inject more complex faults into traffic between services. For example, Istio can simulate failure scenarios such as slow responses and failed service calls, helping us verify the coupling between services and improve system stability.
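For example, an Istio VirtualService that injects faults into traffic for a service might look like this; the host name and percentages are illustrative:

```yaml
# Hypothetical VirtualService: delays 10% of requests to "reviews"
# by 5 seconds and aborts another 10% with HTTP 500.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: reviews-fault
spec:
  hosts:
    - reviews
  http:
    - fault:
        delay:
          percentage:
            value: 10
          fixedDelay: 5s
        abort:
          percentage:
            value: 10
          httpStatus: 500
      route:
        - destination:
            host: reviews
```

Because the fault lives in mesh configuration rather than application code, it can be applied and removed without redeploying the services under test.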
