Developer Toolchain for Platform Engineering
Since the birth of Kubernetes, cloud-native technologies such as DevOps, containerization, observability, microservices, and serverless have driven a new round of upgrades in application architecture. Interestingly, unlike previous technology iterations, what began as routine practice within one technical circle has, against the backdrop of digital transformation across industries, the lasting impact of the pandemic, and the technology-independence policies of some traditional sectors, become something of a feast in which nearly every IT practitioner participates. However, the steep learning curve, complex technical system, and ephemeral infrastructure have brought many challenges to the development, construction, delivery, and operation of enterprise information systems. These challenges have deepened the contradiction between an ever-changing technology stack and developers who are used to focusing on front-line business development, and that contradiction directly gave rise to the recent concept of platform engineering.
This concept seizes on that conflict and proposes an "internal developer self-service platform": enterprises should build a platform offering a set of self-service tools that help developers solve the technical problems they encounter at each stage. This statement scratched an itch for many developers, which is one reason the concept suddenly became popular. Behind the concept lies a more direct question: what should go into this toolbox?
Lifting the veil on the problem
As early as 2018, the EDAS product and engineering team visited a customer with an R&D team of about 100 people. At the time, the customer was splitting its system into microservices and migrating to the cloud, and had run into some new problems:
• Due to dependency problems, the complete environment could not be started locally, making development difficult.
• The invocation relationships in the cloud environment were complicated and could not be debugged.
The customer wanted to start a specific service instance locally, with cloud traffic routable to the local machine and local traffic routable to the cloud, so that local and cloud microservices could be debugged together. We had no ready answer for these problems, so after returning we quickly began research and analysis, and slowly uncovered the iceberg hidden under the water.
The customer's request is very simple: run a microservice locally and have it call, and be called by, the cloud microservices, as shown below:
After migrating to the cloud, the customer's gateways, applications, message queues, caches, databases, and other components were all deployed inside the cloud network, reachable locally only through a bastion host. In this situation a local node cannot even start normally (it cannot connect to key components such as the database), let alone call cloud services or be called by them. After adopting a cloud-native architecture, the customer could no longer go back to the simple, efficient local development workflow it used to have.
Is this problem unique to this customer? No: over the past few years we have talked with many customers, and all of them run into it. It is not that no solution exists at all. The most direct approach is to connect the local office network with the cloud network over a private leased line to achieve network interoperability. In practice, however, this method has three major flaws, which is why few customers adopt it:
• High cost: building a leased-line network is a sizeable investment that is hard to justify against the benefit.
• Low security: connecting the local network to the cloud destabilizes both the local office network and the cloud production network; it essentially enlarges the security domain and expands the attack surface.
• Complicated O&M: network O&M is complex in itself, and keeping local and cloud networks consistent under a highly elastic cloud-native architecture is a nightmare for many network engineers. The two networks must be carefully planned so that their address segments do not conflict, and bidirectional routes and security policies must be managed by hand, which is laborious and error-prone.
Facing these problems, some companies take a compromise approach: they pick a machine in the cloud as a VPN server and build a VPN link from the local network to the cloud. This still requires maintaining network routes, and the VPN itself is a trade-off: OpenVPN is cheap but unstable, while dedicated VPN appliances perform well but are expensive. You cannot have both.
After recognizing these problems, we set out on a long road of exploration; as the poem goes, "the road ahead is long and far."
Device-Cloud Interconnection: A First Answer
At the beginning, we set two goals: first, to connect local and cloud links in both directions; second, to avoid drastic changes to the network architecture.
After three months of heads-down development, we released the tool at the end of 2018. It supports two-way communication, ships as a plug-in that works out of the box, and supports Windows and macOS. We named it Device-Cloud Interconnection, and its overall architecture is as follows:
When starting the microservice, the device-cloud interconnection plug-in also starts a sidecar process, the channel service. The channel service receives traffic from the local microservice and forwards it through the bastion host to the target microservice in the cloud. The three core points are explained below.
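To make the channel service's job concrete, here is a minimal sketch of a TCP forwarder in plain Java: it accepts one local connection and pipes bytes in both directions to a target address. This is illustrative only; the real channel service speaks SOCKS and tunnels through the bastion host over SSH, and the listener and target here are whatever the caller passes in.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.ServerSocket;
import java.net.Socket;

// Minimal sketch of what a forwarding sidecar does: accept one local
// connection and relay bytes to a target, in both directions.
public class ChannelSketch {

    // Forward exactly one connection accepted on `listener` to the target, then return.
    public static void forwardOnce(ServerSocket listener, String targetHost, int targetPort)
            throws IOException {
        try (ServerSocket server = listener;
             Socket client = server.accept();
             Socket target = new Socket(targetHost, targetPort)) {
            Thread up = pipe(client, target);
            Thread down = pipe(target, client);
            up.start();
            down.start();
            try {
                up.join();
                down.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    // Copy bytes from one socket to the other until EOF, then propagate the EOF.
    private static Thread pipe(Socket from, Socket to) {
        return new Thread(() -> {
            try {
                InputStream in = from.getInputStream();
                OutputStream out = to.getOutputStream();
                byte[] buf = new byte[8192];
                for (int n; (n = in.read(buf)) != -1; ) {
                    out.write(buf, 0, n);
                    out.flush();
                }
                to.shutdownOutput(); // let the other side see end-of-stream
            } catch (IOException ignored) {
                // peer closed the connection; nothing more to relay
            }
        });
    }
}
```

A caller might run `forwardOnce(new ServerSocket(8080), "bastion.example.com", 22)` on a background thread; the port and host here are hypothetical.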
Local microservice calls are forwarded to the sidecar
We use Java's native traffic-proxy mechanism: by injecting startup parameters, local microservice traffic can be forwarded to the channel-service sidecar over the SOCKS protocol. See Java Networking and Proxies for the specific parameters.
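As a concrete illustration of those parameters, the standard JVM properties are `socksProxyHost` and `socksProxyPort`, settable on the command line or programmatically; the port below stands in for a hypothetical channel-service port.

```java
// Route the JVM's blocking (BIO) socket traffic through a SOCKS proxy.
// Equivalent to launching with:
//   java -DsocksProxyHost=127.0.0.1 -DsocksProxyPort=8888 -jar app.jar
// (8888 is a hypothetical channel-service port, not a documented default.)
public class ProxySettings {
    public static void enableSocksProxy(String host, int port) {
        System.setProperty("socksProxyHost", host);
        System.setProperty("socksProxyPort", Integer.toString(port));
    }
}
```

Sockets opened through `java.net.Socket` after these properties are set go through the proxy; this only covers BIO traffic, which is exactly the limitation discussed later in this article.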
The sidecar forwards calls to cloud microservices
SSH itself can forward data, acting as either a forward or a reverse proxy. The SSH protocol is divided into three layers, from bottom to top: the transport layer, the authentication layer, and the connection layer:
• Transport Layer Protocol: responsible for establishing the secure connection channel; the cornerstone of SSH security.
• User Authentication Protocol: responsible for authenticating the remote client.
• Connection Protocol: multiplexes the encrypted channel into logical channels and handles information exchange, supporting remote command execution, data forwarding, and other functions.
We use the SSH connection protocol to let the channel-service sidecar forward calls to cloud microservices. Although SSH is somewhat complicated under the hood, it is quite simple to use from above, and mainstream programming languages generally have ready-made libraries for it.
Cloud microservice calls are forwarded to the bastion host
Here we use the microservice registration mechanism itself: we register the bastion host's IP and a dedicated port in the registry as the address of the local microservice. When a cloud microservice makes a call, it discovers the bastion host through the registry and sends the request there; combined with SSH data forwarding, the request finds its way back to the local microservice.
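The address-swap trick can be sketched with a toy in-memory registry. The interface below is hypothetical and exists only for illustration; in a real setup the registry would be Nacos, Eureka, or similar.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

// Toy service registry illustrating the address swap: the local service is
// registered under its service name, but with the bastion host's address,
// so cloud callers "discover" the bastion host instead of the laptop.
public class ToyRegistry {
    private final Map<String, List<String>> endpoints = new ConcurrentHashMap<>();

    public void register(String service, String hostPort) {
        endpoints.computeIfAbsent(service, k -> new CopyOnWriteArrayList<>()).add(hostPort);
    }

    public List<String> discover(String service) {
        return endpoints.getOrDefault(service, List.of());
    }
}
```

Registering `register("user-service", "bastion-ip:12201")` makes every cloud caller that looks up `user-service` route its request to the bastion host, where SSH forwarding carries it the rest of the way; both names are illustrative.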
Searching for him a thousand times in the crowd
After the device-cloud interconnection tool launched it was welcomed by many customers, but they ran into new problems while using it:
• NIO traffic proxying: the proxy parameters in Java Networking and Proxies only apply to BIO traffic; NIO frameworks ignore them. The impact is very large, because microservice applications almost universally use Java NIO, directly or indirectly. To give a concrete example, Netty is built on Java NIO, and many popular middleware frameworks use Netty as their transport layer.
• Domain name resolution: a local microservice resolves domain names before connecting, and these lookups also do not go through the Java traffic proxy. In other words, if a domain name can only be resolved in the cloud, the whole call fails. For example, a Kubernetes service domain name can only be resolved by the DNS inside the cluster, not locally, so such a service cannot be called from a local machine.
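The second limitation is easy to reproduce: `InetAddress.getByName` always uses the local resolver and ignores the `socksProxy*` settings entirely, so a cluster-only name fails before any connection is even attempted. A small check, with a hypothetical service name in the usage note:

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class DnsCheck {
    // Returns true if the name resolves with the *local* resolver.
    // This lookup bypasses socksProxyHost/socksProxyPort completely,
    // which is exactly why cluster-internal names fail on a developer machine.
    public static boolean resolvesLocally(String name) {
        try {
            InetAddress.getByName(name);
            return true;
        } catch (UnknownHostException e) {
            return false;
        }
    }
}
```

On a typical developer machine, `resolvesLocally("my-service.default.svc.cluster.local")` (an illustrative in-cluster name) returns false, even though the same name resolves fine from a node inside the cluster.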
For these problems the industry usually turns to containers (Telepresence, for example): the application runs inside a container, and iptables intercepts and forwards all traffic leaving that container. This approach works, and we support it too. The overall architecture is as follows:
The link is similar to before, except that container technology is introduced locally. In the figure above, we use docker network connect to make the application container and the sidecar container share a network stack, so that the channel service can intercept and forward all traffic from the local microservice process (including NIO and DNS traffic) via iptables.
The plan was beautiful, but reality was harsh: many customers simply could not use it, because local development now depended on a heavyweight container runtime. There are two problems here: "heavy" and "dependency". "Heavy", because a development machine has little computing power to spare; Chrome, the IDE, and chat software already consume most of its resources, and repeatedly restarting containers on top of that often brings the machine to its knees. "Dependency", because mainstream desktop operating systems such as Windows and macOS do not ship with container software like Docker, so developers must install it themselves, and due to network bandwidth and other issues the installation itself runs into many problems.
Besides containers, are there other ways to proxy application traffic? There are, but their limitations are severe. The tool torsocks, for example, can intercept and forward TCP and DNS traffic at the process level, but it does not support Windows. macOS is also problematic: torsocks relies on the LD_PRELOAD/DYLD_INSERT_LIBRARIES mechanism to rewrite system calls, and macOS System Integrity Protection prevents certain system calls from being interposed, so not all traffic can be intercepted.
Isn't there a better solution?
Yet there he was, where the lights were dim
To recap the problem we faced: Java's native traffic proxy does not support NIO or DNS traffic forwarding. There is one crucial word in that sentence: Java. The interception solutions in the industry and the open-source community generally pursue language-agnostic generality, which drags in container dependencies and loses as much as it gains.
If the pursuit of generality causes so many problems, is there a better solution focused on the Java language alone? Yes. Through bytecode instrumentation with a Java agent, the runtime behavior of an application can be modified dynamically without changing a line of application code. Link-tracing tools such as Pinpoint and SkyWalking inject a bytecode agent to implement non-intrusive instrumentation, and the popular diagnostic tool Arthas uses the same bytecode technology to trace and profile Java processes.
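The mechanism these tools share is `java.lang.instrument`: a `premain` entry point receives an `Instrumentation` handle and registers a `ClassFileTransformer` that may rewrite each class as it is loaded. A minimal no-op skeleton follows; the agent jar name in the comment is hypothetical.

```java
import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;

public class TrafficAgent {
    // Entry point the JVM invokes before main() when started with
    //   java -javaagent:traffic-agent.jar -jar app.jar
    public static void premain(String args, Instrumentation inst) {
        inst.addTransformer(makeTransformer());
    }

    static ClassFileTransformer makeTransformer() {
        return new ClassFileTransformer() {
            @Override
            public byte[] transform(ClassLoader loader, String className,
                                    Class<?> classBeingRedefined,
                                    ProtectionDomain domain, byte[] classfileBuffer) {
                // A real traffic agent would rewrite networking classes here,
                // typically with a bytecode library such as ASM or Byte Buddy.
                // Returning null tells the JVM to keep the class unchanged.
                return null;
            }
        };
    }
}
```

Because the agent is injected at JVM startup, the application code and its frameworks remain completely untouched, which is what makes the approach non-intrusive.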
So we began exploring a bytecode-based solution. The process was difficult but interesting. At the microservice-framework level, we had to adapt to mainstream frameworks such as Spring Cloud, Dubbo, HSF, and even gRPC; at the component level, we had to support microservices, databases, message queues, task scheduling, caches, and more; across JDK versions, we had to stay compatible from JDK 1.7 through JDK 18... Throughout this process we kept iterating and improving, and continuous positive feedback from customers made the tool better and better.
After a year of polishing, we finally arrived at a bytecode-based Java traffic proxy. The architecture is as follows:
This solution introduces only a bytecode agent into the application, with no external dependencies. The agent is far lighter than a container; the device-cloud interconnection plug-in pulls and injects it automatically during startup, and the upper layers are completely unaware of it. With this, the traffic-proxy problem for local microservice applications was finally solved.
Alone on the high tower, gazing to the end of the horizon
For the past few years we had kept our heads down, discovering and solving problems, while the cloud-native community evolved alongside us. Back in 2018, the Kubernetes community published an article called Developing on Kubernetes, which gives a very good summary of the different development models:
Here remote means the cloud and local means the developer's machine; cluster is the K8s cluster and dev is the development environment. Depending on where dev and cluster sit, development can be divided into four modes:
• pure off-line: both the K8s cluster and the development environment are local. K3s, Minikube, and EDAS Core (more on this later) all belong to this mode; you start a lightweight development cluster directly on your machine.
• proxied: the K8s cluster runs in the cloud, the development environment is local, and the two are connected through a proxy. Typical representatives are the community's Telepresence and EDAS device-cloud interconnection; Telepresence is slightly better at multi-language generality, while EDAS device-cloud interconnection is easier to use for Java.
• live: the K8s cluster runs in the cloud, the development environment is local, and local code updates the cloud application through CI/CD or similar means. This is the most common mode and generally also the least efficient: deploying through CI/CD means every code change needs a long build-and-deploy cycle before it takes effect in the cluster, which is very time-consuming when code must be modified repeatedly for debugging.
• remote: both the K8s cluster and the development environment are in the cloud. Cloud IDEs are the typical example: code and runtime both live in the cloud, edited through a browser or a lightweight local client. In practice this approach has not won over most developers; the local IDE experience is still better, and local development remains the mainstream.
In the proxied mode we had polished device-cloud interconnection quite well, but it cannot solve every problem. One scenario is common: development and debugging go fine locally, but problems appear at deployment. The root cause is that the local runtime environment differs from the environment in the cloud cluster, and this inconsistency causes all kinds of issues. For example, the fact that a Java process needing 2c4g runs normally on your laptop does not mean the cloud cluster can allocate a 2c4g Pod for it, because the cluster may simply have no spare resources at that moment.
There are too many such problems to enumerate one by one, and they pushed us to think further about how to solve them. After half a year of research, exploration, and development, we built the Cloud Native Development Kit (CNKit for short), which addresses these problems by providing development, debugging, and diagnostic capabilities for the cloud-native architecture.
Cloud Native Development Kit
Our approach is simple: a problem caused by environment inconsistency can only be solved by going back to the environment itself. Starting an application in a cloud-native environment looks simple but actually involves many steps: the application must pass K8s scheduling and Pod initialization before the service can start and run. Along the way you may encounter the following problems:
For these problems we distilled a set of practices: use CNKit to quickly replicate a Pod, then iterate on it, deploying, debugging, and diagnosing in place. The overall functionality looks like this:
By integrating with EDAS full-link flow control, CNKit lets only debugging traffic that matches specific rules enter the replicated Pod, without affecting other normal traffic. On the replicated Pod you can use CNKit's out-of-the-box deployment, debugging, and diagnostic capabilities, as well as an audited command terminal. Replication, deployment, debugging, and diagnosis are described in turn below.
The replicated Pod is, in effect, a temporary workspace of our own: through the browser or the IDE we can repeatedly deploy application packages to it and debug and diagnose there. Pod replication currently supports the following configuration:
Specifically:
• Startup command: the Pod's startup command. By default the original image's command is used, but iterative deployment and debugging require a custom command. In production, the image's startup command usually runs the application as process 1, so whenever the application exits or restarts, the Pod is released with it. A special startup command is therefore needed to keep the Pod alive when the application exits.
• Replication mode: supports replicating from an existing Pod or creating from a Deployment's spec.
• Target node: the cluster node the Pod runs on. By default the Pod is placed by the K8s scheduler, but you can also pin it to a specific node.
• Pod logs: print Pod logs to stdout or redirect them to a file.
• Flow control: with full-link flow control, only requests matching specific rules enter this Pod.
• Diagnostic options: run tcpdump automatically at application startup to record traffic, enable JVM exception logging with one click, and remove the Liveness probe.
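The "startup command" and "remove Liveness probe" options above usually come down to a Pod spec along the following lines. This is a hand-written illustrative equivalent, with a hypothetical name and image; CNKit generates the actual spec for you.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: demo-app-debug-copy        # hypothetical name for the replicated Pod
spec:
  containers:
    - name: app
      image: registry.example.com/demo-app:1.0.0   # hypothetical image
      # Keep PID 1 alive so the Pod survives application exits and restarts;
      # the application itself is then (re)started manually inside the Pod.
      command: ["/bin/sh", "-c", "tail -f /dev/null"]
      # livenessProbe intentionally omitted so that a crashing application
      # does not get the container killed while it is being debugged.
```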
These options come from years of EDAS customer-support experience; they may not look glamorous, but they are very useful. Take "remove Liveness probe" as an example: if an exception occurs during application startup, the Liveness probe fails, and K8s kills and restarts the Pod container, trapping you in an endless loop of Liveness failure, container kill, and restart. With the Liveness probe removed, you can get into the Pod container and troubleshoot.
There is one particularly interesting option here: full-link flow control. Full-link flow control is the killer feature of microservice governance on EDAS; it can direct traffic along the entire microservice call chain wherever you want it to go. For a replicated Pod, we usually want only our own requests to enter it, without disturbing anyone else's calls. To do that, you simply tick a box on the console to join a specific flow-control group; it is very easy to use.
Remember the original problem? Under a cloud-native architecture, every deployment has to go through CI/CD, EDAS deployment, and K8s scheduling before the application comes up; the process is long and painful. Worse, when troubleshooting we often need to install ad-hoc tools inside the Pod, and every freshly pulled application Pod means reinstalling those tools.
For this we recommend a "replicate once, use many times" strategy. As described above, replicating a Pod gives us our own temporary workspace (in essence, a Pod), to which we can deploy application packages directly from the browser or IDE and then debug and diagnose. The development flows based on CI/CD and on CNKit differ considerably, as follows:
The CI/CD deployment path suits the production environment: a standard process safeguards the stability of online business. But it is also lengthy, offering little advantage in the development phase while reducing development efficiency. Here the CNKit flow is a good complement: replicate the Pod once, then continuously update the application package and debug code from the browser or IDE.
Debugging (specifically, remote debugging) is a vital part of application development; without it, development efficiency drops sharply. Under a cloud-native architecture debugging is not so simple, but it can always be made to work. Beyond simplifying the debugging process, CNKit also brings flow control to it:
• Simplified process: click "Enable Debugging" on the page, and CNKit restarts the application in the Pod with the debugging port open; then connect from your local IDE through CNKit with one click and start breakpoint debugging.
• Flow control: by integrating EDAS full-link flow control, CNKit ensures that only specific requests enter the replicated Pod and trigger breakpoint logic.
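For reference, the "debugging port" involved here is standard Java remote debugging (JDWP). Outside of CNKit, the equivalent manual step is restarting the JVM with the debug agent enabled; port 5005 below is illustrative:

```
java -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=*:5005 -jar app.jar
```

On JDK 8 and earlier the address is written as `address=5005`; the `*:` prefix, which binds all interfaces, was introduced in JDK 9.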
With these two capabilities, code debugging becomes very convenient. The figure below shows a simple example: you are developing the product center, with the trading center upstream and the inventory center downstream. The overall debugging link through CNKit looks like this:
The product center marked as the development version is the replicated Pod. In the cloud, full-link flow control forwards specific traffic to that Pod, while the developer connects locally to the Pod's debugging port through the CNKit proxy.
In fact, when several people develop a service in parallel, each can have a development-version node of their own; with different flow-control rules, parallel development proceeds without interference.
We divide problem diagnosis into four stages: K8s scheduling, application startup, application running, and application going offline. Different stages call for different diagnostic methods:
During K8s scheduling, we mainly watch the events it generates; when scheduling goes wrong, K8s reports the reason. The figure below shows K8s events during normal scheduling:
When scheduling fails (for example, due to insufficient resources), K8s generates corresponding events:
During application startup, besides K8s events we can also inspect Pod logs, which are produced by the application itself and contain more detail. The figure below shows a sample of a Pod's standard-output log:
The startup phase also involves a lot of network access, and startup failures are often caused by abnormal network requests, so CNKit can automatically run tcpdump before startup to record them. CNKit supports both text and pcap formats; the figure below shows automatically captured tcpdump data in text format:
Finally, during the running and offline stages you can still use K8s events, Pod logs, and tcpdump, and you can also launch the integrated Arthas tool with one click: CNKit installs and runs Arthas for you automatically. The overall interaction looks like this:
So much for CNKit's replication, deployment, debugging, and diagnosis. Beyond these capabilities, CNKit also hides a few easter eggs, such as the audited WebShell; we leave those for readers to discover on their own.
Besides device-cloud interconnection and CNKit, we have also opened up EDAS Core. By the Developing on Kubernetes classification above, device-cloud interconnection belongs to the proxied mode, CNKit to the live mode, and EDAS Core to the pure off-line mode.
EDAS itself is a paid commercial product: a cloud-native PaaS platform for application hosting and microservice governance that provides full-stack capabilities for application development, deployment, monitoring, and O&M. EDAS Core, by contrast, is a free, lightweight kernel edition of EDAS: it supports the same core capabilities but strips out the commercial features, offers no service SLA or real-time O&M support, and is intended for the development phase.
EDAS Core needs as little as 4 cores and 8 GB of machine resources, so we can run a fully offline EDAS platform on a laptop and develop microservices on it. The overall architecture of EDAS Core is as follows:
Here is a brief explanation:
• EDAS Core: includes the EDAS application-hosting capabilities, supports Nacos service registration and discovery and MinIO persistent storage, runs on Kind, K3s, Docker Desktop, and standard K8s clusters, and occupies only 4c8g.
• Developer tools: supports the Jenkins plug-in for continuous deployment, Terraform for infrastructure maintenance, and ACT for local development, and is compatible with the EDAS open API and SDK.
• Installation media: supports installing EDAS Core via Helm, OSS, and ADP.
• K8s cluster: the cluster managed by EDAS Core, on which the microservice applications run (with the service-governance OneAgent injected automatically).
• Service integration: horizontally, EDAS Core supports one-click conversion to and from commercial EDAS applications, integrates tracing products such as ARMS and SkyWalking, and supports image hosting on ACR.
The following is the EDAS Core console in action (long-time EDAS users will find the interface familiar):
EDAS Core is currently in an invitational beta; if you would like to try it, feel free to open a ticket with the EDAS product team on Alibaba Cloud. :)
Both cloud-native architecture and microservice development are hot technical fields, yet the combined topic of "microservice development under a cloud-native architecture" is rarely addressed by domestic vendors. As a pioneer in microservice hosting, EDAS began supporting the cloud-native architecture very early and has kept a close eye on how microservice development evolves under the new architecture.
From the earliest device-cloud interconnection to the recently launched Cloud Native Development Kit (CNKit) and EDAS Core, EDAS has kept thinking, from the developer's perspective, about the new problems brought by the evolution of cloud-native technology, and continues to deliver tools and products that solve them.
Knowledge Base Team