Troubleshoot Kubernetes in just 3 steps
Step 1: Understand
Step 2: Manage
The last step: Prevent
Not a single step
The Kubernetes ecosystem is full of tools for monitoring, observability, tracing, logging, and more, but it is often hard to understand how troubleshooting actually relates to these tools.
When a failure occurs, we need to work out where it happened, understand the problem we are facing, fix the immediate issue, and then fix the root cause. All of this becomes more complex as the system scales.
Software engineers working on modern, complex distributed systems often find that whenever a problem or failure occurs, they need to work out what caused it and who introduced it, and this is not an easy task. Even harder is figuring out what is going on behind the scenes and how to prevent it from happening again.
Generally, we think like this:
•What exactly happened?
•Which components are involved?
•What is related to the specific symptom we are trying to troubleshoot?
•How do we determine the root cause?
•Ultimately, how do we make sure this issue, or a similar one, does not happen again?
In this article, we simplify it into 3 steps:
1. Understand
2. Manage
3. Prevent
I'll dive into how to carry out these three steps well and how they help us troubleshoot Kubernetes. I will also review which ecosystem tools fit which step, so we can make better use of them.
Step 1: Understand
Not surprisingly, this is the key first step. Understanding the state of your system's resources usually tells you what happened, what went wrong, and what to do next.
To understand the cause of a failure, developers first analyze recent modifications to the system and the changes that might have caused it.
Of course, this is easier said than done. In complex distributed systems, especially Kubernetes-based ones, it means heavy use of kubectl: reading deployment logs, traces, and metrics; verifying pod health, resource limits, and service connections; checking for common pod errors; inspecting YAML configuration files; validating third-party tools and integrations; and more. A single line of code, or a single configuration change, could be what triggered the failure.
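As a sketch, the kubectl checks described above might look like the following. The namespace and resource names (`my-app`, `my-pod`, `my-deploy`) are placeholders, not from the article, and these commands assume a live cluster:

```shell
# Hypothetical pod/namespace names; run these against your own cluster.
kubectl get pods -n my-app                               # which pods are Pending/CrashLoopBackOff?
kubectl describe pod my-pod -n my-app                    # events, restart counts, resource limits
kubectl logs my-pod -n my-app --previous                 # logs from the last crashed container
kubectl get events -n my-app --sort-by=.lastTimestamp    # what just happened, in order
kubectl rollout history deployment/my-deploy -n my-app   # what changed recently?
```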
The picture below can help us narrow down the scope of the problem when troubleshooting the K8s system.
Image credit: Guide to Troubleshooting Kubernetes Deployments
Next, we look at events: what is actually happening in the system? Is the system overloaded? Was data lost? Is there a service interruption? And how does this relate to the initial change to the system?
Then we look at the metrics, dashboards, and data we have collected to build an understanding of the problem from those sources. Is more than one system behaving the same way? Is a shared dependency on one of the services affecting both systems? Finally, can we learn something from seemingly similar past incidents that sheds light on what we are seeing now?
For reference, here is a list of tools that can help you build a basic understanding of what is going on in your system:
Monitoring tools: Datadog, Dynatrace, Grafana Labs, New Relic
Observability and tracing tools: Lightstep, Honeycomb
Real-time debugging tools: OzCode, Rookout
Logging tools: Splunk, LogDNA, Logz.io
Step 2: Manage
In today's microservice architectures, interdependent services are often managed by different teams. When a failure occurs, communication and collaboration between those teams is one of the keys to solving the problem.
Depending on the type of underlying problem, the actions you take may be as simple as restarting the system, or more drastic, such as rolling back a version or restoring a recent configuration until the underlying problem is better understood. Eventually, you may need proactive steps such as adding capacity, whether by raising memory limits or adding machines. None of this, however, should be something you try to figure out in real time. There are many tools today for taking these actions, from Jenkins to ArgoCD, to cloud providers' proprietary tools, and of course kubectl itself.
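As an illustration, the first-response actions described above map onto kubectl commands like these. The deployment and namespace names are hypothetical:

```shell
# Hypothetical names (my-deploy, my-app); each action is reversible.
kubectl rollout restart deployment/my-deploy -n my-app     # restart all pods
kubectl rollout undo deployment/my-deploy -n my-app        # roll back to the previous revision
kubectl scale deployment/my-deploy -n my-app --replicas=5  # add capacity
kubectl set resources deployment/my-deploy -n my-app \
  --limits=memory=1Gi                                      # raise the memory cap
```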
Once the underlying problem is better understood, remediation should not be an ad hoc, trial-and-error exercise, nor knowledge that lives only in the heads of the current team. Depending on the company's technology stack and the probable root cause, a tailored runbook should guide the handling of any given incident, with specific tasks and actions for each alert.
Every engineer on the team, senior or junior, can then use this runbook for real-time troubleshooting.
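A minimal sketch of the runbook idea: map each alert name to an ordered checklist of actions. The alert names and steps below are illustrative assumptions, not from the article; a real runbook would live in your incident tooling, not a script.

```shell
#!/bin/sh
# Sketch of a runbook lookup: map an alert name to an ordered checklist.
# Alert names and steps are hypothetical examples.
runbook() {
  case "$1" in
    PodCrashLoop)
      echo "1. kubectl logs <pod> --previous   # logs of the crashed container"
      echo "2. kubectl describe pod <pod>      # events, exit codes, OOM kills"
      echo "3. Correlate with recent deploys; roll back if they match"
      ;;
    HighMemoryUsage)
      echo "1. Check container memory limits vs. actual usage"
      echo "2. Raise the limit or scale out; file a follow-up for the leak"
      ;;
    *)
      echo "No runbook entry for '$1'; escalate to the owning team"
      ;;
  esac
}

runbook PodCrashLoop
```

The point of the structure is that every alert has a known, ordered response, so a junior engineer paged at 3 a.m. follows the same steps a senior one would.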
The toolkit at this stage will include some of the following:
Incident Management: PagerDuty, Kintaba
Project Management: Jira, Monday, Trello
CI/CD management: ArgoCD, Jenkins
The last step: Prevent
Prevention is probably the most important step for ensuring that similar incidents do not happen again. The way to prevent similar problems is to derive well-defined policies and rules from each incident: What actions should be taken during the "understand" phase? How do we most quickly identify issues and escalate them to the relevant teams?
How do we delegate responsibility to ensure frictionless communication and collaboration between teams? This includes full transparency of the tasks and operations at hand, as well as real-time updates on progress. What is the canonical order of tasks for each alert and event?
Once we figure out all of the above, we can start thinking about how to automate and coordinate these events and get as close as possible to the fabled "self-healing" system.
This step relies on tools that make systems more resilient and adaptable to change by continually pushing them to their limits. For example:
Chaos Engineering: Gremlin, Chaos Monkey, ChaosIQ
Auto Remediation: Shoreline, OpsGenie
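The "self-healing" idea mentioned above can be sketched as a probe-and-remediate loop. Everything here is a toy assumption: `check_health` stands in for a real probe (e.g. a liveness check) and `remediate` for a real fix (e.g. a rollout restart); a file flag simulates system state, and a real loop would back off between attempts.

```shell
#!/bin/sh
# Toy self-healing loop: probe health, remediate on failure, escalate
# after too many attempts. check_health/remediate are hypothetical
# stand-ins; a file flag simulates healthy/unhealthy state.
STATE_FILE="${STATE_FILE:-./healthy_flag}"

check_health() { [ -f "$STATE_FILE" ]; }  # pretend probe: file present = healthy
remediate()    { touch "$STATE_FILE"; }   # pretend fix: restore the flag

heal() {
  tries=0
  until check_health; do
    tries=$((tries + 1))
    if [ "$tries" -gt 3 ]; then
      echo "still unhealthy, escalate to on-call"
      return 1
    fi
    echo "unhealthy, remediation attempt $tries"
    remediate                             # real loop would back off here
  done
  echo "healthy after $tries remediation(s)"
}

rm -f "$STATE_FILE"
heal
```

The escalation branch matters as much as the happy path: automation should hand off to humans once its known fixes are exhausted, not retry forever.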
Not a single step
We believe that combining the three steps above is what distinguishes troubleshooting from monitoring, observability, tracing, and so on. Probably the most important part, however, is digging deep into systems and processes to prevent failures from happening again.
The reason these practices are still so widely used and so necessary is that, even after the huge strides we have made in "DevOps tools", we still often struggle to deal with real-time issues and failures.
Therefore, I recommend centralizing application and operational data in a single platform that lets team members truly understand their systems and, ultimately, understand how to act on alerts from complex systems. When we bring the best of development and operations together, we can resolve failures faster by collaborating better.
Original article: The Three Pillars of Kubernetes Troubleshooting - DZone Cloud
Copyright statement: The content of this article is contributed by Alibaba Cloud's real-name registered users; the copyright belongs to the original author.
Knowledge Base Team