IT Operations for the Changing World – Part 2

By Shantanu Kaushik

A Recap of Part 1

In Part 1 of this 2-part series, we discussed the importance of moving on from the traditional monitoring and metrics collection practices that drive application management and toward a smart, AIOps-based IT operations portfolio that is self-aware, robust, and application-centric.

This article discusses how pattern recognition, behavioral assessment, intelligent log management, and event management based on predictive patterns can help you lead a seamless IT operations management practice.

Behavioral Pattern Recognition

The prime indicators of a problem are service fluctuations and resource usage. However, these should be weighed only after recurring differences in service availability at different times have been identified. While service fluctuations and resource throttling are strong indicators, other factors need to be ruled out before reaching a firm conclusion about the actual issue.

Behavioral pattern recognition uses the collected metrics to filter out normal activity, which includes mild service fluctuations caused by external factors such as traffic load, so that real problems can be separated from these ups and downs. AIOps with machine learning can discount the recurring patterns behind service fluctuations by establishing policy recognition, resource parameters, and traffic load adjustment scenarios. This lets the system establish a baseline against which it can recognize anomaly patterns and rate them accordingly.

Data collected over several weeks can be used to identify service variations and other metrics that follow a pattern, so that critical alerts can be told apart from usual dips. Let’s use an example to clarify. If your resource utilization fluctuates every weekend starting on Friday, the spike can be treated as predictable behavior for the audience or user base your service targets, freeing the system to detect real issues. This saves a considerable amount of time for the O&M team.
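The weekend example can be sketched in a few lines of Python. This is a minimal illustration, not how any particular AIOps product works: the function names, the per-weekday bucketing, the z-score threshold of 3, and the utilization figures are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean, pstdev


def build_baseline(samples):
    """Group historical (weekday, value) samples by weekday (0=Monday .. 6=Sunday)
    and return {weekday: (mean, standard deviation)}."""
    buckets = defaultdict(list)
    for weekday, value in samples:
        buckets[weekday].append(value)
    return {day: (mean(vals), pstdev(vals)) for day, vals in buckets.items()}


def is_anomalous(baseline, weekday, value, threshold=3.0):
    """True when value lies more than `threshold` standard deviations
    from the mean recorded for that weekday."""
    if weekday not in baseline:
        return False  # no history for this slot, so nothing to compare against
    mu, sigma = baseline[weekday]
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold


# Hypothetical CPU utilization (%) history: Mondays around 40, Fridays around 80.
history = [(0, 38), (0, 40), (0, 42), (4, 78), (4, 80), (4, 82)]
baseline = build_baseline(history)

friday_flagged = is_anomalous(baseline, 4, 81)  # matches the usual Friday level
monday_flagged = is_anomalous(baseline, 0, 81)  # far above the usual Monday level
```

The same reading of 81% is ignored on a Friday and flagged on a Monday, which is the point of comparing against a per-period baseline rather than a single fixed threshold.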

Let’s use another example of a less frequently recurring instance. In this case, you get traffic spikes near quarterly or yearly financial account settlements, for example because you offer a financial service through your website or application. Resource usage will then form a pattern that your AI-enabled, machine learning-based IT Ops solution can learn and adjust to accordingly.

You will still receive alerts regarding resource fluctuations, but your APM service will let you overcome the dip easily. This kind of expected behavior can be managed without user interaction using Auto Scaling and Server Load Balancer (SLB). Behavioral pattern recognition frees your system from false alarms and unnecessary diagnostic operations, adding another layer of stability to your IT Ops practice.

Cause Analysis

Sifting through data to find the root cause of an issue can take a lot of time, as there are multiple factors to consider and different modules that could be affected. Cause analysis based on data analytics and AIOps can uncover the link between an issue and its cause at a much faster rate. Data analytics can factor in all the details that may impact the performance of an application or service. You can then deduce the root cause from the probabilities presented by your IT Ops practice.

When an issue is detected, the anomalies are flagged according to the time taken from the initial event and which modules and systems are affected. Here, behavioral pattern recognition comes into play and allows your APM solution to work backward from the last event to the first event that occurred, linking and recording all the modules that are affected by the issue.
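The backward walk described above can be sketched as follows. This is a simplified illustration under stated assumptions: the `trace_back` function, the `depends_on` map, and the module names are hypothetical, and a real APM solution would weigh many more signals than timestamps and a static dependency map.

```python
def trace_back(events, depends_on):
    """Walk backward from the latest anomaly to the earliest related one.

    events: list of (timestamp, module) anomalies.
    depends_on: {module: set of modules it calls}.
    Returns the linked events, earliest first.
    """
    ordered = sorted(events)
    if not ordered:
        return []
    chain = [ordered[-1]]          # start from the last event
    linked = {ordered[-1][1]}      # modules recorded as affected so far
    for ts, module in reversed(ordered[:-1]):
        # Link an earlier event if its module is already affected, or if an
        # affected module depends on it.
        if module in linked or any(module in depends_on.get(m, ()) for m in linked):
            chain.append((ts, module))
            linked.add(module)
    chain.reverse()
    return chain


# Hypothetical service map and anomaly timeline.
depends_on = {"web": {"api"}, "api": {"db"}, "batch": {"queue"}}
events = [(100, "db"), (105, "api"), (107, "queue"), (110, "web")]

chain = trace_back(events, depends_on)
root_candidate = chain[0][1]  # earliest linked module
```

Here the `queue` anomaly is left out because nothing in the affected chain depends on it, and `db` surfaces as the first linked event and therefore the most likely cause to inspect.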

The list of possible causes is presented to you. Your AIOps solution can automatically take care of the issue based on set policies, or you can adopt a manual approach and intervene.

Predicting Events

Behavioral pattern recognition and cause analysis can help you predict issues before they appear. As they say, any issue that occurs leaves a trail of bread crumbs in both directions: one path leads towards the issue and another leads away from it. Let me explain this analogy a bit more.

When a major issue occurs, the indicative patterns will emerge before the outage happens. Certain modules will malfunction and create new events. These events will reference more anomalies within the system and lead you to the root cause.

Now, let’s take a step back. We have already implemented the smart AIOps solution to record patterns and behavior to show us the possible cause of the issue. As new patterns emerge, the intelligent APM system will alert you of a possible outage based on pattern recognition and machine learning, saving you from a system-wide outage by allowing you to fix the issue before it happens.
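One simple way to picture this early warning is to compare the live event stream against a sequence of events that preceded an earlier outage. The sketch below is an assumption-laden illustration: the event names, the `precursor_progress` function, and the rule of alerting once two of three steps have been seen are all hypothetical, and a production system would learn such patterns rather than hard-code them.

```python
def precursor_progress(recent_events, precursor):
    """Count how many leading steps of `precursor` appear, in order,
    within recent_events (other events may occur in between)."""
    stream = iter(recent_events)
    matched = 0
    for step in precursor:
        # `step in stream` consumes the iterator up to the first match,
        # so later steps are only searched for after earlier ones.
        if step in stream:
            matched += 1
        else:
            break
    return matched


# Hypothetical sequence recorded before a past outage.
precursor = ["disk_latency_high", "db_timeout", "api_5xx"]

# Hypothetical live event stream.
recent = ["cache_miss", "disk_latency_high", "gc_pause", "db_timeout"]

progress = precursor_progress(recent, precursor)
should_warn = progress >= 2  # warn before the final step of the pattern appears
```

Because the warning fires when the pattern is only partly complete, the team gets a chance to act before the last step, the outage itself, occurs.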

Smarter Logging

As we previously discussed, machine learning and AI-based data analytics can influence a monitoring system to establish baselines. Here, a smarter logging system can be implemented to record your service events and the actions of the AI-based APM.

A highly detailed health report can be generated for your IT environment using this smart logging solution. It will cover areas that traditional monitoring and logging systems cannot.

Smart logging can also be applied to real-time pattern recognition, where the number of log entries for a previously recognized pattern may change. When such an entry starts to occur more frequently, it breaks the previously established pattern and indicates an underlying issue that might be escalating.
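A change in log frequency can be detected by comparing counts per pattern between a baseline window and the current window. The following is a minimal sketch, assuming log lines have already been reduced to pattern labels; the `rate_increases` function, the doubling factor, the minimum count, and the labels are hypothetical choices for illustration.

```python
from collections import Counter


def rate_increases(baseline_counts, current_counts, factor=2.0, min_count=5):
    """Return {pattern: (baseline, current)} for patterns whose current count
    is at least `min_count` and more than `factor` times the baseline count."""
    flagged = {}
    for pattern, count in current_counts.items():
        base = baseline_counts.get(pattern, 0)
        # Treat a missing or zero baseline as 1 so new patterns can be flagged too.
        if count >= min_count and count > factor * max(base, 1):
            flagged[pattern] = (base, count)
    return flagged


# Hypothetical pattern labels extracted from two equal-length time windows.
baseline_window = ["conn_reset"] * 3 + ["login_ok"] * 50
current_window = ["conn_reset"] * 12 + ["login_ok"] * 55

flagged = rate_increases(Counter(baseline_window), Counter(current_window))
```

In this example `conn_reset` quadruples and is flagged, while the small rise in `login_ok` stays within its usual range and is ignored.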

Wrapping Up

IT operations need to deal with more complex situations and scenarios in an ecosystem that is rapidly diversifying as digital transformation reaches the masses. In this context, artificial intelligence-based solutions can help enterprises understand patterns and make predictions based on real-life data.

Alibaba Cloud, with its use of machine learning within the product cycle, is leading the charge to enable a seamless and predictive service cycle for a smarter ITOM, one that lets you predict and prevent issues before your business takes a hit and rids your system of unwanted alerts.

A smart system like this allows you to diagnose issues quickly and reduce your Mean Time to Detection (MTTD) and Mean Time to Recovery (MTTR).
