Community Blog Data Insight: Landing of Descriptive Data Analysis in Log Scenarios

Data Insight: Landing of Descriptive Data Analysis in Log Scenarios

This article introduces the method of descriptive statistical analysis of data.

By Yunlei
Contributed by Alibaba Cloud Storage

With the rapid development of the Internet, logs generated by massive users are expanding rapidly every day. Data warehouses are like oil wells. We all know about mines, but digging out valuable resources from the mines has been the biggest problem perplexing mine owners. Without powerful analysis tools or data analysis capabilities, no matter how much data the warehouses can generate, it will all end up being a data grave. It takes a lot of effort to collect the data and pack it into the cold storage warehouse. There is no access until the disk space is full and the oldest data is cleaned up. Our decisions often depend on personal experience and intuition rather than the information based on data. We need powerful data analysis tools and professional data analysis experience to dig out useful information from the data to assist us in making decisions.

Alibaba Cloud Log Service collects logs from various isolated island systems and stores them on the cloud. It provides SQL, a common and powerful data analysis capability, allowing users to analyze tens of billions of data within seconds. The article series on data insights introduces a range of data analysis experiences to help our customers maximize the value of their data.

There are six categories of statistical analysis in data analysis:

  1. Descriptive Statistical Analysis: It tells us what happened.
  2. Inferential Statistical Analysis: It infers what happens to big data sets from small data sets.
  3. Predictive Statistical Analysis: It tells us what will happen in the future.
  4. Guiding Analysis: It tells us what to do next.
  5. Exploratory Data Analysis: It gives a preliminary impression of the data and tells us some key findings.
  6. Causal Analysis: It tells us why something happens.

Today, we introduce descriptive data analysis methods to help us have a general understanding of obtaining data. When we have a piece of data, and there is no clear purpose for analysis, we might as well use descriptive analysis first to see what information it can provide us.

Categories of Descriptive Data Analysis

1. Describe the Frequency of the Data

Understanding the frequency of events is critical, including the number and percentage of events. In the field of logs, we can view the frequency fluctuations of events in the time dimension, such as the change of PV per minute.

* | select date_trunc('minute',__time__) as t, count(1) as pv group by t order by t limit all

2. Measure the Trend of Data Concentration

Centralized trends tell us where the center point of the data is, which is measured using three averages: mean, median, and mode. Just like when we look at a geometric figure, we will find out where the geometric center or center of gravity is. If we know the approximate center of gravity, we know a large amount of data revolves around the center.

The method to calculate the mean and median in SQL is listed below:

* | select avg(key), approx_percentile(key,0.5)

The method to calculate the mode is listed below:

* | select key,count(1) as pv group by key order by pv desc limit 1

3. Dispersion Calculation

After we know where the center of the data is, we need to know how other data is distributed around the center, namely whether it is divergent or aggregated from the center. The indicators for evaluating dispersion are variance (standard deviation), coefficient of variation, and absolute median deviation.

The standard deviation is used to evaluate the distance of each number from the mean. But the standard deviation is an absolute value, and it is difficult to determine whether a certain standard deviation is high or low. Therefore, we need a proportional coefficient of variation to make a more intuitive judgment. The coefficient of variation is the standard deviation divided by the mean. The absolute median deviation refers to the distance between each number and the median, similar to the standard deviation, so a coefficient is needed to make the judgment. The median can be divided by the absolute median deviation.

SQL can be used like this:

* | select stddev( key) # Standard Deviation
* | select stddev( key )/ avg (key) # Coefficient of Variation
* | select sum (diff)/ count (1) as mad from (select abs (key- approx_percentile( key, 0.5) over ()) as diff from log) # Calculate the absolute median deviation.
* | select sum (abs (key -p50))/ count (1 )/p50 from (select key, approx_percentile( key, 0.5) over () as p50 from log) # Absolute Median Deviation Coefficient

4. Location Assessment

The mean and median are estimates of the total. When we assess the position of a single number, the best way is to use quantiles or quartiles.

Quantile and quartile calculation method:

* | select approx_percentile(key,0.1) as p10, approx_percentile(key,0.9)as p90 

5. Assess the Relationship between Multiple Columns

The preceding indicators are used to assess individual figures. If it is the analysis of more than two variables, we will pay attention to the relationship between the two variables. We can calculate the correlation coefficient of two columns on SQL. We can take scatter diagrams or cross tables with visualization.

* | select corr(key1, key2)# Calculate the correlation coefficient.


This article introduces the method of descriptive statistical analysis of data.

Disclaimer: The views expressed herein are for reference only and don't necessarily represent the official views of Alibaba Cloud.

0 1 0
Share on

Alibaba Cloud Community

606 posts | 102 followers

You may also like


Alibaba Cloud Community

606 posts | 102 followers

Related Products