By Yunlei
Contributed by Alibaba Cloud Storage
With the rapid development of the Internet, user-generated logs grow by massive volumes every day. A data warehouse is like a mine: everyone knows the ore is there, but extracting valuable resources from it is the problem that most troubles its owner. Without powerful analysis tools or data analysis skills, a warehouse becomes a data grave no matter how much data it holds: we spend great effort collecting data and packing it into cold storage, never access it, and finally purge the oldest records when the disks fill up. Our decisions then rest on personal experience and intuition rather than on information derived from data. We need powerful data analysis tools and professional data analysis experience to dig useful information out of the data and support our decision-making.
Alibaba Cloud Log Service collects logs from isolated silo systems and stores them on the cloud. It provides SQL, a common and powerful data analysis capability, allowing users to analyze tens of billions of log entries within seconds. This article series on data insights introduces a range of data analysis experiences to help our customers maximize the value of their data.
Statistical analysis is commonly divided into six categories. This article introduces descriptive analysis, which helps us form a general picture of a dataset. When we have a piece of data and no clear analysis goal yet, we might as well start with descriptive analysis to see what information it can offer.
Understanding the frequency of events is critical, including the count and percentage of events. In log analysis, we can view how event frequency fluctuates along the time dimension, such as the change in PV (page views) per minute.
* | select date_trunc('minute',__time__) as t, count(1) as pv group by t order by t limit all
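Outside Log Service, the same per-minute bucketing can be sketched in plain Python. This is a minimal illustration with made-up sample timestamps standing in for `__time__`:

```python
from collections import Counter
from datetime import datetime

# Hypothetical request timestamps (stand-ins for the __time__ field)
timestamps = [
    datetime(2023, 5, 1, 10, 0, 5),
    datetime(2023, 5, 1, 10, 0, 42),
    datetime(2023, 5, 1, 10, 1, 7),
    datetime(2023, 5, 1, 10, 1, 30),
    datetime(2023, 5, 1, 10, 1, 59),
]

# Truncate each timestamp to the minute, mirroring date_trunc('minute', __time__),
# then count occurrences per bucket, mirroring count(1) ... group by t
pv_per_minute = Counter(ts.replace(second=0, microsecond=0) for ts in timestamps)

for minute, pv in sorted(pv_per_minute.items()):
    print(minute, pv)
```

Sorting the buckets by key plays the role of `order by t` in the query above.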
Central tendency tells us where the center point of the data lies, measured with three kinds of average: the mean, the median, and the mode. Just as with a geometric figure we look for its geometric center or center of gravity, once we know the approximate center of gravity of the data, we know that a large amount of the data revolves around it.
The method to calculate the mean and median in SQL is listed below:
* | select avg(key), approx_percentile(key,0.5)
The method to calculate the mode is listed below:
* | select key,count(1) as pv group by key order by pv desc limit 1
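The three averages can be cross-checked with Python's standard `statistics` module. The latency-like sample values below are hypothetical:

```python
import statistics

# Hypothetical samples for a field named key (e.g. latency in ms);
# the outlier 100 pulls the mean well above the median
key = [10, 12, 12, 13, 14, 15, 100]

mean = statistics.mean(key)      # like avg(key)
median = statistics.median(key)  # like approx_percentile(key, 0.5)
mode = statistics.mode(key)      # the most frequent value

print(mean, median, mode)
```

Note how the single outlier inflates the mean while the median and mode stay put, which is why it pays to look at all three averages rather than the mean alone.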
After we know where the center of the data is, we need to know how the rest of the data is distributed around it, namely whether it diverges from or aggregates around the center. The indicators for evaluating dispersion are the variance (or standard deviation), the coefficient of variation, and the absolute median deviation.
The standard deviation evaluates how far each number lies from the mean. But the standard deviation is an absolute value, and it is hard to judge in isolation whether a given standard deviation is high or low. We therefore use a ratio, the coefficient of variation, for a more intuitive judgment: it is the standard deviation divided by the mean. The absolute median deviation measures the distance between each number and the median; like the standard deviation, it needs a coefficient for judgment, obtained by dividing the absolute median deviation by the median.
SQL can be used like this:
* | select stddev(key) # Standard deviation
* | select stddev(key) / avg(key) # Coefficient of variation
* | select sum(diff) / count(1) as mad from (select abs(key - approx_percentile(key, 0.5) over ()) as diff from log) # Absolute median deviation
* | select sum(abs(key - p50)) / count(1) / p50 from (select key, approx_percentile(key, 0.5) over () as p50 from log) # Absolute median deviation coefficient
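The same dispersion indicators can be sketched in Python on hypothetical sample data. As in the SQL above, the "absolute median deviation" here is computed as the mean of absolute deviations from the median:

```python
import statistics

# Hypothetical samples for a field named key
key = [10, 12, 12, 13, 14, 15, 100]

mean = statistics.mean(key)
median = statistics.median(key)

stddev = statistics.stdev(key)  # sample standard deviation, like stddev(key)
cv = stddev / mean              # coefficient of variation

# Mean of absolute deviations from the median, as in the SQL above
mad = sum(abs(x - median) for x in key) / len(key)
mad_coef = mad / median         # absolute median deviation coefficient

print(stddev, cv, mad, mad_coef)
```

Because both coefficients are ratios, they let us compare the dispersion of fields measured on entirely different scales.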
The mean and median are estimates for the dataset as a whole. When we assess the position of a single number within it, the best way is to use quantiles or quartiles.
Quantile and quartile calculation method:
* | select approx_percentile(key, 0.1) as p10, approx_percentile(key, 0.9) as p90
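A quick Python sketch of the same p10/p90 computation, using the standard library's `statistics.quantiles` on made-up data (its default "exclusive" method interpolates slightly differently than `approx_percentile`, which is itself an approximation):

```python
import statistics

# Hypothetical values 1..100 for a field named key
key = list(range(1, 101))

# n=10 produces the nine decile cut points; indices 0 and 8 are p10 and p90
deciles = statistics.quantiles(key, n=10)
p10, p90 = deciles[0], deciles[8]

print(p10, p90)
```

Quantiles tell us, for a single observed value, what fraction of the data lies below it.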
The preceding indicators assess a single variable. When analyzing two or more variables, we pay attention to the relationship between them. We can calculate the correlation coefficient of two columns in SQL, or visualize the relationship with scatter diagrams or cross tables.
* | select corr(key1, key2) # Calculate the correlation coefficient
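`corr` computes Pearson's correlation coefficient. Its formula, the covariance divided by the product of the two standard deviations, can be written out directly in Python on hypothetical paired samples:

```python
import statistics

# Hypothetical paired measurements, e.g. request size vs. latency;
# key2 is exactly 2 * key1, so the correlation should be perfect
key1 = [1, 2, 3, 4, 5]
key2 = [2, 4, 6, 8, 10]

m1, m2 = statistics.mean(key1), statistics.mean(key2)

# Pearson's r: covariance over the product of the deviations' norms
cov = sum((a - m1) * (b - m2) for a, b in zip(key1, key2))
r = cov / (
    sum((a - m1) ** 2 for a in key1) ** 0.5
    * sum((b - m2) ** 2 for b in key2) ** 0.5
)

print(r)
```

A value of r near +1 or -1 indicates a strong linear relationship; a value near 0 indicates little linear relationship, though a nonlinear one may still exist, which is why a scatter diagram remains worth plotting.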
This article has introduced the methods of descriptive statistical analysis of data.
Disclaimer: The views expressed herein are for reference only and don't necessarily represent the official views of Alibaba Cloud.