Community Blog Pandas + SLS SQL: Data Pivot with Flexibility and High Performance

Pandas + SLS SQL: Data Pivot with Flexibility and High Performance

This short article discusses Pandas and the advantages/disadvantages of SQL Language.

By Yunlei from Alibaba Cloud Storage

What Is Pandas?

Pandas is a powerful Python data analysis tool and a standard tool for various data modeling. Pandas is good at processing numeric data and time series data. The first advantage of Pandas is that it encapsulates some complex code implementation processes without the need to call the interface, thus avoiding writing a large amount of code. The second advantage of Pandas lies in its flexibility. It can automatically process complex logic in batches, which cannot be done by tools such as Excel. Therefore, Pandas is a data analysis tool between Excel and self-written programs that features flexibility and simplicity.

In terms of input, Pandas supports reading files in various formats, including csv, orc, xml, and json. It also supports reading distributed file system HDFS and MySQL or MySQL-compatible data warehouses through the JDBC protocol. The input data is converted into the data structure DataFrame in the memory, and the subsequent data analysis is performed around DataFrame.

In the output, Pandas can realize impressive visualization effects and connect with many pleasing visualization libraries to realize the interactive effects of dynamic data.

Pandas is a Python scripting language with average performance. It can only process a small amount of data and is not on par with modernized data warehouses in terms of computing power. Can such flexible Pandas analysis be integrated with data warehouses to give data warehouses a more flexible data analysis capability and obtain large-scale data analysis capability simultaneously?

The Advantages and Disadvantages of SQL Language

SQL is currently the most widely used data analysis language. It has become the standard language for various data analysis systems since it was developed by IBM in the 1980s. The reason is SQL is a declarative syntax. Users only need to declare their desired results without the need to specify the process of obtaining the result. This method has two advantages. On the one hand, the calculation results with the highest performance and the lowest cost require writing complex algorithms and understanding the hardware characteristics of the machine, which can only be done by professional database kernel engineers. It is too complicated for data analysts. Therefore, the declarative syntax unburdens data analysts, reduces the threshold of data analysis, and expands the users of SQL. On the other hand, without specifying the running process, the database kernel engineers are given more freedom to generate the best execution plan. This is the advantage of SQL.

The theoretical basis of SQL stems from relational algebra. The object of any operation is a relationship, and the result of any operation is also a relationship. Relationship + operation generates a new relationship. The user can see a relationship entity at any time. This set of strong theoretical basis allows an SQL statement to extend indefinitely. Furthermore, a relationship can be obtained at any time and turn into another relationship with one more operation added.

Since SQL is based on relational algebra and model, we can think of the entity of relationships in the relational model as a two-dimensional table containing multiple rows and columns. The number of rows is unlimited, while the number of columns is limited. The number of rows is dynamic, which can be 0 or infinite. The number of columns is static and unchangeable. Regardless of whether there is data or not, it is a fixed number of columns output. The method of static columns limits the application of SQL in some scenarios. Two typical scenarios are matrix transposition and pivot table (cross table) generation. In both scenarios, the number of columns is dynamic. Therefore, SQL needs programming partially to achieve complete data analysis.

Advantages of Log Service (SLS) SQL


SQL is a syntax representation and a language for users to interact with the data warehouse system. The real power of the data warehouse lies in its kernel. SLS log data warehouse uses SQL as the syntax interface. With the cloud-native distributed architecture, it can implement query-level elastic analysis capabilities together with the analysis of hundreds of billions of data at one time.

Pandas features analysis flexibility, and SLS provides powerful SQL analysis capabilities. The combination of the two allows users to enjoy the powerful SQL analysis capability of SLS and the flexible data analysis and analysis library of Pandas. How do we combine the two?

Pandas Connected with SLS for Fusion Analysis

Pandas supports the JDBC interface to read data, and SLS also supports the JDBC protocol. Thus, Pandas can connect to SLS through the JDBC protocol. Pandas passes the calculations to SLS through SQL for large calculations in analysis tasks. Secondary analysis and visualization are performed on Pandas for flexible analysis and analysis that SQL fails to complete. For example, build a pivot table or a cross table. First, use SQL to complete the cross calculation of two dimensions, which often requires a large number of calculations. Then, use Pandas to complete the row and column conversion and display it as a two‑dimensional table.

For example:

import numpy as np
import pandas as pd
import pymysql

# sql Command
dbname=""  # project is database
sql_cmd = "select method,status ,count(1) as pv from access_log group by method, status limit 1000"
con = pymysql.connect(host=slshost, port=10005,user=username, password=password, database=dbname, charset='utf8', use_unicode=True)
data = pd.read_sql(sql_cmd, con)
tab=pd.pivot_table(data,values="pv",index="status",columns="method" )

In the example, the SQL statement analyzes nginx access logs and calculates the pv of method and status. Then, call the pivot_table function of Pandas to build the pivot table.

The following figure shows the result:


0 1 0
Share on

Alibaba Cloud Community

858 posts | 196 followers

You may also like