The Practice of Semi-Structured Data Processing Based on MaxCompute SQL

By Kong Liang, Product Expert at Alibaba Cloud Intelligence

1. A Basic Introduction to MaxCompute

Alibaba Cloud Solution for Data and Analysis

As an enterprise-level data warehouse service, MaxCompute centrally stores and manages enterprise data assets, processes and analyzes data for data applications, and transforms data into business insights. MaxCompute can build rich data applications through flexible combination with the internal and external services of Alibaba Cloud. MaxCompute provides a fully-managed data and analysis solution that simplifies platform O&M, reduces management investment, improves business-oriented service capabilities, and accelerates value realization.

As shown in the following figure, MaxCompute is at the core. MaxCompute is a data warehouse and an ultra-large-scale cluster product based on Serverless architecture, with enterprise-level capabilities, such as security management. Previously, it was an offline data processing platform. Now, it is an enterprise-level data warehouse with business intelligence (BI) analysis capabilities.

Generally, offline enterprise data comes from data sources, including relational databases, unstructured storage, big data storage, and message queues. The data can enter the data warehouse through offline data integration and data batches. Various analysis applications in the frontend can also provide real-time analysis and query acceleration capability through MaxCompute-Hologres. This is an offline scenario. For real-time procedures, data is sent to Realtime Compute for Apache Flink from Message Queue through DataHub. Then, it is forwarded to Hologres and then to the foreground. The data is processed by Hologres, MaxCompute, and Flink. This is called stream-batch unification. Based on the data warehouse, MaxCompute extends some federated query capabilities of databases, including Data Lake, MySQL, and HBase. By doing so, it achieves the data processing capabilities of data lakes, such as data integration, metadata synchronization, and query. MaxCompute also includes the capability of Machine Learning Platform for AI (PAI), which is based on Alibaba Cloud's data warehouse. These form a complete basis for big data. On top of this, DataWorks provides an all-in-one development and governance platform that can perform task scheduling, lineage management of metadata, and data development.

An Introduction to MaxCompute

MaxCompute is a multi-functional, high-performance, and easy-to-use data warehouse service.

MaxCompute is built with comprehensive enterprise-level security and management capabilities and supports an open data ecosystem. It is a unified platform that meets the needs of multiple scenarios (data warehouse, BI, data lake analysis, and machine learning). It is widely used in scenarios, such as data-based operations, user profiling and recommendation, and intelligent prediction.

MaxCompute provides unified storage and computing scheduling based on IaaS at its underlying layer. Its storage and scheduling capabilities are supported by Apsara Distributed File System (Pangu) and Job Scheduler (Fuxi), respectively. The storage and computing resources are encapsulated, which can be used by upper-layer applications as a resource pool without perception. The upper-layer applications only need to use this resource group without knowing what resources the specific tasks are running on. MaxCompute provides the storage capability of a structured data warehouse and the processing capability of a data lake for open-format, semi-structured, and unstructured data. For users, all usage scenarios are in the project. Tenant isolation is supported among different projects. Each project can use personal resources or share resources with others. Projects can synchronize data directly through secure sharing. Unified access authentication, management, security, monitoring, and metadata capabilities are used in the upper layers.

MaxCompute contains the following features:

Simple and easy-to-use SQL end-to-end development mode, support for Spark, distributed Python (Mars), and other open-source technology stacks, built-in complete enterprise management capabilities, and open interfaces – all of which are open and easy to use.
Unified metadata, unified data storage, and unified enterprise data asset, a cloud-native multi-tenant system, and the most efficient cross-business and cross-organization data connection and sharing
Self-adaptive and on-demand elastic resources to match business scale changes accurately and avoid waste or resource shortages. Isolation of business loads to eliminate inter-business resource competition
Storage and computing are independently scaled. Support for terabyte-level to exabyte-level storage expansion. Connection with a wide range of external data sources to carry out federated computing
Deep optimization, automatic data warehouse, integration with optimization capabilities accumulated during Double 11 over the years, intelligent tuning, and expert service

An Introduction to MaxCompute Functions

The functions of MaxCompute can be classified into the following categories. The core storage and computing capabilities are not open. MaxCompute mainly uses database tables for storage, and its computing resources are reflected in SQL tasks or other computing models. With the computing capability of SQL, you can complete the end-to-end data processing and data model management of the entire data warehouse, including some basic data types, partitioning of internal tables, and processing of unstructured data in external tables. MaxCompute supports streaming write, streaming upsert for data insertion, data deletion, and other capabilities. You can use very complex query methods to query and use Explain keyword in SQL to see the parsing plan. On the UDF side, MaxCompute supports Java UDF, Python UDF, and Content Moderation UDF.

Now, let's discuss the management capabilities. The dark parts of the following figure show independent enhancement packages on the private cloud that provide additional computing capability for enterprises. While on the public cloud, management capabilities are provided to users in Serverless mode, and it only charges for resources. Management capabilities also include billing capabilities, such as subscription and pay-as-you-go billing. Task management is provided for tasks, query acceleration for queries, and enterprise-level capabilities like large-scale cross-domain computing for deployment on the private cloud. In addition to the SQL engine, MaxCompute has vector retrieval, TensorFlow, Spark, Mars, and Hologres. Based on the underlying storage and computing resources of MaxCompute, they can use different engines to provide capabilities for various scenarios. In the compliance governance section, some capabilities of the gray parts are provided by DataWorks. These include data warehouse security management, metadata management, auditing, data encryption, data masking, and data quality. Tools provided by SDK/API and MaxCompute are used to complete data development, data upload, and data download. Some third-party and internal applications are used to construct the entire data warehouse ecosystem.

Main Solutions Provided by MaxCompute

Enterprise Data Warehouse and Data Middle Platform

Raw data is organized into widely available knowledge for subsequent use. Some examples are listed below:

Integrated Storage: Collect, store, and centrally manage data within and outside the enterprise
Processing and Analysis: Clean, process, and integrate multi-party data; data statistics and mining for business requirements; unified storage and processing can provide storage and computing capabilities featuring auto scaling and lower costs.
Standardization: Establish an enterprise data warehouse model (with different layers and themes), establish data standards and form reusable data assets, carry out security and cost governance from the data lifecycle through data governance, and ensure continuous data quality and standardization
Data Sharing: Ensure the circulation and sharing of standard data within the enterprise, avoid data silos, and deliver the maximum value of interconnected data

The data middle platform is not only a technical platform. It also includes organizational and management process requirements. It emphasizes serving the business with public data products, realizing business-oriented data. This can be considered as a best practice for a data warehouse. The combination of MaxCompute + DataWorks works as an out-of-the-box data warehouse solution.

BI Analysis and Data Analysis

BI analysis does not necessarily use a data warehouse. It can be based directly on transaction database analysis.

A data warehouse can help provide comprehensive data from the enterprise view that BI analysis needs.

BI analysts can retrieve and understand data better through data asset management of the data warehouse.

With its powerful performance, a data warehouse can also meet the needs of multi-user concurrency and data analysis on different scales.

MaxCompute provides centralized data management, processing, and analysis. It can connect to BI directly or synchronize data to an external analytical database for OLAP analysis.

Predictive Analysis and Intelligent Application

The data warehouse is increasingly integrated with AI. The data warehouse processes and prepares data for machine learning. Machine learning can perform model training and data prediction. The results can be persisted directly in the data warehouse for knowledge sharing. Things like user profile analysis can predict customer gender and preferences.

MaxCompute seamlessly integrates PAI and SparkML. With MaxCompute, you can build intelligent applications based on machine learning, such as CTR estimation and personalized recommendation, using enterprise data in one platform without data migration.

2. Semi-Structured Data Processing of MaxCompute

What Is Semi-Structured Data?

The topic of this article is the semi-structured processing capability of MaxCompute. First, let's look at what semi-structured data is. Structured data is row-oriented data. It is stored in the database and can be expressed logically in a two‑dimensional table structure. Unstructured data includes office documents, text, images, XML, HTML, various reports, images, and audio/video files in all formats.

Semi-structured data is data between fully structured data (such as data in a relational database and object-oriented database) and completely unstructured data (such as sounds and image files.) An HTML file is a kind of semi-structured data. It is generally self-describing and frequently changing, and its structure and content are mixed together. It is generally represented by three components, including tags, types, and object values.

We will compare the three data types above through data models:

Structured Data: Relational (two‑dimensional table)
Semi-Structured Data: A directed graph structure with root. It is composed of node set and arc set. (tree and graph)
Unstructured Data: None

Finally, we will look at the definition of them from Wikipedia. Semi-structured data is characterized by complex structures. It is changeable and needs to extract data from the self-describing structure for computing.

Semi-structured data is a form of structured data that does not obey the tabular structure of data models associated with relational databases or other forms of data tables, but nonetheless contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data. Therefore, it is also known as a self-describing structure.

In semi-structured data, the entities belonging to the same class may have different attributes even though they are grouped together, and the attributes' order is not important.

Semi-structured data are increasingly occurring since the advent of the Internet, where full-text documents and databases are not the only forms of data anymore, and different applications need a medium for exchanging information.

Semi-structured data is widely used because:

It is concise, simple, and small in size.
It is easy and efficient to use.
It provides cross-language capabilities. It can be used by frontend and backend interaction interfaces, as configuration files, or for file storage in Web projects. The popularity of mobile applications drives the use of semi-structured data, such as JSON data. (Please see the following figure for JSON data examples. This figure comes from the Internet.) These data can be used as data sources and stored in a data warehouse for analysis.

Therefore, semi-structured data processing is a typical application scenario of a data warehouse.

Semi-Structured Data Processing

The general order of a data warehouse is data source, data processing, and data storage. This process is the main procedure of semi-structured processing. In the upper layers, the data processing may have been completed, and the result is oriented to the application. So, its semi-structured feature is not obvious.

There are two methods for semi-structured data processing. One is to synchronize the data into a field. Each time we use it, we can use some complex types. JSON functions can also extract directly on demand, but the data is put into one field. The advantage is that we do not have to consider the changes in the structure of semi-structured data. The disadvantage is poor performance. Each time we have to choose a suitable processing function and method, and the development is complicated. Every kind of data is stored in string format, which is large. For example, MaxCompute generally supports 8 megabytes, but MaxCompute can also support 256 megabytes to deal with this disadvantage. Another method is to split data into a wide table according to the JSON structure during import or batch processing. Then, we can modify or rebuild the table structure as the JSON structure is modified. The advantage of this method is that storage and computing can both be optimized. However, the disadvantage is that the table structure is often modified, which is inconvenient.

Semi-Structured Data Processing of MaxCompute

MaxCompute provides the following four capabilities to process semi-structured data.

Complex Types for Semi-Structured Data Storage

First, MaxCompute provides complex data types to store the corresponding semi-structured data.

Schema Evolution Corresponding to the Modification Defined by the Semi-Structured Data Structure

It also provides the capability of schema evolution to modify tables and nest columns, including:

Delete a column
Add a column
Change the sequence of columns in a table
Change the name of columns
Change the type of data in columns (compatible type)
Modify the nested structure of complex types (same as modifying the table structure)

Semi-Structured Data Processing Function for Processing Semi-structured Data Node Values

MaxCompute SQL has added a large number of built-in functions to improve the processing capability and efficiency of complex data types, such as ARRAY, MAP, and STRUCT. You can use built-in functions to process complex input or output data types after function processing.
It also provides higher-order functions to enhance the processing capability of complex data types. Compared with ordinary functions whose input parameter can only be data, the input parameter of higher-order functions can be a function. Therefore, higher-order functions can process the data of complex data types and use Lambda expressions to simplify logic syntax expression.

Computing Directly Using Semi-Structured Data Node Values

CREATE TABLE evol_t2 (id int, name struct<given: string, family: string>,phones array<struct<type: string, num: string>>) ;
 insert into table evol_t2 select 1, STRUCT('Bill', 'Gates'), array(STRUCT('work', '1234567890'),STRUCT('cell', '9876543210'));
insert into table evol_t2 select 2, STRUCT('Michael', 'Jordan'), array(STRUCT('work', '1111111111'),STRUCT('cell', '9999999999'));

The insertion result is listed below:

select name.given as firstname,c.phones[1].num as phonenum
from evol_t2 c
where c.phones[1].type = 'cell';

The query result is listed below:

Table Structure Modification of MaxCompute

MaxCompute can modify the table structure flexibly. It supports the schema changes of semi-structured data sources and facilitates the adjustment of the data warehouse modeling model. This is convenient for adding and eliminating fields in the existing table and putting the same fields together or modifying the types.

Syntax definitions and examples are listed below:

Delete Column

ALTER TABLE <table_name> DROP COLUMN <column_name>;
create table if not exists evol_t(id bigint,value1 bigint,value2 bigint);
ALTER TABLE evol_t DROP COLUMN value2;

Add Column

ALTER TABLE <table_name> ADD COLUMNS (col_name1 type1[, col_name2 type2...]);
create table if not exists evol_t(id bigint,value1 bigint,value2 bigint);
ALTER TABLE evol_t ADD COLUMNS value3 STRING;

Note: The new column does not follow the specified order and is placed at the end of the table by default.

Change Column Order

ALTER TABLE <table_name>
CHANGE COLUMN <original_column_name> <new_column_name> <column_type> AFTER <column_name>;
create table if not exists evol_t(id bigint,value1 bigint,value2 bigint);
ALTER TABLE evol_t CHANGE COLUMN value2 value3 bigint AFTER id;

Note: BEFORE keyword is not supported currently, but its function can be implemented through AFTER. If necessary, we will add it in subsequent updates.

Change Column Name

ALTER TABLE <table_name>
CHANGE COLUMN <original_column_name> RENAME TO <new_column_name>;

Structure Modification of Complex Data Types in MaxCompute

The schema of nested columns in each layer of complex data types also supports flexible modification. Nested columns and table structures can enjoy the convenience of optimized performance of column storage and direct query.

CREATE TABLE evol_t (id int, point struct<x: double, y: double>) ;
ALTER TABLE evol_t ADD COLUMNS (points_map map<string, struct<x: double, y: double>>);
ALTER TABLE evol_t ADD COLUMNS (points_arr array<struct<x: double, y: double>>);

Since all nested columns are treated and detected as a nested table, nested columns can also obtain the following capabilities:

Modification in the table structure (add column, delete column, change column name, change column order, and change data type)
Finer column storage and compression
Storage and compute optimization for data types
Computing using node values directly
Richer functions for semi-structured data processing

Processing Functions for Complex Data Types in MaxCompute

A variety of complex processing functions for data of complex types facilitate the direct processing of semi-structured data. Additional easy-to-use functions are coming soon. Learn more: https://www.alibabacloud.com/help/doc-detail/27827.htm

Higher-Order Functions of MaxCompute Support Lambda Expressions

The higher-order functions for processing data of complex types include:

ANY_MATCH, ALL_MATCH, ARRAY_REDUCE, ARRAY_SORT, FILTER, TRANSFORM, ZIP_WITH, MAP_FILTER, MAP_ZIP_WITH, TRANSFORM_KEYS, TRANSFORM_VALUES

They support Lambda expressions, which simplify the expression of processing data of complex types. The descriptions and examples of some functions are listed below:

Determine whether any element in the array (1, 2, -10, 100, -30) meets the condition of x-> x > 3. The following statements show an example:
-- Return true. 
selectany_match(array(1, 2, -10, 100, -30), x-> x > 3);
Filter the elements in array a by using func to return a new array. 
-- Return [2,3]. 
selectfilter(array(1, 2, 3), x -> x > 1);
Merge the elements of array a and array b by position by using combiner at the element level to return a new array. 
-- Return [2, 4, 6, NULL]. 
selectzip_with(array(1,2,3), array(1,2,3,4), (x,y) -> x + y);
Filter the input elements of the MAP object and retain only the elements that meet the predicate condition. 
-- Return {-30:100, 20:50}. 
selectmap_filter(map(10, -20, 20, 50, -30, 100, 21, null), (k, v) -> (k+v) > 10);
Transform the input elements of the MAP object, keep keys unchanged, and use func to calculate the new values. 
-- Return {-30:71, 10:-10, 20:NULL}. 
selecttransform_values(map(10, -20, 20, null, -30, 101), (k, v) -> k + v);

3. Practical Demonstration

Semi-Structured Data Processing and Schema Evolution of MaxCompute

Click here to view the demo.

Note: The language in the video is Chinese, but a large part of the information in this article is related to it. Please use it as a reference for the related information.

Feature Description

We list several common examples of commands to modify the node schema of complex types below:

-- Add a column to struct.
ALTER TABLE evol_t ADD COLUMNS (point.z double);
-- The value of map is struct. Add a new column.
ALTER TABLE evol_t ADD COLUMNS (points_map.value.z double);
-- The element of array is struct. Add a new column.
ALTER TABLE evol_t ADD COLUMNS (points_arr.element.z double);

-- Some usage examples
-- Add column
ALTER TABLE evol_t2 ADD COLUMNS (phones.element.type2 string);
-- Delete column
ALTER TABLE evol_t2 DROP COLUMNS (phones.element.type);
-- Change name
ALTER TABLE evol_t2 CHANGE COLUMN phones.type2 phones.type0 string;
-- Change order
ALTER TABLE evol_t2 CHANGE phones.num phones.num string AFTER type0;

Demo Script of the Complete Feature

DROP  table evol_t2;

CREATE TABLE evol_t2 (id int, name struct<given: string, family: string>,phones array<struct<type: string, num: string>>) ;
insert into table evol_t2  select 1, STRUCT('Bill', 'Gates'), array(STRUCT('work', '1234567890'),STRUCT('cell', '9876543210'));
insert into table evol_t2  select 2, STRUCT('Michael', 'Jordan'), array(STRUCT('work', '1111111111'),STRUCT('cell', '9999999999'));
select * from evol_t2;

ALTER TABLE evol_t2 ADD COLUMNS (position map<string, struct<x: double, y: double>>);
insert into table evol_t2  select 3, STRUCT('Michael', 'Jackson'), array(STRUCT('work', '1231231231'),STRUCT('cell', '1231231233')),map('p1',struct(1.1,1.2),'p2',struct(1.5,1.3));
select * from evol_t2;

ALTER TABLE evol_t2 ADD COLUMNS (position.value.z double);
insert into table evol_t2  select 4, STRUCT('Ming', 'Yao'), array(STRUCT('work', '5555555555'),STRUCT('cell', '6666666666')),map('p1',struct(5.5,1.0,12.0),'p2',struct(6.5,3.0,8.1));
select * from evol_t2;

ALTER TABLE evol_t2 DROP COLUMNS (phones.element.type); --Delete column
select * from evol_t2;

ALTER TABLE evol_t2 ADD COLUMNS (phones.element.type2 string);  --Add column
select * from evol_t2;

ALTER TABLE evol_t2 CHANGE COLUMN phones.type2 phones.type0 string; --Change name
select * from evol_t2;

insert into table evol_t2  select 5, STRUCT('Lei', 'Li'), array(STRUCT('9999999999','work'),STRUCT('8888888888','cell')),map('p1',struct(9.5,6.0,10.0),'p2',struct(5.5,2.0,3.0));
select * from evol_t2;

ALTER TABLE evol_t2 CHANGE phones.num phones.num string AFTER type0; --Change order
select * from evol_t2;

select name.given as firstname,c.phones[1].num as phonenum
from evol_t2 c
where c.phones[1].type0 = 'cell';


select c.name.family||c.name.Given,c.phones[1].num,SQRT(POW(position['p2'].x-position['p1'].x,2)+POW(position['p2'].y-position['p1'].y,2)+POW(position['p2'].z-position['p1'].z,2))
from evol_t2 c
where name.given in ('Ming','Lei');

Currently Supported Conversion Relationships of Data Types

In the near future, a feature for type modification will be launched in a gray release. The following data type conversion relationships are supported:

4. Evolution Direction

Continually Enhanced Functions and Evolution Direction

More processing functions for complex data types
More compatible conversion relationships for data types
Identify schemas of complex data types and optimize after being stored automatically
More flexible extraction and calculation of node values
Column analysis capabilities with higher performance