PolarDB: Efficiently analyze JSON-formatted data

Last Updated: Dec 15, 2025

This topic describes the extended stream computing solution that is developed based on the In-Memory Column Index (IMCI) feature of PolarDB to analyze large amounts of structured data and semi-structured data. This solution integrates various features of IMCI such as column-oriented storage for JSON-formatted data, virtual columns, instant DDL, and the extended maximum number of columns in a table. This topic also describes some customer use cases.

Background information

To adapt to diversified and rapidly iterating business scenarios, most business systems are built to store and analyze semi-structured data. The IMCI feature of PolarDB provides comprehensive capabilities such as virtual columns and column-oriented storage for JSON-formatted data. IMCI can quickly process large amounts of data of multiple types, including structured data and semi-structured data, and provides efficient data analysis, data query, and stream computing capabilities. It is suitable for scenarios such as data analysis, data warehousing, and extended stream computing.

This topic describes how to compute and analyze large amounts of semi-structured data, such as JSON-formatted data, by using the real-time data analysis and extended stream computing solutions developed based on the IMCI feature of PolarDB. It first reviews the solutions to semi-structured data analysis in traditional databases and data warehouses, and then details the relevant IMCI features: column-oriented storage for JSON-formatted data, virtual columns, instant DDL, and the extended maximum number of columns in a table.

Solutions

Most business systems analyze JSON-formatted data because semi-structured data is flexible but must still be analyzed efficiently. Therefore, flexibility and efficiency are the key metrics for evaluating solutions to semi-structured data analysis.

Solution in traditional databases

In traditional databases such as MySQL, PostgreSQL, and ClickHouse, raw JSON-formatted data is encoded into binary data and then stored in the JSON columns of tables. During queries, JSON functions are used to parse and compute all data in JSON columns in real time.

JSON-formatted data is a type of semi-structured data. You can add, remove, and modify the attributes of JSON-formatted data based on your business requirements. If business requirements change, you only need to dynamically add, remove, or modify the relevant attributes for the new data in JSON columns, without the need to modify the schemas of tables. This effectively reduces the cost of maintaining and managing table schemas. However, all data in JSON columns is parsed in real time during queries. This consumes a large amount of I/O resources and may cause repeated parsing and computing. In addition, you cannot create or use secondary indexes for specific fields in JSON columns.

select product.item->"$.name"
from product, purchase
where product.id = purchase.item->"$.id"
group by product.item->"$.name";

For example, when the preceding SQL statement is executed in a traditional database as a nested loop join, each row read from the product table triggers a scan of the entire item column of the purchase table, and the JSON data in the item column is re-parsed each time to extract the specified fields. This lowers query efficiency.

Solution in traditional data warehouses

Data processing in a data warehouse consists of the following steps:

  1. Data extraction: extracts required data from various data sources, including databases, files, and web services, and cleanses, transforms, and filters the extracted data.

  2. Data transformation: transforms the extracted data to ensure that the data conforms to the data model and specifications of the data warehouse. Relevant operations include data cleansing, data integration, data transformation, data enhancement, and data aggregation.

  3. Data loading: loads the transformed data into the data warehouse, including dimension tables and fact tables.

  4. Data management: manages the data in the data warehouse, including data backup, data restoration, and data security.

  5. Data analysis: performs multidimensional analysis on the data in the data warehouse, including data queries, report generation, and data mining.

In most cases, the production data of a business system is processed by extract, transform, and load (ETL) jobs based on business requirements before the data is imported to a data warehouse.

During data processing in a data warehouse, JSON-formatted data is pre-parsed in ETL jobs to compute the corresponding values as needed, and then the values are inserted into a separate column of a table. This way, the attributes of JSON-formatted data can be processed based on business requirements and inserted into the columns of a large wide table. This improves the query performance. During queries, data can be read from regular columns, without the need to read or parse all data in JSON columns. This saves a large amount of I/O resources. In addition, you can create and use indexes for regular columns to effectively improve the query performance.
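For example, the following is a minimal sketch of this approach. The table and column names are hypothetical: an ETL job pre-parses attributes such as $.delivery.width from the raw JSON and writes them into dedicated, indexable columns of a wide table.

-- Hypothetical wide table populated by an ETL job: JSON attributes are
-- pre-parsed into regular columns, which can be indexed directly.
create table produce_wide (
 id bigint NOT NULL,
 delivery_width double,   -- extracted from $.delivery.width
 delivery_height double,  -- extracted from $.delivery.height
 delivery_weight double,  -- extracted from $.delivery.weight
 PRIMARY KEY (id),
 KEY idx_width (delivery_width)
);

-- Queries read regular columns and can use the index; no JSON parsing occurs.
select count(*) from produce_wide where delivery_width > 10;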

However, if you add, remove, or modify the attributes of JSON-formatted data based on business requirements in your business system, the ETL jobs and table schemas in your data warehouse must be modified to adapt to the upstream production data. Operations such as republishing ETL jobs and executing DDL statements to add or remove columns are performed to maintain the logic of ETL jobs and business table schemas. The frequent publishing of ETL jobs affects data consumption in the upstream and data warehousing in the downstream. In addition, modifying the schemas of large tables is costly without the instant DDL feature and also affects normal data queries.

The solution to semi-structured data analysis in traditional data warehouses provides efficient data queries, but lacks flexibility and incurs high maintenance costs.

IMCI solution

The solutions to semi-structured data analysis in traditional databases and data warehouses cannot meet the requirements for query performance and flexible architecture at the same time. To meet the urgent need for a new solution in the industry, Alibaba Cloud provides the IMCI feature of PolarDB.

This section describes various features provided by the IMCI feature of PolarDB, including column-oriented storage for JSON-formatted data, virtual columns, instant DDL, and the extended maximum number of columns in a table.

Column-oriented storage for JSON-formatted data

As a form of data between structured data and unstructured data, semi-structured data is partially structured and lacks a specific relational or tabular data model. Semi-structured data can be described and organized by using structural elements such as tags, markers, and metadata, but its structure and organization can also be dynamically adjusted as the content of the data changes. HTML code on web pages, XML documents, JSON-formatted data, and data in NoSQL databases are examples of semi-structured data. The flexibility and scalability of semi-structured data make it an indispensable part of the big data era.

PolarDB for MySQL is a relational database management system that stores structured data. PolarDB for MySQL also supports the storage and query of semi-structured data, such as data in the XML or JSON format. The IMCI feature of PolarDB fully supports JSON-formatted data and column-oriented JSON functions. It uses the binary JSON format to store semi-structured data, and uses column-oriented JSON functions to parse, query, modify, and delete JSON-formatted data. IMCI is fully compatible with MySQL syntax.

The IMCI feature of PolarDB uses a simple binary format to store JSON-formatted data by column, and uses RapidJSON to parse JSON-formatted data. During data processing, data is read on demand and the compression technology for column-oriented storage is used. This effectively reduces the amount of consumed I/O resources. Also, technologies such as single instruction multiple data (SIMD), vectorization, and parallelism are used to accelerate computing.

In this example, test data is used to show you how to use column-oriented storage for JSON-formatted data and compare the query performance between row-oriented storage and column-oriented storage.

  1. When you create a table, add a JSON column and create an IMCI on the JSON column.

create table produce (
 id bigint(20) NOT NULL,
 attributes json DEFAULT NULL
) comment='columnar=1';
  2. Use a column-oriented JSON function to query data.

select count(*)
from produce
where attributes->"$.delivery.width" > 10 and attributes->"$.delivery.height" > 10 and attributes->"$.delivery.weight" > 10;

The following sample code shows the execution plan of the query:

Project | Exprs: temp_table1.COUNT(0)
 HashGroupby | OutputTable(1): temp_table1 | Grouping: None | Output Grouping: None | Aggrs: COUNT(0)
 CTableScan | InputTable(0): produce | Pred: ((JSON_EXTRACT(produce.attributes, "$.delivery.width") > "10(json)") AND (JSON_EXTRACT(produce.attributes, "$.delivery.height") > "10(json)") AND (JSON_EXTRACT(produce.attributes, "$.delivery.weight") > "10(json)"))

The following table compares the SQL execution duration between row-oriented storage and column-oriented storage on a produce table in PolarDB that contains tens of millions of rows.

Storage model                                        | SQL execution duration
Row-oriented storage                                 | 9.29 seconds
Column-oriented storage with IMCI enabled (32 cores) | 0.14 seconds

The preceding test results show that the query performance of column-oriented storage is higher than that of row-oriented storage by close to two orders of magnitude. The IMCI feature of PolarDB can analyze JSON-formatted data more efficiently. The query performance of online business may vary based on datasets and query modes. The test data is for reference only.

Virtual columns

Virtual columns are a special type of column: their values are not inserted or updated but are dynamically computed, merged, or filtered based on the values of other columns in the same table. Virtual columns can be used for data queries and indexing, but cannot be directly written to. Virtual columns allow you to quickly access and process derived data without recomputing it for each query. This optimizes data queries and simplifies operations.

To implement virtual columns, the IMCI feature of PolarDB supports two types of generated columns: virtual generated columns and stored generated columns. By default, virtual generated columns are used. The computed values of virtual generated columns are persistently stored in column-oriented tables, but not in row-oriented tables. Each time a row-oriented table is read, the values of virtual generated columns are computed in real time. The computed values of stored generated columns are persistently stored in column-oriented tables and row-oriented tables, but occupy more disk space. We recommend that you use virtual generated columns when you use the IMCI feature of PolarDB. Virtual generated columns not only save disk space but also improve the performance of column-oriented storage.

The following sample code shows the syntax used to define a virtual column:

col_name data_type [GENERATED ALWAYS] AS (expr)
 [VIRTUAL | STORED] [NOT NULL | NULL]
 [UNIQUE [KEY]] [[PRIMARY] KEY]
 [COMMENT 'string']
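As a minimal sketch of the difference between the two types of generated columns, the same expression can be defined as either a virtual or a stored generated column. The table and column names here are hypothetical:

create table orders_demo (
 id bigint NOT NULL,
 attributes json DEFAULT NULL,
 -- VIRTUAL: persisted only in the column-oriented table; computed on read in the row-oriented table.
 total_v double GENERATED ALWAYS AS (attributes->'$.price' * attributes->'$.quantity') VIRTUAL,
 -- STORED: persisted in both the row-oriented and column-oriented tables; occupies more disk space.
 total_s double GENERATED ALWAYS AS (attributes->'$.price' * attributes->'$.quantity') STORED
) comment='columnar=1';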

In this example, test data is used to show you how to use virtual columns and compare the query performance between row-oriented storage and column-oriented storage.

  1. When you create a table, add a virtual column and create an IMCI on the virtual column.

create table produce (
 id bigint(20) NOT NULL,
 attributes json DEFAULT NULL,
 `delivery_volume` double GENERATED ALWAYS AS (((json_extract(`attributes`,'$.delivery.width') * json_extract(`attributes`,'$.delivery.height')) * json_extract(`attributes`,'$.delivery.weight'))) VIRTUAL
) comment='columnar=1';
  2. Query data on regular columns and the virtual column.

    • Query data on regular columns.

      select count(*)
      from produce
      where (attributes->"$.delivery.width" * attributes->"$.delivery.height" * attributes->"$.delivery.weight") > 1000;

      The following sample code shows the execution plan of the query:

      Project | Exprs: temp_table1.COUNT(0)
       HashGroupby | OutputTable(1): temp_table1 | Grouping: None | Output Grouping: None | Aggrs: COUNT(0)
       CTableScan | InputTable(0): produce | Pred: ((CAST JSON_EXTRACT(produce.attributes, "$.delivery.width")/JSON as DOUBLE(38, 31)) * (CAST JSON_EXTRACT(produce.attributes, "$.delivery.height")/JSON as DOUBLE(38, 31)) * (CAST JSON_EXTRACT(produce.attributes, "$.delivery.weight")/JSON as DOUBLE(38, 31)) > 1000.000000)

      The following table compares the SQL execution duration on the regular columns of a produce table in PolarDB that contains tens of millions of rows between row-oriented storage and column-oriented storage.

      Storage model                                        | SQL execution duration
      Row-oriented storage                                 | 13.43 seconds
      Column-oriented storage with IMCI enabled (one core) | 5.72 seconds
      Column-oriented storage with IMCI enabled (32 cores) | 0.24 seconds

    • Query data on the virtual column.

      select count(*)
      from produce
      where delivery_volume > 1000;

      The following sample code shows the execution plan of the query:

      Project | Exprs: temp_table1.COUNT(0)
       HashGroupby | OutputTable(1): temp_table1 | Grouping: None | Output Grouping: None | Aggrs: COUNT(0)
       CTableScan | InputTable(0): produce | Pred: (produce.delivery_volume > 1000.000000)

      The following table compares the SQL execution duration on the virtual column of a produce table in PolarDB that contains tens of millions of rows between row-oriented storage and column-oriented storage.

      Storage model                                        | SQL execution duration
      Row-oriented storage                                 | 14.30 seconds
      Column-oriented storage with IMCI enabled (one core) | 0.03 seconds
      Column-oriented storage with IMCI enabled (32 cores) | 0.01 seconds

The preceding test results show that the IMCI feature of PolarDB can effectively improve the query performance by using virtual columns. The query performance of online business may vary based on datasets and query modes. The test data is for reference only.

The virtual column feature of PolarDB for MySQL is flexible and powerful. When you process semi-structured data, such as JSON-formatted data, you can store irregular data as structured data by using virtual columns. This eliminates the need to run ETL jobs for data processing and allows you to use traditional SQL syntax for data queries and analysis. Virtual columns simplify complex computing and queries, improve the flexibility of your database architecture, and avoid data redundancy caused by row-oriented storage. After you create IMCIs on virtual columns, you can use the pruner mechanism of column-oriented storage to filter data, which effectively improves the query performance.

Instant DDL

If you need to add or remove columns for semi-structured data based on your business requirements, you can execute DDL statements to modify the table. In this case, a feature is required to help you efficiently add or remove columns. Note that you do not have to modify the table schema every time the structure of the semi-structured data changes: if the data on your table is not frequently queried, you can use column-oriented JSON functions to compute the data in real time for each query. In most cases, real-time computing meets query performance requirements when the IMCI feature of PolarDB is used for JSON-formatted data.

The IMCI feature of PolarDB provides the instant DDL feature to help you instantly add or remove virtual columns, during which read and write operations can be performed and normal data queries are not affected.

The following sample code shows the syntax used to add a virtual column by using the instant DDL feature:

alter table produce add column delivery_volume DOUBLE AS (attributes->"$.delivery.width" * attributes->"$.delivery.height" * attributes->"$.delivery.weight");

The following sample code shows the syntax used to remove a virtual column by using the instant DDL feature:

alter table produce drop column delivery_volume;
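If you want to ensure that an ALTER operation does not silently fall back to a slower algorithm, you can explicitly request the instant algorithm by using the MySQL 8.0-compatible ALGORITHM clause. The following is a minimal sketch; the delivery_density column is hypothetical:

-- Explicitly request the instant algorithm; the statement returns an error
-- instead of falling back to a slower algorithm if it cannot run instantly.
alter table produce
 add column delivery_density DOUBLE AS (attributes->"$.delivery.weight" / (attributes->"$.delivery.width" * attributes->"$.delivery.height")) VIRTUAL,
 ALGORITHM=INSTANT;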

Extended maximum number of columns in a table

After the attributes of semi-structured data are processed and inserted into a large wide table as virtual columns, the number of columns in the table grows as the number of attributes increases. The maximum number of columns in a table of a native MySQL database depends on the limits of the storage engine used by the table. For example, the InnoDB storage engine supports a maximum of 1,017 columns in a table. For row-oriented tables, this limit meets most business requirements. During table schema design for a relational database, large wide tables are seldom used, because a table that contains a large number of columns consumes more I/O and memory resources during queries and lowers query efficiency. For example, even if you need to read data from only a few columns, you must read the entire rows. In most cases, large tables are split or small tables are joined to optimize table schemas.

If column-oriented storage is used, however, large wide tables are preferred because they can improve query efficiency and prevent table joins. In column-oriented tables, data is stored by column to achieve better compression. If you need to read data from specific columns, you only need to read those columns. This effectively reduces the amount of consumed I/O resources.

During the processing of semi-structured data, the IMCI feature of PolarDB converts some attributes of semi-structured data as needed and inserts the converted data into separate virtual columns of a table. If semi-structured data contains a large number of attributes, the number of columns in the table may exceed the limit. Therefore, the IMCI feature of PolarDB extends the maximum number of columns in a column-oriented table that uses the InnoDB storage engine to 4,089.

Real-time data analysis

To analyze semi-structured data, the IMCI feature of PolarDB implements features such as column-oriented storage for JSON-formatted data and virtual columns. In this section, the real-time data analysis solution developed based on the IMCI feature of PolarDB is tested by using the real-time event data of GitHub in July 2023.

You can download the real-time event data of GitHub in the JSON format from GH Archive. For example, you can run a wget command to download the data for each hour in July 2023. After the data is downloaded, parse the data and insert the data into a table named github_events.

Define the github_events table based on GitHub event types.

create table github_events (
 id bigint,
 type varchar(16),
 public bit,
 payload json,
 repo json,
 actor json,
 org json,
 created_at datetime
);
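For example, a single archived event might be parsed and inserted as follows. This is a minimal sketch: the IDs and field values are hypothetical, and the JSON documents are truncated for brevity.

insert into github_events values (
 36000000001,                                  -- event ID (hypothetical)
 'WatchEvent',                                 -- event type
 1,                                            -- public flag
 '{"action": "started"}',                      -- payload (truncated)
 '{"id": 2325298, "name": "torvalds/linux"}',  -- repo (truncated)
 '{"id": 583231, "login": "some_user"}',       -- actor (truncated)
 NULL,                                         -- org (absent for this event)
 '2023-07-31 12:00:00'                         -- created_at
);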

Select SQL statements from Everything You Always Wanted To Know About GitHub and rewrite them into the following two queries.

Query the most popular programming languages in a week.

SELECT
    repo_language AS language,
    count(*) AS total
FROM
    github_events
WHERE
    created_at >= "2023-07-25 00:00:00"
    AND created_at <= "2023-07-31 23:59:59"
    AND repo_language IS NOT NULL
GROUP BY
    repo_language
ORDER BY
    total DESC
LIMIT 10;

Rank the repositories that are most frequently starred by users who also starred the torvalds/linux repository.

SELECT repo_name, count(*) AS stars
FROM github_events
WHERE (type = 'WatchEvent') AND (actor_login IN
(
    SELECT actor_login
    FROM github_events
    WHERE (type = 'WatchEvent') AND (repo_name IN ('torvalds/linux')) AND created_at >= "2023-07-31 00:00:00" AND created_at <= "2023-07-31 23:59:59"
)) AND (repo_name NOT IN ('torvalds/linux')) AND created_at >= "2023-07-31 00:00:00" AND created_at <= "2023-07-31 23:59:59"
GROUP BY repo_name
ORDER BY stars DESC
LIMIT 10;

Add virtual columns such as actor_login, repo_name, and repo_language to the github_events table based on the preceding SQL statements and create an IMCI.

alter table github_events
 add column actor_login varchar(256) generated always as (json_unquote(json_extract(`actor`,'$.login'))) virtual,
 add column repo_name varchar(256) generated always as (json_unquote(json_extract(`repo`,'$.name'))) virtual,
 add column repo_language varchar(32) generated always as (json_unquote(json_extract(`payload`,'$.pull_request.base.repo.language'))) virtual,
 comment 'columnar=1';

Set the cache size to 500 GB for row-oriented storage and 128 GB for column-oriented storage. The following table compares the duration for executing SQL statements on hot data between row-oriented storage and column-oriented storage.

Storage model                                        | SQL execution duration of Query 1 | SQL execution duration of Query 2
Row-oriented storage                                 | 45.01 seconds | 8.57 seconds
Column-oriented storage with IMCI enabled (one core) | 0.79 seconds  | 0.54 seconds
Column-oriented storage with IMCI enabled (32 cores) | 0.04 seconds  | 0.07 seconds

The preceding test results show that the real-time data analysis solution developed based on the IMCI feature of PolarDB delivers far better performance than row-oriented storage. You can use this solution to efficiently analyze large amounts of semi-structured data.

Extended stream computing

The extended stream computing solution, developed based on the IMCI feature of PolarDB, is an automated solution that helps you analyze large amounts of semi-structured data. It integrates the IMCI features described in this topic, such as column-oriented storage for JSON-formatted data, virtual columns, instant DDL, and the extended maximum number of columns in a table.

Stream computing is a real-time data processing technology that computes and analyzes continuous data streams as they arrive. As a type of lightweight stream computing, extended stream computing focuses on processing data streams and delivering computing results in real time while minimizing the use of computing resources and reducing system complexity. Compared with traditional stream computing, extended stream computing places more emphasis on lightweight design, quick response, and automation.

The extended stream computing solution developed based on the IMCI feature of PolarDB uses SQL statements to define the processing logic of data streams as expressions or functions in the schemas of column-oriented tables and records the processing logic as virtual columns. The extended stream computing framework then automatically computes data in real time based on business data streams and persistently stores the computed data in the column-oriented tables. During queries, IMCIs are used to read the computed data. The extended stream computing process is integrated with the IMCI feature of PolarDB: you only need to execute DDL statements to define the processing logic of different data streams as virtual columns. If business requirements change, you only need to execute DDL statements to add or modify virtual columns.

To analyze large amounts of semi-structured data, you can use JSON functions and the attributes of JSON-formatted data to define the processing logic of data streams as virtual columns based on your business requirements, and create IMCIs on the virtual columns. The IMCI feature of PolarDB then automatically computes and stores data in real time based on continuous data streams and continuously updates the large wide tables. During queries, you can use IMCIs to read data from the specified virtual columns. This prevents all data in JSON columns from being repeatedly read or parsed, and effectively improves query efficiency. If you use JSON functions to query data, the PolarDB optimizer first checks whether the JSON functions and JSON columns match existing virtual columns. Matched virtual columns are preferentially selected to improve query performance.

After you add, remove, or modify the attributes of JSON-formatted data in your business system, you only need to use the instant DDL feature to add or remove columns. This eliminates the need to run ETL jobs as in traditional data warehouses and lets you modify table schemas instantly. This way, you can flexibly respond to changes in business requirements without additional maintenance costs, and your business is not affected. If the updated data is not frequently queried, you do not need to modify table schemas at all: you can use column-oriented JSON functions to parse the data in real time for each query, which also meets most business requirements.
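The following is a minimal end-to-end sketch of the solution. The table, column, and JSON attribute names are hypothetical:

-- 1. Define the stream processing logic as a virtual column and create an IMCI.
create table orders (
 id bigint NOT NULL,
 attributes json DEFAULT NULL,
 amount double generated always as (attributes->'$.price' * attributes->'$.quantity') virtual
) comment='columnar=1';

-- 2. As business data streams in, the computed values are automatically persisted
--    in the column-oriented table; queries read the virtual column through the IMCI.
select count(*) from orders where amount > 1000;

-- 3. If business requirements change, adjust the processing logic with instant DDL.
alter table orders add column discounted_amount double
 generated always as (attributes->'$.price' * attributes->'$.quantity' * 0.9) virtual;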


To sum up, the extended stream computing solution developed based on the IMCI feature of PolarDB has both the flexibility of traditional databases and the efficiency of traditional data warehouses. This solution is more suitable if you need to analyze large amounts of semi-structured data.

Customer use cases

Streaming platform

A streaming platform is one of the most popular online video platforms in China. The platform provides content such as movies, TV dramas, variety shows, and live streaming. Users can subscribe as paid subscribers for a better viewing experience.

With a large subscriber base, the amount of transaction data grows rapidly every day. The transaction data of subscribers is separately stored in real-time tables. The data is used to implement features such as business compensation and verification, real-time monitoring of order status, automatic order management, and automatic processing of subscriber benefits.


The original database system uses sharding in MySQL to process the transaction data of subscribers. Databases are deployed in a MySQL cluster whose architecture consists of one primary node and multiple read-only nodes. As the amount of transaction data expands, the number of partitions and read-only nodes is increased to scale out the MySQL cluster. Data is synchronized between nodes by using binary logs. To cope with rapid iteration, a large amount of semi-structured data such as JSON-formatted data is generated and processed in the database system. With the rapid development of business, high-concurrency queries are performed on JSON-formatted data in large tables. To process such queries, the number of databases and partitions is continuously increased in the existing database architecture. This results in complex O&M operations and higher O&M costs, and may compromise the query performance.

In this case, the streaming platform needs a new database architecture with simplified O&M operations to reduce costs and increase efficiency. With the help of Alibaba Cloud, the new database architecture adopts the all-in-one hybrid transactional/analytical processing (HTAP) solution of PolarDB. The solution takes full advantage of extended stream computing and column-oriented storage to improve the analysis performance of large amounts of JSON-formatted data and effectively simplify O&M operations.


Financial e-commerce platform

A financial e-commerce platform is run by a financial technology company headquartered in Singapore and established in 2016. This platform provides consumer financial services in Southeast Asia. The provided services include installment payments, credit loans, virtual credit cards, and e-commerce services. The platform has millions of users in Southeast Asian countries such as Indonesia, the Philippines, Malaysia, Thailand, and Vietnam, and has become one of the leading consumer financial service providers in Southeast Asia.

The order business systems of the platform can be divided into the transaction processing and data analysis systems. The transaction processing system uses JSON-formatted data to store various business attributes, whereas the data analysis system needs to compute and analyze a large number of business attributes.


In the original database system, a MySQL cluster is deployed to process business transactions, and a ClickHouse cluster is deployed to analyze business data. Apache Flink is used to subscribe to MySQL data streams, convert JSON-formatted data into structured data to construct large wide tables, and then write the tables to the ClickHouse cluster in real time. Due to the complexity of this architecture, issues such as data loss during data processing, delayed data analysis, and unsatisfactory query performance may occur. If the attributes of JSON-formatted data are added, removed, or modified based on business requirements, the processing logic of Flink and the schemas of ClickHouse tables must also be modified. However, ClickHouse does not support the instant DDL feature, which extends the time required to modify large tables and affects query performance. As the business develops, the architects of the platform are in urgent need of a new database architecture that is characterized by system stability, simplified O&M operations, high analysis performance, and cost savings.

The cloud-native HTAP solution of PolarDB that integrates real-time transaction processing and real-time data analysis is the best choice. Compared with the original database system that integrates multiple systems such as MySQL, Data Transmission Service (DTS), Apache Flink, and ClickHouse, the new database system has a simple architecture based only on PolarDB. In the new database architecture, the extended stream computing solution developed based on the IMCI feature of PolarDB is used instead of Apache Flink to automatically convert semi-structured data into structured data. In addition, the instant DDL feature is used to simplify O&M operations, the extended maximum number of columns in a table supports rapid business development, and column-oriented storage improves the analysis performance.