
PolarDB: Efficiently analyze JSON-formatted data

Last Updated: Mar 28, 2026

Analyzing large JSON datasets in MySQL-compatible systems has historically required a trade-off: either accept slow real-time parsing in traditional databases, or run extract, transform, and load (ETL) pipelines into a data warehouse and pay the cost of schema rigidity. PolarDB's In-Memory Column Index (IMCI) eliminates this trade-off — you store JSON as-is and query it with column-oriented performance, adding structure incrementally as query patterns stabilize.

This topic covers the IMCI capabilities that enable JSON analysis: column-oriented JSON storage, virtual columns, instant DDL, and the extended column limit. It also includes performance benchmarks and two customer migration case studies.

How it works

IMCI stores JSON columns in a binary JSON format and uses RapidJSON for parsing. During queries, only the accessed JSON paths are read. Combined with compression, single instruction multiple data (SIMD), vectorization, and parallelism, this approach cuts I/O compared to parsing full JSON rows.

IMCI is fully compatible with MySQL syntax. You query JSON columns using standard MySQL JSON path expressions — no schema changes required to start.
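
For example, a direct JSON-path query needs no preparation at all. This sketch assumes a hypothetical orders table with a JSON column named attributes:

```sql
-- Standard MySQL JSON path syntax works as-is on the column store;
-- ->> is shorthand for JSON_UNQUOTE(JSON_EXTRACT(...)).
SELECT count(*)
FROM orders
WHERE attributes->>"$.status" = 'shipped';
```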

The progressive path:

Stage | Approach | When to use
--- | --- | ---
Start | Query JSON columns directly with JSON functions | Data is new, schema is still evolving, or the field is queried infrequently
Optimize | Create virtual columns on high-frequency JSON fields | Query patterns have stabilized and specific fields are queried often
Manage | Add or remove virtual columns with instant DDL | Business requirements change; no ETL republishing needed

IMCI's query optimizer automatically matches the JSON functions and JSON columns in a query against existing virtual column definitions. When a match is found, the virtual column is preferentially used instead of parsing the JSON, so adding virtual columns later does not require rewriting application queries.
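
As an illustration, consider the produce table and its delivery_volume virtual column used in the examples later in this topic. Both of the following queries can be served by the virtual column; the optimizer recognizes that the JSON expression in the first form matches the column definition:

```sql
-- Both forms can hit the delivery_volume virtual column, whether the query
-- spells out the JSON expression or names the column directly.
SELECT count(*) FROM produce
WHERE attributes->"$.delivery.width"
    * attributes->"$.delivery.height"
    * attributes->"$.delivery.weight" > 1000;

SELECT count(*) FROM produce WHERE delivery_volume > 1000;
```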

Choose your approach

Not every JSON field needs a virtual column. The decision depends on how often a field is queried:

Scenario | Recommended approach
--- | ---
Schema is still evolving | Query directly with column-oriented JSON functions. No DDL needed.
Specific fields queried ad hoc or infrequently | Query with JSON functions. IMCI handles both paths efficiently.
Specific fields queried frequently in production | Create virtual columns on those fields.
JSON field involves a complex computed expression | Create a virtual column that pre-computes the value.

When to avoid virtual columns: If a JSON attribute changes frequently or is only ever retrieved as a whole document, the overhead of maintaining a virtual column provides little benefit.

Key features

Column-oriented storage for JSON

IMCI stores JSON data by column, not by row. During queries, only the relevant JSON paths are scanned. This eliminates the full-row parsing overhead that limits performance in traditional row-oriented databases.

When to use: Any JSON column that is analyzed, filtered, or aggregated. No extra configuration is needed — enable IMCI on the table with comment='columnar=1'.
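
For example, a minimal version of a table with IMCI enabled might look like this (the column list is illustrative):

```sql
-- Illustrative table with IMCI enabled through the columnar comment;
-- the JSON column needs no further configuration to be analyzed.
CREATE TABLE produce (
  id bigint PRIMARY KEY,
  attributes json
) COMMENT 'columnar=1';
```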

Virtual columns

A virtual column extracts a specific JSON field as a named, typed column that IMCI persists in the column store. Queries on virtual columns skip JSON parsing entirely and use IMCI's pruner mechanism for efficient filtering.

IMCI supports two types of generated columns:

  • Virtual generated columns (recommended): values are persisted in the column store only. Row-level reads compute the value on the fly. Saves disk space and improves column-oriented performance.

  • Stored generated columns: values are persisted in both stores, but use more disk space.

Use virtual generated columns unless row-level access to the derived value is critical.

Syntax:

col_name data_type [GENERATED ALWAYS] AS (expr)
  [VIRTUAL | STORED] [NOT NULL | NULL]
  [UNIQUE [KEY]] [[PRIMARY] KEY]
  [COMMENT 'string']
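
An instantiation of this syntax, assuming the produce table used elsewhere in this topic (the column names delivery_weight and delivery_weight_s are hypothetical):

```sql
-- Virtual generated column (recommended): persisted in the column store only;
-- row-level reads compute the value on the fly.
ALTER TABLE produce ADD COLUMN delivery_weight DOUBLE
  GENERATED ALWAYS AS (attributes->"$.delivery.weight") VIRTUAL;

-- Stored generated column: persisted in both stores; uses more disk space.
ALTER TABLE produce ADD COLUMN delivery_weight_s DOUBLE
  GENERATED ALWAYS AS (attributes->"$.delivery.weight") STORED;
```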

Instant DDL

Adding or dropping virtual columns takes effect immediately without blocking reads or writes. This makes schema changes safe to run in production as query patterns evolve, and eliminates the need to republish ETL pipelines when JSON attributes change.

-- Add a virtual column
ALTER TABLE produce ADD COLUMN delivery_volume DOUBLE
  AS (attributes->"$.delivery.width" * attributes->"$.delivery.height" * attributes->"$.delivery.weight");

-- Remove a virtual column
ALTER TABLE produce DROP COLUMN delivery_volume;

Extended column limit

InnoDB supports up to 1,017 columns per table. IMCI extends this limit to 4,089 columns for column-oriented tables, accommodating wide tables built from many JSON attributes.

Performance benchmarks

The following benchmarks use test data for reference. Actual performance varies by dataset, query pattern, and hardware configuration.

Column-oriented JSON storage

Table: produce with tens of millions of rows. Query filters on three nested JSON paths.

SELECT count(*)
FROM produce
WHERE attributes->"$.delivery.width" > 10
  AND attributes->"$.delivery.height" > 10
  AND attributes->"$.delivery.weight" > 10;

Storage model | Duration
--- | ---
Row-oriented storage | 9.29 seconds
Column-oriented storage with IMCI (32 cores) | 0.14 seconds

Virtual columns

The same produce table with a delivery_volume virtual column that pre-computes the product of three JSON fields.

Querying the raw JSON fields:

Storage model | Duration
--- | ---
Row-oriented storage | 13.43 seconds
Column-oriented storage with IMCI (1 core) | 5.72 seconds
Column-oriented storage with IMCI (32 cores) | 0.24 seconds

Querying the virtual column directly:

SELECT count(*) FROM produce WHERE delivery_volume > 1000;

Storage model | Duration
--- | ---
Row-oriented storage | 14.30 seconds
Column-oriented storage with IMCI (1 core) | 0.03 seconds
Column-oriented storage with IMCI (32 cores) | 0.01 seconds

Querying the virtual column directly is significantly faster than querying the raw JSON field, because IMCI skips JSON parsing and applies its pruner mechanism to pre-computed typed values.

Real-time data analysis example

This example demonstrates IMCI on GitHub event data from July 2023, sourced from GH Archive. The schema follows GitHub event types, and the analytical queries are adapted from Everything You Always Wanted To Know About GitHub.

Step 1: Set up the table

CREATE TABLE github_events (
  id bigint,
  type varchar(16),
  public bit,
  payload json,
  repo json,
  actor json,
  org json,
  created_at datetime
);

Download and load the hourly event files for July 2023, then parse and insert them into github_events.
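
One possible loading approach is to stage each decompressed hourly file as raw JSON lines and then project the fields into github_events. This is a sketch: the file path and staging table name are hypothetical, and the .json.gz files from GH Archive must be decompressed first.

```sql
-- Stage one decompressed hourly file as raw JSON documents.
CREATE TABLE gh_staging (doc json);

LOAD DATA LOCAL INFILE '/data/2023-07-01-0.json'
  INTO TABLE gh_staging
  LINES TERMINATED BY '\n' (doc);

-- Project the JSON fields into the target schema.
INSERT INTO github_events
SELECT doc->>"$.id", doc->>"$.type", doc->>"$.public" = 'true',
       doc->"$.payload", doc->"$.repo", doc->"$.actor", doc->"$.org",
       STR_TO_DATE(doc->>"$.created_at", '%Y-%m-%dT%H:%i:%sZ')
FROM gh_staging;
```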

Step 2: Identify high-frequency fields and create virtual columns

The two analytical queries below access actor.login, repo.name, and payload.pull_request.base.repo.language. These are good candidates for virtual columns. Enable IMCI and add all three virtual columns in a single statement:

ALTER TABLE github_events
  ADD COLUMN actor_login varchar(256)
    GENERATED ALWAYS AS (json_unquote(json_extract(`actor`, '$.login'))) VIRTUAL,
  ADD COLUMN repo_name varchar(256)
    GENERATED ALWAYS AS (json_unquote(json_extract(`repo`, '$.name'))) VIRTUAL,
  ADD COLUMN repo_language varchar(32)
    GENERATED ALWAYS AS (json_unquote(json_extract(`payload`, '$.pull_request.base.repo.language'))) VIRTUAL,
  COMMENT 'columnar=1';

Step 3: Run analytical queries

Query 1: Top programming languages (July 25–31, 2023)

SELECT
  repo_language AS language,
  count(*) AS total
FROM github_events
WHERE created_at >= '2023-07-25 00:00:00'
  AND created_at <= '2023-07-31 23:59:59'
  AND repo_language IS NOT NULL
GROUP BY repo_language
ORDER BY total DESC
LIMIT 10;

Query 2: Linux repository recommendations (July 31, 2023)

SELECT repo_name, count(*) AS stars
FROM github_events
WHERE type = 'WatchEvent'
  AND actor_login IN (
    SELECT actor_login
    FROM github_events
    WHERE type = 'WatchEvent'
      AND repo_name IN ('torvalds/linux')
      AND created_at >= '2023-07-31 00:00:00'
      AND created_at <= '2023-07-31 23:59:59'
  )
  AND repo_name NOT IN ('torvalds/linux')
  AND created_at >= '2023-07-31 00:00:00'
  AND created_at <= '2023-07-31 23:59:59'
GROUP BY repo_name
ORDER BY stars DESC
LIMIT 10;

Results (cache: 500 GB row-oriented, 128 GB column-oriented):

Storage model | Query 1 | Query 2
--- | --- | ---
Row-oriented storage | 45.01 seconds | 8.57 seconds
Column-oriented storage with IMCI (1 core) | 0.79 seconds | 0.54 seconds
Column-oriented storage with IMCI (32 cores) | 0.04 seconds | 0.07 seconds

Extended stream computing

Extended stream computing is IMCI's automated pipeline for continuously processing JSON data streams without ETL jobs.

Define transformation logic as virtual column expressions when creating or altering a table. IMCI computes and persists the derived values in real time as new rows arrive. Queries read from the column store directly, skipping JSON parsing.
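
A minimal sketch of such a pipeline, assuming a hypothetical events table: the transformation is declared once as virtual column expressions, and IMCI maintains the derived values as rows stream in.

```sql
-- Transformation logic lives in the table definition; derived values are
-- computed and persisted in the column store as new rows arrive.
-- Table and field names here are illustrative.
CREATE TABLE events (
  id bigint,
  payload json,
  event_type varchar(32)
    GENERATED ALWAYS AS (json_unquote(json_extract(payload, '$.type'))) VIRTUAL,
  amount double
    GENERATED ALWAYS AS (payload->"$.amount") VIRTUAL
) COMMENT 'columnar=1';
```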


When business requirements change:

  • Add a new JSON field to track: run an ALTER TABLE ... ADD COLUMN with instant DDL. The new virtual column is computed and stored going forward.

  • Remove a field: run ALTER TABLE ... DROP COLUMN. Both operations complete without blocking reads or writes.

  • Field queried infrequently: skip creating a virtual column and query with JSON functions directly. IMCI handles both paths efficiently.

This replaces the ETL republish cycle required in traditional data warehouse architectures.

Customer use cases

Streaming platform

One of the most popular online video platforms in China — offering movies, TV dramas, variety shows, and live streaming — manages subscriber transaction data to power business compensation, order status monitoring, and subscriber benefit processing.

Original architecture: MySQL cluster with sharding (one primary node, multiple read-only nodes). As subscriber data grew, the team added partitions and read-only nodes to handle high-concurrency queries on JSON columns in large tables. This increased operations and maintenance (O&M) complexity without resolving the underlying performance bottleneck.


New architecture: PolarDB all-in-one hybrid transactional/analytical processing (HTAP) solution. Extended stream computing processes JSON data in real time, column-oriented storage accelerates analytical queries, and a single PolarDB cluster replaces the fragmented MySQL shards.


Financial e-commerce platform

A fintech company headquartered in Singapore, founded in 2016, provides consumer financial services — installment payments, credit loans, virtual credit cards, and e-commerce — to millions of users across Indonesia, the Philippines, Malaysia, Thailand, and Vietnam.

Original architecture: MySQL for transaction processing, ClickHouse for analytics, Apache Flink for streaming JSON-to-structured conversion, and Data Transmission Service (DTS) for data synchronization. This stack produced data loss, analysis delays, and unsatisfactory query performance. Adding or removing JSON attributes required modifying both Flink logic and ClickHouse table schemas — a slow process because ClickHouse does not support instant DDL.


New architecture: PolarDB cloud-native HTAP solution. IMCI's extended stream computing replaces Apache Flink for JSON transformation. Instant DDL handles schema changes without downtime. The extended 4,089-column limit supports the wide tables required for analytical workloads. The entire multi-system stack — MySQL, DTS, Flink, and ClickHouse — consolidates into a single PolarDB cluster.