When e-commerce companies build data warehouses, they often need to calculate rolling-window metrics — such as unique visitors (UV), buyer counts, and regular buyer counts over the past 30 days. These metrics are calculated against data accumulated over a long period, and naive rolling-window queries become prohibitively slow or fail entirely at scale.
The root cause is always the same: too much data is recomputed from scratch on every run. The two optimization approaches in this topic share a common principle — avoid full recomputation by pre-aggregating or incrementally accumulating data — and differ only in how far they take that principle.
All code samples use scheduling variables from DataWorks (e.g., ${bdp.system.bizdate}). The samples apply only to scheduling nodes in DataWorks.
The problem with rolling-window queries
A typical 30-day UV query looks like this:
SELECT item_id,
COUNT(DISTINCT visitor_id) AS ipv_uv_1d_001
FROM vistor_item_detail_log
WHERE ds <= '${bdp.system.bizdate}'
  AND ds >= TO_CHAR(DATEADD(TO_DATE('${bdp.system.bizdate}','yyyymmdd'),-29,'dd'),'yyyymmdd')
GROUP BY item_id;
This query scans 30 partitions of raw log data on every run. If your log volume is large, MaxCompute may need to spawn more than 99,999 map tasks — at which point the job fails. Even below that limit, scanning 30 days of raw logs every day is wasteful: most of the data has not changed since the previous run.
Choose an approach
Two optimization patterns address this problem. Pick based on how your data and queries are structured:
| Approach | How it works | Use when |
|---|---|---|
| Intermediate table | Deduplicate and aggregate raw logs daily into a summary table; query the summary table instead of raw logs | Daily log volume is large; data per partition changes significantly between runs |
| Incremental accumulation | Merge all historical partitions into a single partition and append new data daily | You query the same rolling window repeatedly; reducing multi-partition scans matters more than daily deduplication |
For most e-commerce workloads, start with the intermediate table approach. Move to incremental accumulation if multi-partition scans remain the bottleneck after the intermediate table is in place.
Approach 1: Intermediate table
How it works
Instead of querying 30 raw log partitions, pre-aggregate each day's logs into a summary table. Each row in the summary table represents a unique (item_id, visitor_id) pair for that day. Querying 30 partitions of pre-aggregated data is far cheaper than querying 30 partitions of raw logs, because the duplicate (item_id, visitor_id) rows have already been collapsed before the rolling-window query runs.
Step 1: Create the daily summary
Run this as a daily scheduled node in DataWorks, before your metrics nodes:
INSERT OVERWRITE TABLE mds_itm_vsr_xx PARTITION (ds='${bdp.system.bizdate}')
SELECT item_id,
       visitor_id,
       COUNT(1) AS pv
FROM vistor_item_detail_log
WHERE ds = '${bdp.system.bizdate}'
GROUP BY item_id, visitor_id;
This writes one row per (item_id, visitor_id) pair into mds_itm_vsr_xx for today's date partition, with pv recording how many times that visitor viewed that item during the day.
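If mds_itm_vsr_xx does not already exist, a minimal definition along the following lines is enough for the samples in this topic; the column types are assumptions and should be adjusted to match your log schema:
CREATE TABLE IF NOT EXISTS mds_itm_vsr_xx
(
    item_id    STRING,  -- assumed type; match vistor_item_detail_log
    visitor_id STRING,  -- assumed type
    pv         BIGINT   -- page views for this (item_id, visitor_id) pair on that day
)
PARTITIONED BY (ds STRING);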
Step 2: Query from the summary table
With the summary table in place, your rolling-window query scans far less data:
SELECT item_id,
COUNT(DISTINCT visitor_id) AS uv,
SUM(pv) AS pv
FROM mds_itm_vsr_xx
WHERE ds <= '${bdp.system.bizdate}'
  AND ds >= TO_CHAR(DATEADD(TO_DATE('${bdp.system.bizdate}','yyyymmdd'),-29,'dd'),'yyyymmdd')
GROUP BY item_id;
This approach requires a dedicated daily node to populate mds_itm_vsr_xx. Configure that node as an upstream dependency of your metrics nodes in DataWorks so that the summary table is always up to date before downstream queries run.
Approach 2: Incremental accumulation
How it works
The intermediate table approach still scans 30 partitions every run. Incremental accumulation goes further: merge all historical data into a single accumulation partition and append only new data each day. Rolling-window queries then read from just one partition.
This is most effective when the percentage of data that changes daily is small relative to the total historical dataset — that is, when most of the data in the accumulation partition is stable across runs.
Implementation: regular buyer dimension table
The regular buyer use case — buyers who purchased within the last 30 days — is a good example. A naive approach scans 30 partitions of billing logs on every run:
SELECT item_id,
buyer_id AS old_buyer_id
FROM buyer_item_detail_log
WHERE ds < '${bdp.system.bizdate}'
  AND ds >= TO_CHAR(DATEADD(TO_DATE('${bdp.system.bizdate}','yyyymmdd'),-29,'dd'),'yyyymmdd')
GROUP BY item_id,
buyer_id;
Instead, maintain a dimension table where each row represents the relationship between a buyer and an item, recording fields such as first purchase time, last purchase time, total purchased items, and total spend. Update it daily with the previous day's billing logs.
To determine whether a buyer is a regular buyer, check whether their last_purchase_time falls within the last 30 days. This replaces a 30-partition scan of billing logs with a single-partition lookup against the dimension table, eliminating the full-scan deduplication on every run.
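A minimal sketch of this pattern is shown below, assuming a dimension table named dim_buyer_item_xx, purchase dates stored as yyyymmdd strings, and a pay_amount column in the billing log; these names and types are illustrative assumptions, not part of the original schema. The first statement merges yesterday's accumulation partition with the latest day's billing rows; the second reads regular buyers from the single merged partition.
-- Daily merge: combine yesterday's accumulation partition with the latest
-- day's billing rows. dim_buyer_item_xx, pay_amount, and the yyyymmdd string
-- purchase dates are illustrative assumptions.
INSERT OVERWRITE TABLE dim_buyer_item_xx PARTITION (ds='${bdp.system.bizdate}')
SELECT COALESCE(n.item_id, o.item_id)                     AS item_id,
       COALESCE(n.buyer_id, o.buyer_id)                   AS buyer_id,
       COALESCE(o.first_purchase_date, n.purchase_date)   AS first_purchase_date,
       COALESCE(n.purchase_date, o.last_purchase_date)    AS last_purchase_date,
       COALESCE(o.total_items, 0) + COALESCE(n.items, 0)  AS total_items,
       COALESCE(o.total_spend, 0) + COALESCE(n.spend, 0)  AS total_spend
FROM (
    -- Yesterday's accumulation partition (the full history so far).
    SELECT item_id, buyer_id, first_purchase_date, last_purchase_date,
           total_items, total_spend
    FROM dim_buyer_item_xx
    WHERE ds = TO_CHAR(DATEADD(TO_DATE('${bdp.system.bizdate}','yyyymmdd'),-1,'dd'),'yyyymmdd')
) o
FULL OUTER JOIN (
    -- The latest day's increment from the billing log.
    SELECT item_id,
           buyer_id,
           '${bdp.system.bizdate}' AS purchase_date,
           COUNT(1)                AS items,
           SUM(pay_amount)         AS spend
    FROM buyer_item_detail_log
    WHERE ds = '${bdp.system.bizdate}'
    GROUP BY item_id, buyer_id
) n
ON  o.item_id  = n.item_id
AND o.buyer_id = n.buyer_id;

-- Regular-buyer lookup: a single-partition read replaces the 30-partition scan.
SELECT item_id,
       buyer_id AS old_buyer_id
FROM dim_buyer_item_xx
WHERE ds = '${bdp.system.bizdate}'
  AND last_purchase_date >= TO_CHAR(DATEADD(TO_DATE('${bdp.system.bizdate}','yyyymmdd'),-29,'dd'),'yyyymmdd');
The FULL OUTER JOIN keeps pairs that exist only in the history, only in the new day, or in both, so no purchase relationship is lost across runs.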
Performance considerations
| Factor | Intermediate table | Incremental accumulation |
|---|---|---|
| Partitions scanned per run | 30 (pre-aggregated, far fewer rows per partition) | 1 |
| Implementation complexity | Low — one extra daily node | Higher — merge logic and dependency management |
| Data freshness | Daily | Daily |
| Best fit | Large daily log volume with high-cardinality keys | Repeated rolling-window queries on stable historical data |
Both approaches shift expensive computation from query time to ingestion time. The intermediate table does this per day; incremental accumulation does this across all historical days.