When e-commerce companies build data warehouses, they often need to calculate rolling-window metrics — such as unique visitors (UV), buyer counts, and regular buyer counts over the past 30 days. These metrics are calculated against data accumulated over a long period, and naive rolling-window queries become prohibitively slow or fail entirely at scale.
The root cause is always the same: too much data is recomputed from scratch on every run. The two optimization approaches in this topic share a common principle — avoid full recomputation by pre-aggregating or incrementally accumulating data — and differ only in how far they take that principle.
All code samples use scheduling variables from DataWorks (e.g., ${bdp.system.bizdate}). The samples apply only to scheduling nodes in DataWorks.
The problem with rolling-window queries
A typical 30-day UV query looks like this:
SELECT item_id,
COUNT(DISTINCT visitor_id) AS ipv_uv_1d_001
FROM vistor_item_detail_log
WHERE ds <= '${bdp.system.bizdate}'
  AND ds >= TO_CHAR(DATEADD(TO_DATE('${bdp.system.bizdate}','yyyymmdd'),-29,'dd'),'yyyymmdd')
GROUP BY item_id;
This query scans 30 partitions of raw log data on every run. If your log volume is large, MaxCompute may need to spawn more than 99,999 map tasks — at which point the job fails. Even below that limit, scanning 30 days of raw logs every day is wasteful: most of the data has not changed since the previous run.
Choose an approach
Two optimization patterns address this problem. Pick based on how your data and queries are structured:
| Approach | How it works | Use when |
|---|---|---|
| Intermediate table | Deduplicate and aggregate raw logs daily into a summary table; query the summary table instead of raw logs | Daily log volume is large; data per partition changes significantly between runs |
| Incremental accumulation | Merge all historical partitions into a single partition and append new data daily | You query the same rolling window repeatedly; reducing multi-partition scans matters more than daily deduplication |
For most e-commerce workloads, start with the intermediate table approach. Move to incremental accumulation if multi-partition scans remain the bottleneck after the intermediate table is in place.
Approach 1: Intermediate table
How it works
Instead of querying 30 raw log partitions, pre-aggregate each day's logs into a summary table. Each row in the summary table represents a unique (item_id, visitor_id) pair for that day. Querying 30 partitions of pre-aggregated data is far cheaper than querying 30 partitions of raw logs, because the duplicate (item_id, visitor_id) rows have already been collapsed before the rolling-window query runs.
Step 1: Create the daily summary
Run this as a daily scheduled node in DataWorks, before your metrics nodes:
INSERT OVERWRITE TABLE mds_itm_vsr_xx PARTITION (ds='${bdp.system.bizdate}')
SELECT item_id,
       visitor_id,
       COUNT(1) AS pv
FROM vistor_item_detail_log
WHERE ds = '${bdp.system.bizdate}'
GROUP BY item_id, visitor_id;
This writes one row per (item_id, visitor_id) pair into mds_itm_vsr_xx for today's date partition, with pv recording how many times that visitor viewed that item during the day.
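If mds_itm_vsr_xx does not already exist, a minimal definition along the following lines is enough for the samples in this topic; the column types are assumptions and should be adjusted to match your log schema:
CREATE TABLE IF NOT EXISTS mds_itm_vsr_xx
(
    item_id    STRING,  -- assumed type; match vistor_item_detail_log
    visitor_id STRING,  -- assumed type
    pv         BIGINT   -- page views for this (item_id, visitor_id) pair on that day
)
PARTITIONED BY (ds STRING);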
Step 2: Query from the summary table
With the summary table in place, your rolling-window query scans far less data:
SELECT item_id,
COUNT(DISTINCT visitor_id) AS uv,
SUM(pv) AS pv
FROM mds_itm_vsr_xx
WHERE ds <= '${bdp.system.bizdate}'
  AND ds >= TO_CHAR(DATEADD(TO_DATE('${bdp.system.bizdate}','yyyymmdd'),-29,'dd'),'yyyymmdd')
GROUP BY item_id;
This approach requires a dedicated daily node to populate mds_itm_vsr_xx. Configure that node as an upstream dependency of your metrics nodes in DataWorks so that the summary table is always up to date before downstream queries run.
Approach 2: Incremental accumulation
How it works
The intermediate table approach still scans 30 partitions every run. Incremental accumulation goes further: merge all historical data into a single accumulation partition and append only new data each day. Rolling-window queries then read from just one partition.
This is most effective when the percentage of data that changes daily is small relative to the total historical dataset — that is, when most of the data in the accumulation partition is stable across runs.
Implementation: regular buyer dimension table
The regular buyer use case — buyers who purchased within the last 30 days — is a good example. A naive approach scans 30 partitions of billing logs on every run:
SELECT item_id,
buyer_id AS old_buyer_id
FROM buyer_item_detail_log
WHERE ds < '${bdp.system.bizdate}'
  AND ds >= TO_CHAR(DATEADD(TO_DATE('${bdp.system.bizdate}','yyyymmdd'),-29,'dd'),'yyyymmdd')
GROUP BY item_id,
buyer_id;
Instead, maintain a dimension table where each row represents the relationship between a buyer and an item, recording fields such as first purchase time, last purchase time, total purchased items, and total spend. Update it daily with the previous day's billing logs.
To determine whether a buyer is a regular buyer, check whether their last_purchase_time falls within the last 30 days. This replaces a 30-partition scan of billing logs with a single-partition lookup against the dimension table, eliminating the full-scan deduplication on every run.
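A minimal sketch of this pattern is shown below, assuming a dimension table named dim_buyer_item_xx, purchase dates stored as yyyymmdd strings, and a pay_amount column in the billing log; these names and types are illustrative assumptions, not part of the original schema. The first statement merges yesterday's accumulation partition with the latest day's billing rows; the second reads regular buyers from the single merged partition.
-- Daily merge: combine yesterday's accumulation partition with the latest
-- day's billing rows. dim_buyer_item_xx, pay_amount, and the yyyymmdd string
-- purchase dates are illustrative assumptions.
INSERT OVERWRITE TABLE dim_buyer_item_xx PARTITION (ds='${bdp.system.bizdate}')
SELECT COALESCE(n.item_id, o.item_id)                     AS item_id,
       COALESCE(n.buyer_id, o.buyer_id)                   AS buyer_id,
       COALESCE(o.first_purchase_date, n.purchase_date)   AS first_purchase_date,
       COALESCE(n.purchase_date, o.last_purchase_date)    AS last_purchase_date,
       COALESCE(o.total_items, 0) + COALESCE(n.items, 0)  AS total_items,
       COALESCE(o.total_spend, 0) + COALESCE(n.spend, 0)  AS total_spend
FROM (
    -- Yesterday's accumulation partition (the full history so far).
    SELECT item_id, buyer_id, first_purchase_date, last_purchase_date,
           total_items, total_spend
    FROM dim_buyer_item_xx
    WHERE ds = TO_CHAR(DATEADD(TO_DATE('${bdp.system.bizdate}','yyyymmdd'),-1,'dd'),'yyyymmdd')
) o
FULL OUTER JOIN (
    -- The latest day's increment from the billing log.
    SELECT item_id,
           buyer_id,
           '${bdp.system.bizdate}' AS purchase_date,
           COUNT(1)                AS items,
           SUM(pay_amount)         AS spend
    FROM buyer_item_detail_log
    WHERE ds = '${bdp.system.bizdate}'
    GROUP BY item_id, buyer_id
) n
ON  o.item_id  = n.item_id
AND o.buyer_id = n.buyer_id;

-- Regular-buyer lookup: a single-partition read replaces the 30-partition scan.
SELECT item_id,
       buyer_id AS old_buyer_id
FROM dim_buyer_item_xx
WHERE ds = '${bdp.system.bizdate}'
  AND last_purchase_date >= TO_CHAR(DATEADD(TO_DATE('${bdp.system.bizdate}','yyyymmdd'),-29,'dd'),'yyyymmdd');
The FULL OUTER JOIN keeps pairs that exist only in the history, only in the new day, or in both, so no purchase relationship is lost across runs.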
Performance considerations
| Factor | Intermediate table | Incremental accumulation |
|---|---|---|
| Partitions scanned per run | 30 (pre-aggregated, far fewer rows per partition) | 1 |
| Implementation complexity | Low — one extra daily node | Higher — merge logic and dependency management |
| Data freshness | Daily | Daily |
| Best fit | Large daily log volume with high-cardinality keys | Repeated rolling-window queries on stable historical data |
Both approaches shift expensive computation from query time to ingestion time. The intermediate table does this per day; incremental accumulation does this across all historical days.