Diagnose and resolve job data expansion - MaxCompute

Data bloat occurs when a Fuxi task's output data volume is far larger than its input — for example, 1 GB in and 1 TB out. Processing that volume on a single instance significantly reduces data processing efficiency.

To identify data bloat, check the I/O Record and I/O Bytes attributes of Fuxi tasks on the Logview page.

Three patterns cause data bloat: a bug in the code, improper aggregation operations, and improper JOIN operations. The sections below explain how to recognize each pattern and fix it.

Fix bugs that produce a Cartesian product or invalid UDTF output

How to recognize it: Check whether your query contains a JOIN without a valid equality condition, or a user-defined table-valued function (UDTF) that emits far more rows than expected. If output volume is orders of magnitude larger than input and the query logic appears straightforward — no complex grouping, no multi-row dimension table — a code bug is the likely cause.

Why it happens:

The JOIN condition is incorrect and evaluates as a Cartesian product. Every row from the left table is paired with every row from the right table, causing the output to be much greater than the input.
A UDTF is invalid, causing it to emit far more rows than it should.

Fix: Review the JOIN condition and UDTF logic and correct the bug. The following example shows a common accidental Cartesian product and the corrected version:

-- Incorrect: missing join condition produces a Cartesian product
SELECT l.user_id, r.label
FROM user_data l
JOIN label_table r;

-- Correct: explicit equality condition
SELECT l.user_id, r.label
FROM user_data l
JOIN label_table r ON l.category_id = r.category_id;

Avoid aggregation operations that retain all intermediate results

How to recognize it: A query uses collect_list, median, DISTINCT across multiple dimensions, GROUPING SETS, CUBE, or ROLLUP, and output is much larger than input. If removing or replacing one of these operations brings output back to a normal size, the aggregation pattern is the cause.

Why it happens:

Most aggregation operations are recursive — they merge intermediate results as they go, so intermediate data stays small regardless of input size. A few operations break this pattern:

collect_list and median: These operations must retain all intermediate result data, not just a running aggregate. Combining them with other aggregation operations can cause significant data expansion.
SELECT with DISTINCT across dimensions: Each deduplication pass expands the data before filtering it down, and the expansion compounds across passes.
GROUPING SETS, CUBE, and ROLLUP: These operations generate intermediate result data for every grouping combination. Intermediate size can expand to many times the original data size.

Fix: Avoid combining collect_list or median with other aggregations where possible. Review queries that use GROUPING SETS, CUBE, or ROLLUP to confirm that the intermediate data volume is acceptable.

Aggregate right-table rows before performing a JOIN

How to recognize it: The query performs a JOIN on a low-cardinality key (such as gender or region), and the right table has many rows per key value. Check the number of rows per key value in the right table: if it is in the hundreds or more, this pattern is likely the cause.

Why it happens:

When you join on a low-cardinality key, each row in the left table is matched against every row in the right table that shares the same key value. For example:

Left table: a large amount of population data, joined on gender
Right table (dimension table): hundreds of rows per gender value

Each left-table row matches hundreds of right-table rows. The left table data may expand to hundreds of times its original size.

Fix: Aggregate the rows in the right table before performing the JOIN. Reduce the right table to one row per join key value first, then join.

-- Before: JOIN on raw dimension table (causes data bloat)
SELECT l.*, r.category
FROM population_data l
JOIN dimension_table r ON l.gender = r.gender;

-- After: aggregate right table first, then JOIN
SELECT l.*, r.category
FROM population_data l
JOIN (
    SELECT gender, MAX(category) AS category
    FROM dimension_table
    GROUP BY gender
) r ON l.gender = r.gender;