Use WITH to write recursive and data-modifying common table expressions - PolarDB - Alibaba Cloud - PolarDB

WITH provides a way to write auxiliary statements for use in a larger query. These statements — called Common Table Expressions (CTEs) — define temporary tables that exist for the duration of a single query. Each auxiliary statement in a WITH clause can be a SELECT, INSERT, UPDATE, or DELETE, and the WITH clause can be attached to a primary statement that is also a SELECT, INSERT, UPDATE, or DELETE.

SELECT in WITH

The primary use of SELECT in WITH is to break a complicated query into simpler, named parts. For example:

WITH regional_sales AS (
    SELECT region, SUM(amount) AS total_sales
    FROM orders
    GROUP BY region
), top_regions AS (
    SELECT region
    FROM regional_sales
    WHERE total_sales > (SELECT SUM(total_sales)/10 FROM regional_sales)
)
SELECT region,
       product,
       SUM(quantity) AS product_units,
       SUM(amount) AS product_sales
FROM orders
WHERE region IN (SELECT region FROM top_regions)
GROUP BY region, product;

This query displays per-product sales totals for only the top sales regions. The WITH clause defines two CTEs — regional_sales and top_regions — where top_regions consumes the output of regional_sales, and the primary SELECT consumes the output of top_regions. Without WITH, this would require two levels of nested subqueries.

Recursive queries

The optional RECURSIVE modifier enables queries that reference their own output — something not possible in standard SQL. A simple example sums the integers from 1 through 100:

WITH RECURSIVE t(n) AS (
    VALUES (1)
  UNION ALL
    SELECT n+1 FROM t WHERE n < 100
)
SELECT sum(n) FROM t;

Every recursive WITH query has this general form:

non-recursive term
UNION [ALL]
recursive term

Only the recursive term can reference the query's own output.

How recursive queries are evaluated

Evaluate the non-recursive term. For UNION (but not UNION ALL), discard duplicate rows. Place all remaining rows in the query result and in a temporary working table.
While the working table is not empty, repeat these steps:
1. Evaluate the recursive term, substituting the current contents of the working table for the recursive self-reference. For UNION (but not UNION ALL), discard duplicate rows and rows that duplicate any previous result row. Place all remaining rows in the query result and in a temporary intermediate table.
2. Replace the working table contents with the intermediate table contents, then empty the intermediate table.

Note

Although RECURSIVE allows queries to be specified recursively, they are evaluated iteratively.

In the sum example above, the working table holds a single row at each step, taking values from 1 through 100. At step 100, the WHERE clause produces no output, so the query terminates.

Traversal order

The recursive query algorithm generates output in width-first search order. To display results in depth-first search order, have the outer query use ORDER BY on a path column constructed during traversal.

Cycle detection

Recursive queries over graph-structured or hierarchical data can loop indefinitely if the data contains cycles. Using UNION instead of UNION ALL eliminates fully duplicate rows, but a cycle may not produce identical output rows — it may only repeat a subset of fields. The standard approach is to track visited values in an array.

The following query searches a graph table via a link field. Without cycle detection it loops if the link relationships form a cycle:

WITH RECURSIVE search_graph(id, link, data, depth) AS (
    SELECT g.id, g.link, g.data, 1
    FROM graph g
  UNION ALL
    SELECT g.id, g.link, g.data, sg.depth + 1
    FROM graph g, search_graph sg
    WHERE g.id = sg.link
)
SELECT * FROM search_graph;

Add path and cycle columns to detect and stop at cycles:

WITH RECURSIVE search_graph(id, link, data, depth, path, cycle) AS (
    SELECT g.id, g.link, g.data, 1,
      ARRAY[g.id],
      false
    FROM graph g
  UNION ALL
    SELECT g.id, g.link, g.data, sg.depth + 1,
      path || g.id,
      g.id = ANY(path)
    FROM graph g, search_graph sg
    WHERE g.id = sg.link AND NOT cycle
)
SELECT * FROM search_graph;

When multiple fields together identify a cycle, use an array of rows. For example, to check both f1 and f2:

WITH RECURSIVE search_graph(id, link, data, depth, path, cycle) AS (
    SELECT g.id, g.link, g.data, 1,
      ARRAY[ROW(g.f1, g.f2)],
      false
    FROM graph g
  UNION ALL
    SELECT g.id, g.link, g.data, sg.depth + 1,
      path || ROW(g.f1, g.f2),
      ROW(g.f1, g.f2) = ANY(path)
    FROM graph g, search_graph sg
    WHERE g.id = sg.link AND NOT cycle
)
SELECT * FROM search_graph;

Note

When only one field needs to be checked, omit the ROW() syntax. A simple array is more efficient than a composite-type array.

Beyond cycle prevention, the path array is independently useful: it records the exact path taken to reach any particular row.

Hierarchical data example

A common use of recursive queries is finding all direct and indirect sub-parts of a product from a table of immediate parent-child relationships:

WITH RECURSIVE included_parts(sub_part, part, quantity) AS (
    SELECT sub_part, part, quantity FROM parts WHERE part = 'our_product'
  UNION ALL
    SELECT p.sub_part, p.part, p.quantity
    FROM included_parts pr, parts p
    WHERE p.part = pr.sub_part
)
SELECT sub_part, SUM(quantity) AS total_quantity
FROM included_parts
GROUP BY sub_part;

Testing potentially infinite queries

To test a recursive query that might loop, add a LIMIT to the parent query:

WITH RECURSIVE t(n) AS (
    SELECT 1
  UNION ALL
    SELECT n+1 FROM t
)
SELECT n FROM t LIMIT 100;

This works because PostgreSQL's implementation evaluates only as many rows of a WITH query as are actually fetched by the parent query, so LIMIT can stop runaway recursion during development. Do not rely on this in production — other systems may behave differently, and this approach does not work when the outer query sorts the recursive query's results or joins them to another table, because in those cases the outer query fetches all rows anyway.

CTE materialization

A useful property of CTEs is that they are normally evaluated only once per parent query execution, even if referenced multiple times. This makes CTEs a good place to put expensive calculations needed in multiple places.

The trade-off is that the query optimizer cannot push restrictions from the parent query into a multiply-referenced CTE, because doing so would affect all uses of the CTE's output when only one usage should be filtered. The CTE is evaluated as written, without suppressing rows that the parent query might later discard.

Default behavior

A non-recursive, side-effect-free CTE (a SELECT with no volatile functions) can be folded into the parent query, enabling joint optimization of both query levels.

Folded (merged) when the parent query references the CTE exactly once.
Materialized (a temporary copy is created) when the parent query references the CTE more than once.

Folded example — equivalent to querying big_table directly with the index on key:

WITH w AS (
    SELECT * FROM big_table
)
SELECT * FROM w WHERE key = 123;

Materialized example — the CTE is evaluated in full, then joined with itself without index benefit:

WITH w AS (
    SELECT * FROM big_table
)
SELECT * FROM w AS w1 JOIN w AS w2 ON w1.key = w2.ref
WHERE w2.key = 123;

Overriding the default with MATERIALIZED and NOT MATERIALIZED

Use NOT MATERIALIZED to allow the parent query's restrictions to apply directly to scans of the underlying table:

WITH w AS NOT MATERIALIZED (
    SELECT * FROM big_table
)
SELECT * FROM w AS w1 JOIN w AS w2 ON w1.key = w2.ref
WHERE w2.key = 123;

Use MATERIALIZED when the CTE contains an expensive function that should run only once per row:

WITH w AS (
    SELECT key, very_expensive_function(val) AS f FROM some_table
)
SELECT * FROM w AS w1 JOIN w AS w2 ON w1.f = w2.f;

Here, materialization ensures very_expensive_function is called once per table row, not twice.

Data-modifying statements in WITH

INSERT, UPDATE, and DELETE can appear in WITH, enabling several different operations in a single query.

The following query moves rows from products to products_log:

WITH moved_rows AS (
    DELETE FROM products
    WHERE
        "date" >= '2010-10-01' AND
        "date" < '2010-11-01'
    RETURNING *
)
INSERT INTO products_log
SELECT * FROM moved_rows;

The DELETE removes the matching rows from products and returns their contents via RETURNING. The primary INSERT reads that output and inserts it into products_log.

Note that the WITH clause is attached to the INSERT, not to the sub-SELECT inside INSERT. Data-modifying statements are only allowed in WITH clauses attached to the top-level statement, but normal WITH visibility rules still apply — the sub-SELECT can refer to the CTE's output.

RETURNING and temporary tables

The output of RETURNING — not the target table — forms the temporary table that the rest of the query can reference. If a data-modifying statement in WITH has no RETURNING clause, it forms no temporary table and cannot be referenced elsewhere. It still executes, however:

WITH t AS (
    DELETE FROM foo
)
DELETE FROM bar;

This removes all rows from both foo and bar. The row count reported to the client includes only rows removed from bar.

Using recursive CTEs with data-modifying statements

Recursive self-references inside data-modifying statements are not allowed. As a workaround, reference the output of a recursive WITH instead:

WITH RECURSIVE included_parts(sub_part, part) AS (
    SELECT sub_part, part FROM parts WHERE part = 'our_product'
  UNION ALL
    SELECT p.sub_part, p.part
    FROM included_parts pr, parts p
    WHERE p.part = pr.sub_part
)
DELETE FROM parts
WHERE part IN (SELECT part FROM included_parts);

This removes all direct and indirect sub-parts of a product.

Execution semantics

Data-modifying statements in WITH differ from SELECT in one important way: they always execute exactly once and run to completion, regardless of whether the primary query reads any of their output. By contrast, a SELECT in WITH runs only as far as the parent query demands.

All sub-statements in WITH execute concurrently with each other and with the main query using the same snapshot. They cannot see each other's effects on the target tables. RETURNING is the only mechanism for passing data between sub-statements and the main query.

This snapshot behavior determines what the outer SELECT sees:

-- Outer SELECT sees the original prices (before the UPDATE)
WITH t AS (
    UPDATE products SET price = price * 1.05
    RETURNING *
)
SELECT * FROM products;

-- Outer SELECT sees the updated data (from the RETURNING clause)
WITH t AS (
    UPDATE products SET price = price * 1.05
    RETURNING *
)
SELECT * FROM t;

Limitations

Recursive self-references in data-modifying statements are not supported. Compute the target set with a recursive WITH query, then reference it in the DELETE, UPDATE, or INSERT.
Updating the same row twice in a single statement is not supported. Only one modification takes place, and which one is unpredictable. The same applies to deleting a row that was already updated in the same statement — only the update is performed. Avoid writing WITH sub-statements that affect rows already targeted by the main statement or a sibling sub-statement.
Target tables of data-modifying statements in `WITH` must not have a conditional rule, an ALSO rule, or an INSTEAD rule that expands to multiple statements.
`LIMIT` as an infinite-loop guard is not production-safe. Other systems may not apply the same early-termination optimization, and it fails when the outer query sorts or joins the recursive results.