Delta Lake tables in AnalyticDB for MySQL support a complete data lifecycle: raw ingestion, partitioned storage, ACID mutations, incremental merges, historical lookbacks, and physical optimization. This topic walks through each stage using Spark SQL, from creating a Bronze-layer staging table to dropping the final production table.
Prerequisites
Before you begin, make sure you have:
An AnalyticDB for MySQL cluster with Spark enabled
An OSS bucket for storing Delta Lake data
Access to a Spark SQL session
To use DROP TABLE PURGE to physically delete files from OSS, add the following parameters to your Spark configuration, then restart Spark for the changes to take effect:
spark.hive.purgeExternalTableData.enabled=true
spark.hive.purgeManagedTableData.enabled=true
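If you submit jobs from a command line rather than editing the cluster configuration, the same parameters can also be passed at launch time. A sketch, assuming a spark-sql entry point; adapt it to however Spark jobs are started in your environment:

```shell
# Pass the purge settings as launch-time configuration.
spark-sql \
  --conf spark.hive.purgeExternalTableData.enabled=true \
  --conf spark.hive.purgeManagedTableData.enabled=true
```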
Ingest raw data (Bronze layer)
Use CREATE TABLE AS SELECT (CTAS) to quickly ingest external data into Delta Lake without defining a schema or partitions upfront.
CREATE DATABASE IF NOT EXISTS deltalake_db LOCATION 'oss://<YOUR_OSS_BUCKET>/PATH/';
-- 1. Clean up metadata.
DROP TABLE IF EXISTS deltalake_db.orders_staging;
-- 2. Ingest data using CTAS (no partitions, full replication).
CREATE OR REPLACE TABLE deltalake_db.orders_staging
USING DELTA
COMMENT 'Bronze Layer: Raw Staging Table'
AS
SELECT * FROM ADB_External_TPCH_10GB.external_orders;
-- Verify: confirm the data loaded.
SELECT count(*) FROM deltalake_db.orders_staging;
Without partitions, full table scans are slow. The next step moves data into a partitioned production table to fix this.
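Because CTAS infers the schema from the external source, it is worth confirming the inferred column names and types before building on the table:

```sql
-- Inspect the schema that CTAS inferred from the source table.
DESCRIBE TABLE deltalake_db.orders_staging;
```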
Build a partitioned production table (Silver layer)
Create a production table partitioned by year, then load data from the staging table. Partitioning enables partition pruning, so queries that filter on the partition key read only the relevant partitions instead of scanning the entire table.
-- 1. Clean up metadata.
DROP TABLE IF EXISTS deltalake_db.orders_prod;
-- 2. Define the production table schema (includes the explicit o_order_year partition key column).
CREATE TABLE IF NOT EXISTS deltalake_db.orders_prod (
o_orderkey LONG,
o_custkey LONG,
o_orderstatus STRING,
o_totalprice DECIMAL(18,2),
o_orderdate DATE,
o_orderpriority STRING,
o_clerk STRING,
o_shippriority INT,
o_comment STRING,
o_order_year INT
)
USING DELTA
PARTITIONED BY (o_order_year)
COMMENT 'Silver Layer: Partitioned Production Table';
-- 3. Clean, aggregate, and load data from the staging table.
INSERT INTO deltalake_db.orders_prod
SELECT
o_orderkey,
o_custkey,
o_orderstatus,
o_totalprice,
o_orderdate,
o_orderpriority,
o_clerk,
o_shippriority,
o_comment,
YEAR(o_orderdate) AS o_order_year
FROM deltalake_db.orders_staging;
-- Verify: view partition information.
DESCRIBE DETAIL deltalake_db.orders_prod;
Always list columns explicitly in INSERT INTO statements. Using SELECT * can cause column ordering mismatches with the target schema.
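To verify that partition pruning actually takes effect, check the physical plan for a query that filters on the partition key. The scan node should show a partition filter on o_order_year rather than a read of every partition (the exact plan text varies by Spark version):

```sql
-- The plan should restrict the scan to the o_order_year = 1995 partition.
EXPLAIN
SELECT count(*)
FROM deltalake_db.orders_prod
WHERE o_order_year = 1995;
```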
Update and delete records (ACID mutations)
Delta Lake supports ACID-compliant UPDATE and DELETE operations. Include the partition key in your WHERE clause to enable partition pruning and avoid full table scans.
-- Update: correct the status of orders for a specific clerk in 1993.
UPDATE deltalake_db.orders_prod
SET o_orderstatus = 'P'
WHERE o_order_year = 1993
AND o_clerk = 'Clerk#000000001';
-- Delete: remove low-priority orders from 1992.
DELETE FROM deltalake_db.orders_prod
WHERE o_orderpriority = '5-LOW'
AND o_order_year = 1992;
Merge incremental data (upsert)
Use MERGE to process incremental data: update matching records and insert new ones in a single atomic operation.
MERGE INTO deltalake_db.orders_prod AS target
USING (
SELECT
999999999 AS o_orderkey,
12345 AS o_custkey,
'O' AS o_orderstatus,
100.00 AS o_totalprice,
CAST('2025-05-20' AS DATE) AS o_orderdate,
'1-URGENT' AS o_orderpriority,
'Clerk#Auto' AS o_clerk,
0 AS o_shippriority,
'New Order' AS o_comment,
2025 AS o_order_year
) AS source
ON target.o_orderkey = source.o_orderkey
WHEN MATCHED THEN
UPDATE SET target.o_totalprice = source.o_totalprice
WHEN NOT MATCHED THEN
INSERT *;
Query historical versions (time travel)
Delta Lake records every change as a versioned snapshot. Use DESCRIBE HISTORY to list all versions, then query any past state with VERSION AS OF (or a point in time with TIMESTAMP AS OF).
-- View the full operation history.
DESCRIBE HISTORY deltalake_db.orders_prod;
-- Query the table state right after the initial data load.
-- Version 0 = CREATE TABLE, Version 1 = INSERT from staging table.
SELECT COUNT(*) FROM deltalake_db.orders_prod VERSION AS OF 1;
Optimize table performance
Run these operations periodically to keep query performance consistent. Delta Lake's physical optimization goes beyond what plain Parquet tables support.
-- 1. Compact small files within each partition into larger files.
OPTIMIZE deltalake_db.orders_prod;
-- 2. Apply Z-order clustering to accelerate queries filtered by o_custkey.
-- This physically co-locates rows with the same o_custkey value.
OPTIMIZE deltalake_db.orders_prod ZORDER BY (o_custkey);
-- 3. Delete orphan files to release OSS space.
-- Disable the built-in retention safety check first (re-enable after testing).
SET spark.databricks.delta.retentionDurationCheck.enabled = false;
VACUUM deltalake_db.orders_prod RETAIN 0 HOURS;
-- 4. Re-enable the retention safety check.
SET spark.databricks.delta.retentionDurationCheck.enabled = true;
RETAIN 0 HOURS immediately deletes all files not referenced by the current table version. Use this only in testing environments. In production, use RETAIN 168 HOURS (7 days) to protect long-running jobs and preserve time travel history.
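In production, you can preview what a VACUUM would remove before deleting anything. DRY RUN lists the candidate files without touching OSS:

```sql
-- List files that would be deleted, without actually deleting them.
VACUUM deltalake_db.orders_prod RETAIN 168 HOURS DRY RUN;
```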
Delete the table definition and data
DROP TABLE IF EXISTS deltalake_db.orders_prod PURGE;
DROP TABLE PURGE physically removes files from OSS only when spark.hive.purgeExternalTableData.enabled and spark.hive.purgeManagedTableData.enabled are both set to true. See Prerequisites.