E-MapReduce:FAQ

Last Updated: Mar 26, 2026

This page answers common questions about using Delta Lake on Alibaba Cloud EMR.

Why can't I create a table?

Delta Lake requires the LOCATION clause when you create a table. It specifies the directory where the table data is stored, and the resulting table is an external table in Apache Spark.

The behavior depends on whether that directory already exists:

  • Directory does not exist: Delta Lake creates a new table with the schema you defined.

  • Directory already exists: The schema you specify must exactly match the schema recorded in the Delta log file in that directory. Any mismatch causes table creation to fail.

Check the existing Delta log file in the directory and align your CREATE TABLE schema to match it.
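
The following is a minimal sketch of such a statement; the table name, columns, and OSS path are placeholders rather than values from this documentation:

    # Minimal sketch: create a Delta table at an explicit location.
    # The table name, schema, and OSS path below are placeholder assumptions.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("create-delta-table").getOrCreate()

    spark.sql("""
        CREATE TABLE IF NOT EXISTS events (
            id BIGINT,
            event_time TIMESTAMP,
            payload STRING
        )
        USING delta
        LOCATION 'oss://my-bucket/delta/events'
    """)
    # If oss://my-bucket/delta/events already contains a Delta log,
    # the column list above must match the schema recorded in that log.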

Why does Spark Streaming generate so many small files in Delta Lake?

Spark Streaming writes data as a series of micro-batches, and each micro-batch produces one or more files. With a short batch interval and continuous execution, the number of files accumulates quickly.

Choose a strategy based on your latency requirements:

  • Increase the micro-batch interval: suitable when a latency of a few minutes is acceptable. Trade-off: fewer files, but slightly higher end-to-end latency.

  • Run OPTIMIZE on a schedule: suitable when a real-time response is required. Trade-off: more frequent compaction overhead, and files still accumulate between runs.

Both strategies can be combined: set a larger micro-batch interval to slow file accumulation, then run OPTIMIZE periodically to compact what remains.
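
The sketch below illustrates the streaming half of that combination, assuming a Kafka source at a placeholder address and placeholder OSS paths; the 5-minute trigger is an example value, and OPTIMIZE is still run separately on a schedule:

    # Sketch: write a stream to Delta Lake with a larger micro-batch interval.
    # The Kafka endpoint, topic, and OSS paths are placeholder assumptions.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("delta-streaming").getOrCreate()

    source = (spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")   # assumed endpoint
        .option("subscribe", "events")                       # assumed topic
        .load())

    (source.selectExpr("CAST(value AS STRING) AS payload")
        .writeStream
        .format("delta")
        .option("checkpointLocation", "oss://my-bucket/delta/events/_checkpoints")
        .trigger(processingTime="5 minutes")   # larger micro-batch interval, fewer files
        .start("oss://my-bucket/delta/events"))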

Why does OPTIMIZE take so long to run?

When OPTIMIZE hasn't run for a long time, a large number of small files accumulate in Delta Lake. OPTIMIZE must read and rewrite all of them, which takes proportionally longer.

Set up a scheduled task to run OPTIMIZE regularly. The right frequency depends on your workload:

  • Better query performance: run OPTIMIZE more often (daily or more frequently).

  • Lower cost: run it less often.

A daily OPTIMIZE job is a good starting point, ideally during off-peak hours when compute costs are lower. Adjust the frequency based on your observed query performance and cost.

Why does OPTIMIZE fail?

Delta Lake uses optimistic concurrency control: when multiple write transactions commit to the same table at the same time and their changes conflict, only one of them succeeds and the others fail. OPTIMIZE is itself a write transaction, so it can conflict with concurrent writes.

OPTIMIZE is most likely to fail when a streaming job continuously deletes or updates data, a pattern common in Change Data Capture (CDC) workflows. If the streaming job only appends data without deletes or updates, the appends do not conflict with OPTIMIZE and it does not fail.

To reduce conflicts, partition the table by time and run OPTIMIZE on each partition after data has been fully written to it. This avoids OPTIMIZE competing with an active write window.
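
As a sketch, assuming a table named events that is partitioned by a date column ds, a partition-scoped compaction could look like this (the WHERE predicate of OPTIMIZE is limited to partition columns):

    # Sketch: compact only a partition whose write window has closed.
    # Table name "events" and partition column "ds" are placeholder assumptions.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("optimize-partition").getOrCreate()

    # Run after all data for this partition has been written.
    spark.sql("OPTIMIZE events WHERE ds = '2024-01-01'")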

Why do small files remain after running OPTIMIZE?

OPTIMIZE compacts small files into larger ones but does not delete the original files. Delta Lake retains those files to support snapshot isolation — queries that started before the compaction can still read the original files to access a consistent snapshot of the table.

To delete the compacted files, run VACUUM after OPTIMIZE. VACUUM removes files that have been superseded and have exceeded the retention period (7 days by default).
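
A minimal sketch of the compact-then-clean sequence, using the placeholder table name events:

    # Sketch: compact small files, then remove superseded files that have
    # exceeded the retention period (7 days by default).
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("compact-and-clean").getOrCreate()

    spark.sql("OPTIMIZE events")   # rewrite small files into larger ones
    spark.sql("VACUUM events")     # delete superseded files past the retention period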

Why do small files remain after running VACUUM?

VACUUM only deletes files that meet two conditions: they have been merged by OPTIMIZE, and they have exceeded the retention period. The default retention period is 7 days.

If files are still within the retention period, or if they haven't been merged yet, VACUUM leaves them in place. Run OPTIMIZE first, then wait for the retention period to elapse before running VACUUM to ensure the files are eligible for deletion.

How do I delete recently merged small files?

Recently merged files are still within the 7-day retention period, so VACUUM won't delete them by default. This retention window protects historical snapshot access — removing files too early can break queries that rely on time travel.

Warning

Overriding the retention period can cause active queries to fail if they are still reading the files you delete. Only proceed if you are certain no queries or streaming jobs are accessing historical snapshots of this table.

If you are sure it is safe, use one of these methods:

  • Disable the retention check: Set the Spark configuration property spark.databricks.delta.retentionDurationCheck.enabled to false and pass it as a parameter when starting your Spark job. Then run VACUUM with a short retention interval (see the sketch after this list).

  • Shorten the global retention period: In spark-defaults.conf, set spark.databricks.delta.properties.defaults.deletedFileRetentionDuration to interval 1 hour. This applies globally to all Delta tables on the cluster.
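
The sketch below shows the first method, assuming the placeholder table name events; the configuration property is set when the Spark session is created, and VACUUM then uses a one-hour retention window:

    # Sketch: disable the retention check for this job, then VACUUM with a
    # short retention window. Table name "events" is a placeholder assumption.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
        .appName("vacuum-short-retention")
        .config("spark.databricks.delta.retentionDurationCheck.enabled", "false")
        .getOrCreate())

    # Keep only the last hour of superseded files.
    spark.sql("VACUUM events RETAIN 1 HOURS")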

Why do Delta log files remain after running VACUUM?

VACUUM manages data files, not Delta log files. Delta Lake handles log file lifecycle automatically:

  • Every 10 commits, Delta Lake merges the accumulated log files into a checkpoint.

  • After merging, Delta Lake identifies and deletes expired log files.

  • The default retention period for Delta log files is 30 days.

No manual intervention is needed. If you see many log files, they are likely still within the 30-day retention window or have not yet reached the 10-commit threshold.

Does Delta Lake support scheduling OPTIMIZE or VACUUM automatically?

No. Delta Lake is a library, not a runtime, so it has no built-in scheduler. OPTIMIZE and VACUUM must be triggered externally.

Configure a scheduled task in your workflow orchestration system — such as Apache Airflow or a cron job — to run these commands periodically on your Delta tables.
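
As a sketch, a small maintenance job like the following could be submitted daily by cron, Apache Airflow, or another scheduler; the table names are placeholder assumptions:

    # Sketch of a maintenance job triggered by an external scheduler.
    # The list of tables is a placeholder assumption.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("delta-maintenance").getOrCreate()

    for table in ["events", "orders"]:       # tables to maintain (assumed)
        spark.sql(f"OPTIMIZE {table}")       # compact small files
        spark.sql(f"VACUUM {table}")         # remove expired, superseded files

    spark.stop()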