This topic describes how to use the Optimize and Vacuum commands to optimize Delta tables.
```scala
import io.delta.tables._

val deltaTable = DeltaTable.forPath(spark, "/tmp/delta_table")

// Use an SQL-formatted string to specify the predicate and Z-order columns.
deltaTable.optimizeExpr("date < '2019-11-11'", Seq("date", "id"))

// Use Column expressions and implicit conversions.
import org.apache.spark.sql.functions._
import spark.implicits._
deltaTable.optimize(col("date") < "2019-11-11", Seq(col("date"), col("id")))
```
```sql
OPTIMIZE delta_table [WHERE t.date < '2019-11-11'] [ZORDER BY t.date, t.id];
```
You can call the `deltaTable.optimize()` function without specifying any parameters to optimize the entire table.

Note: The Z-ordering feature is supported in EMR V3.27.0 and later.
After you run the Optimize command, small files are merged into large files, and the original small files are marked as tombstones. The tombstones are not deleted automatically, and Delta still reads historical data from them, which reduces the efficiency gained from optimizing the table.
The Vacuum command deletes historical files. If the data in tombstones is no longer needed, you can run the Vacuum command to delete the tombstones. By default, a tombstone can be deleted only after its retention period elapses. If you attempt to delete a tombstone that is still within the retention period, an exception is thrown.
- (Recommended) Set a retention period, such as `interval 1 day`, `interval 2 weeks`, or `interval 365 days`. You cannot specify `year` as the unit. The retention period is a static property of a table. You cannot set this property by running a command or configuring `--conf`. You can set it in either of the following ways:
- In the spark-defaults.conf file, configure the property globally.
- Execute the `ALTER TABLE <tbl> SET TBLPROPERTIES (key=value)` statement to configure the property for a specific table.
- (Not recommended) Disable the retention period check by setting `spark.databricks.delta.retentionDurationCheck.enabled` to `false`. You can then delete all tombstones, including recently merged small files.
- We recommend that you do not disable the retention period check.
- We recommend that you do not specify a retention period that is too short.
If the retention period is too short, a job that reads a snapshot of the table may fail. Temporary files generated by a job that is still running may even be deleted, which also causes that job to fail.
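The two configuration methods above can be sketched as follows. This assumes the standard Delta table property `delta.deletedFileRetentionDuration` and its Spark-level default `spark.databricks.delta.properties.defaults.deletedFileRetentionDuration`; verify the property names in your EMR release.

```sql
-- Option 1: set a global default in spark-defaults.conf (shown here as a comment):
-- spark.databricks.delta.properties.defaults.deletedFileRetentionDuration  interval 2 weeks

-- Option 2: set the retention period on a single table.
ALTER TABLE delta_table SET TBLPROPERTIES (
  'delta.deletedFileRetentionDuration' = 'interval 2 weeks'
);
```

A table-level property overrides the global default, so you can keep a conservative global retention period and shorten it only for tables whose history you do not need.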
```scala
import io.delta.tables._

val deltaTable = DeltaTable.forPath(spark, pathToTable)

// Delete all tombstones whose retention period has elapsed.
deltaTable.vacuum()

// Delete files that were marked as tombstones more than 100 hours ago.
deltaTable.vacuum(100)
```
```sql
VACUUM table_name [RETAIN num HOURS] [DRY RUN]
```

Take note of the following points:
- You can use the `RETAIN` clause to specify a retention period for tombstones. If you do not specify `RETAIN num HOURS`, the retention period property of the table is used.
- If `DRY RUN` is specified, the command only lists the tombstones that would be deleted; no files are actually deleted.
- You can use the
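For example, you can preview a cleanup with `DRY RUN` before actually deleting any files. The table name `delta_table` below is a placeholder:

```sql
-- List the files that would be deleted, without deleting anything.
VACUUM delta_table RETAIN 100 HOURS DRY RUN;

-- After reviewing the list, run the actual cleanup.
VACUUM delta_table RETAIN 100 HOURS;
```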