This topic describes how to use the Optimize and Vacuum commands to optimize tables.

Optimize

  • Scala
    import io.delta.tables._
    
    val deltaTable = DeltaTable.forPath(spark, "/tmp/delta_table")
    
    deltaTable.optimizeExpr("date < '2019-11-11'", Seq("date", "id"))                       // Use an SQL formatted string.
    
    import org.apache.spark.sql.functions._
    import spark.implicits._
    
    deltaTable.optimize(expr("date < '2019-11-11'"), Seq(col("date"), col("id")))           // Use a Column expression parsed from an SQL string.
  • SQL
    OPTIMIZE delta_table [WHERE date < '2019-11-11'] [ZORDER BY date, id];

    You can call deltaTable.optimize() without parameters to optimize the entire table.

    Note The Z-ordering feature is supported in EMR V3.27.0 and later.

Vacuum

After you run the Optimize command, small files are merged into larger files, and the original small files are marked as tombstones. Tombstone files are not deleted automatically, because Delta may still need them to read historical versions of the data. Over time, accumulated tombstones reduce the efficiency of operations on the table.

The Vacuum command deletes these historical files. If the data in the tombstone files is no longer needed, you can run the Vacuum command to delete them. By default, a tombstone can be deleted only after its retention period elapses. If you attempt to delete a tombstone that is still within the retention period, an exception is thrown.

If you want to delete recent tombstones, such as small files that were marked as tombstones one day ago, use one of the following methods:
  • (Recommended) Shorten the retention period. For example, set the spark.databricks.delta.properties.defaults.deletedFileRetentionDuration parameter to interval 1 day, interval 2 weeks, or interval 365 days. Month and year units are not supported. This retention period is a static table property. You cannot set it by running a SET command or by passing --conf at runtime. You can set this parameter in either of the following ways:
    1. In the spark-defaults.conf file, configure it as a global property.
    2. Execute the ALTER TABLE <tbl> SET TBLPROPERTIES (key=value) statement to configure the property for the target table.
  • (Not recommended) Disable the retention period check by setting spark.databricks.delta.retentionDurationCheck.enabled to false. Then, you can delete all recently merged small files.
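The recommended method can be combined with a subsequent vacuum as follows. This is a minimal sketch that assumes a Delta table registered as delta_table with its data at /tmp/delta_table, and it uses the property key as named in this topic:

    import io.delta.tables._

    // Assumption: a Delta table registered as delta_table, stored at /tmp/delta_table.
    // Set the retention period of this table to one day.
    // Month and year units are not supported in the interval.
    spark.sql(
      "ALTER TABLE delta_table SET TBLPROPERTIES " +
      "('spark.databricks.delta.properties.defaults.deletedFileRetentionDuration' = 'interval 1 day')")

    // Tombstones older than one day can now be deleted without triggering
    // the retention period check.
    val deltaTable = DeltaTable.forPath(spark, "/tmp/delta_table")
    deltaTable.vacuum()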
Notice
  • We recommend that you do not disable the retention period check.
  • We recommend that you do not specify a retention period that is too short.

    If the retention period is too short, a job that reads a snapshot of a table may fail. Temporary files generated by a job that is being executed may even be cleared, which further leads to the failure of the job.

  • Scala
    import io.delta.tables._
    val deltaTable = DeltaTable.forPath(spark, pathToTable)
    
    deltaTable.vacuum()        // Delete all tombstones whose retention period has elapsed.
    deltaTable.vacuum(100)     // Delete tombstone files that are older than 100 hours.
  • SQL
    VACUUM table_name [RETAIN num HOURS] [DRY RUN]
    Take note of the following points:
    • You can use the RETAIN clause to specify a retention period for tombstones. If RETAIN num HOURS is not specified, the retention period defaults to the value of the spark.databricks.delta.properties.defaults.deletedFileRetentionDuration parameter.
    • If DRY RUN is specified, the command only lists the tombstones that would be deleted; no files are actually deleted.
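For example, you can preview a vacuum with DRY RUN before actually deleting files. This sketch assumes a table named delta_table and runs the SQL statements from Scala:

    // Assumption: a Delta table registered as delta_table.
    // Preview which tombstone files older than 100 hours would be deleted.
    spark.sql("VACUUM delta_table RETAIN 100 HOURS DRY RUN").show()

    // If the preview looks correct, run the vacuum to delete the files.
    spark.sql("VACUUM delta_table RETAIN 100 HOURS")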