This topic provides answers to some frequently asked questions about Delta Lake.

Why am I unable to create a table?

When you create a table in Delta Lake, you must use the LOCATION parameter to specify the directory in which the table data is stored. The table is an external table in Spark. If the specified directory does not exist, the table is created with the schema that you define, and no error occurs. If the directory specified by the LOCATION parameter already exists, the schema that you define must match the schema that is recorded in the Delta log file in that directory. Otherwise, the table cannot be created.
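The following is a minimal sketch of creating a Delta table with the LOCATION parameter in Spark SQL. The table name, columns, and path are hypothetical:

    spark.sql("""
      CREATE TABLE events (
        id BIGINT,
        ds STRING,
        payload STRING
      )
      USING delta
      LOCATION '/path/to/events'
    """)

If the directory /path/to/events already contains a Delta log, the column list above must match the schema recorded in that log; otherwise the statement fails.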

Multiple small files are generated when I write streaming data to Delta Lake. What do I do?

When you use Spark Streaming to write data to Delta Lake, the data is written as a series of mini-batches, and each batch generates one or more files. In most cases, the batch size is small, so a continuously running Spark Streaming job generates a large number of small files. You can use the following methods to resolve this issue (a code sketch follows the list):
  • If you do not require real-time results, we recommend that you increase the size of each mini-batch, for example by using a longer trigger interval.
  • Periodically run the OPTIMIZE command to merge the small files of the table.
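The following Scala sketch combines both methods: it writes a stream with a longer trigger interval so that each mini-batch is larger, and it assumes that an OPTIMIZE command is run periodically afterwards. The source, paths, and the 5-minute interval are hypothetical:

    import org.apache.spark.sql.streaming.Trigger

    // Hypothetical streaming source; replace it with your own input stream.
    val df = spark.readStream.format("rate").load()

    // Larger mini-batches: trigger every 5 minutes instead of as often as possible.
    val query = df.writeStream
      .format("delta")
      .outputMode("append")
      .option("checkpointLocation", "/path/to/checkpoints")
      .trigger(Trigger.ProcessingTime("5 minutes"))
      .start("/path/to/delta_table")

    // Run periodically, for example from a scheduled job, to merge the small files.
    spark.sql("OPTIMIZE delta.`/path/to/delta_table`")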

Why does the OPTIMIZE command take a long time to run?

If the OPTIMIZE command has not been run for a long time, a large number of small files may have accumulated in Delta Lake. In this case, the OPTIMIZE command may take a long time to complete. We recommend that you configure a scheduled task to run the OPTIMIZE command periodically.

The OPTIMIZE command fails. What do I do?

The OPTIMIZE command rewrites data: it deletes the existing small files and writes new, merged files. Delta Lake uses optimistic concurrency control, so when multiple write transactions that touch the same files are committed at the same time, only one of them succeeds and the others fail. If a streaming job continuously updates the data in Delta Lake, especially in Change Data Capture (CDC) scenarios, the OPTIMIZE command is likely to conflict with the streaming writes and fail. If the streaming job only appends data and does not delete or update existing data, the OPTIMIZE command does not fail. We recommend that you partition the table by time and run the OPTIMIZE command on a partition only after the data for that partition has been written.
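For example, the following sketch partitions a hypothetical events table by a ds (date) column and runs OPTIMIZE only on a partition whose data is complete, so the command does not conflict with writes to newer partitions. The exact OPTIMIZE syntax may vary by Delta Lake version:

    spark.sql("""
      CREATE TABLE events (
        id BIGINT,
        payload STRING,
        ds STRING
      )
      USING delta
      PARTITIONED BY (ds)
      LOCATION '/path/to/events'
    """)

    // Run after all data for the 2024-06-01 partition has been written.
    spark.sql("OPTIMIZE events WHERE ds = '2024-06-01'")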

Why do many small files still exist after the OPTIMIZE command is run?

The OPTIMIZE command merges small files, but the merged small files are not deleted immediately. Delta Lake retains historical data, and these small files are still required when you access historical snapshots of the table. To delete the small files, run the VACUUM command.
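For example, assuming the hypothetical events table used above, the following sketch removes files that are no longer referenced by the table and have exceeded the retention period:

    // Deletes merged small files that are older than the default retention period (7 days).
    spark.sql("VACUUM events")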

Why do many small files still exist after the VACUUM command is run?

The VACUUM command deletes only small files that have been merged and whose retention period has expired. The default retention period is seven days. If a small file has not been merged, or has been merged but is still within the retention period, the VACUUM command does not delete it.

How do I delete small files that were recently generated and merged?

We recommend that you do not delete recently generated small files, because these files may still be required when you access the historical data of Delta Lake. If you want to delete small files that were recently generated and merged, use one of the following methods (a code sketch follows the list):
  • Set spark.databricks.delta.retentionDurationCheck.enabled to false and pass the setting as a parameter when you start a Spark job.
  • Change the global retention period to a small value. For example, in the spark-defaults.conf file, set spark.databricks.delta.properties.defaults.deletedFileRetentionDuration to interval 1 hour.
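The following sketch shows the two methods; the one-hour value and the events table name are hypothetical, and deleting recent files makes time travel to older snapshots impossible:

    // Method 1: disable the retention period check (it can also be passed as
    // --conf spark.databricks.delta.retentionDurationCheck.enabled=false when the
    // Spark job is started), then run VACUUM with a short retention period.
    spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
    spark.sql("VACUUM events RETAIN 1 HOURS")

    // Method 2 (alternative): set the default retention period globally, for example
    // in spark-defaults.conf:
    //   spark.databricks.delta.properties.defaults.deletedFileRetentionDuration  interval 1 hour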

Why do many Delta log files still exist after the VACUUM command is run?

The VACUUM command deletes data files, not Delta log files. Delta Lake merges and deletes log files automatically: log files are merged every 10 commits, and after log files are merged, Delta Lake checks for expired log files and deletes them. The default retention period of a Delta log file is 30 days.
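If you want to change how long log files are retained, the log retention period and the checkpoint interval can be set as table properties. The following sketch uses the open-source Delta Lake property names, which may differ across versions, and the values are hypothetical:

    spark.sql("""
      ALTER TABLE events SET TBLPROPERTIES (
        'delta.logRetentionDuration' = 'interval 7 days',
        'delta.checkpointInterval'   = '10'
      )
    """)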

Does Delta Lake support the automatic running of the OPTIMIZE command or the VACUUM command?

No. Delta Lake is a library, not a runtime service, and it does not provide a mechanism to automatically run the OPTIMIZE or VACUUM command. However, you can configure a scheduled task to run the OPTIMIZE and VACUUM commands periodically.
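For example, the following is a minimal sketch of a maintenance job that you could package and trigger from an external scheduler, such as a cron job or a workflow service. The events table name is hypothetical:

    import org.apache.spark.sql.SparkSession

    object DeltaMaintenanceJob {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("delta-maintenance")
          .getOrCreate()

        // Merge small files, then delete files that have exceeded the retention period.
        spark.sql("OPTIMIZE events")
        spark.sql("VACUUM events")

        spark.stop()
      }
    }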