This topic answers frequently asked questions (FAQ) about using Delta.

Why does a table fail to be created in Delta?

When you create a Delta table, you must specify the Delta directory in the LOCATION clause. A Delta table is an external table in Spark. If the target directory does not exist, a new table is created, although this is not expected to happen in practice. If the target directory exists, the schema in the table creation statement must match the schema defined in the Delta log file in that directory. Otherwise, table creation fails.

Why are so many small files generated when I write a stream of data to Delta?

When you use Spark Streaming to write data to Delta, you are essentially writing a series of mini batches. Each batch generates one or more files, and each batch is usually small. A long-running Spark Streaming job therefore produces a large number of small files. The following workarounds are available:
  • If low latency is not a priority, increase the size of each mini batch.
  • Run the optimize command periodically to merge the small files of the table.
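
As a rough illustration of why larger mini batches help, the arithmetic can be sketched as follows. The function name and the assumption that every batch writes a fixed number of files are hypothetical simplifications for illustration, not a description of how Delta or Spark Streaming sizes files.

```python
def files_written(run_seconds: int, batch_interval_seconds: int,
                  files_per_batch: int = 1) -> int:
    """Estimate how many files a streaming job writes (toy model).

    Assumes every mini batch writes exactly `files_per_batch` files,
    which is a simplification made only for this example.
    """
    if batch_interval_seconds <= 0:
        raise ValueError("batch_interval_seconds must be positive")
    batches = run_seconds // batch_interval_seconds
    return batches * files_per_batch


ONE_DAY = 24 * 60 * 60
print(files_written(ONE_DAY, 10))    # 10-second batches
print(files_written(ONE_DAY, 300))   # 5-minute batches
```

Under these assumptions, moving from 10-second to 5-minute batches reduces the number of files written per day by a factor of 30.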

Why does the optimize command take so long to run?

If the optimize command has not been run in a long time, a considerable number of small files may accumulate in Delta. In this case, it may take a long time to run the optimize command. We recommend that you configure a scheduled task to periodically run the optimize command.

Why does the optimize command fail?

The optimize command deletes historical data files and writes new ones. Delta uses optimistic locking, so when multiple write transactions try to commit at the same time and conflict with each other, one of them fails. If a streaming job is constantly updating the data in Delta, for example, in a Change Data Capture (CDC) scenario, the optimize command is more likely to fail. If a streaming job only adds data and does not delete or update data, the optimize command does not fail for this reason. We recommend that you partition the table by time and run the optimize command on each partition after all data has been written to that partition.
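
The conflict behavior described above can be illustrated with a toy model. This is a simplified sketch, not Delta's actual implementation: the `ToyTable` class, its methods, and the rule "a commit fails if a file it read was removed by a commit that landed after its snapshot" are assumptions made for illustration.

```python
class ToyTable:
    """Toy model of optimistic commits; not Delta's real protocol."""

    def __init__(self, files):
        self.files = set(files)   # data files in the current snapshot
        self.version = 0
        self.history = []         # history[i] = files removed by commit i + 1

    def begin(self):
        """Return the snapshot version a transaction starts from."""
        return self.version

    def commit(self, read_version, read_files=(), add=(), remove=()):
        """Try to commit; return False if a conflicting commit won the race."""
        # Files removed by commits that landed after our snapshot was taken.
        removed_since = set()
        for removed in self.history[read_version:]:
            removed_since |= removed
        if removed_since & set(read_files):
            return False          # a file we read no longer exists
        self.files -= set(remove)
        self.files |= set(add)
        self.history.append(set(remove))
        self.version += 1
        return True


# Case 1: the streaming job only appends while optimize is running.
t1 = ToyTable({"a", "b"})
snapshot = t1.begin()                          # optimize reads files a and b
t1.commit(t1.begin(), add={"c"})               # stream appends file c
append_ok = t1.commit(snapshot, read_files={"a", "b"},
                      add={"ab"}, remove={"a", "b"})

# Case 2: a CDC-style update rewrites file a while optimize is running.
t2 = ToyTable({"a", "b"})
snapshot = t2.begin()
t2.commit(t2.begin(), read_files={"a"}, add={"a2"}, remove={"a"})
cdc_ok = t2.commit(snapshot, read_files={"a", "b"},
                   add={"ab"}, remove={"a", "b"})

print(append_ok, cdc_ok)
```

In this model the optimize commit succeeds alongside the append-only writer and fails alongside the updating writer, which matches the behavior described in the answer.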

Why do many small files still exist after the optimize command is run?

The optimize command merges small files, but the small files that were merged are not deleted immediately. Delta retains historical data so that earlier snapshots of the table remain accessible, and reading those snapshots requires these small files. To delete these small files, run the vacuum command.

Why do many small files still exist after the vacuum command is run?

The vacuum command deletes small files that have been merged and have exceeded the retention period. The default retention period is 7 days. If a small file has not been merged or is within the retention period, the vacuum command will not delete the file.
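
The rule above can be expressed as a small predicate. This is a minimal sketch under stated assumptions: the function name is hypothetical, and measuring the retention period from the time a file was merged is an assumption made for the example, not a statement about how Delta tracks file age internally.

```python
from datetime import datetime, timedelta
from typing import Optional

DEFAULT_RETENTION = timedelta(days=7)  # default retention period from the text


def vacuum_would_delete(merged_at: Optional[datetime], now: datetime,
                        retention: timedelta = DEFAULT_RETENTION) -> bool:
    """Return True if a file is both merged and past the retention period.

    `merged_at` is None for a file that has not been merged yet
    (it is still part of the current table).
    """
    if merged_at is None:
        return False
    return now - merged_at > retention


now = datetime(2024, 1, 31)
print(vacuum_would_delete(None, now))                    # not merged
print(vacuum_would_delete(datetime(2024, 1, 29), now))   # merged 2 days ago
print(vacuum_would_delete(datetime(2024, 1, 1), now))    # merged 30 days ago
```

Only the third file satisfies both conditions, so it is the only one this sketch marks for deletion.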

How do I delete small files that are generated recently and have been merged?

We recommend that you do not delete small files that are created recently. Delta may use these small files to access the history of the table. If you insist on doing so, the following options are available:
  • Turn off the retention period check by passing the corresponding configuration as a parameter when you start a Spark job.
  • Change the global retention period to a small value. For example, in the spark-defaults.conf file, set the retention period parameter to interval 1 hour.

Why do many Delta log files still exist after the vacuum command is run?

The vacuum command deletes data files, not Delta log files. Delta merges and deletes log files automatically. Delta merges log files every 10 commits. After log files are merged, Delta checks for expired log files and deletes them. The default validity period of a Delta log file is 30 days.
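
The log lifecycle described above can be sketched with two small helpers. Both function names are hypothetical, and the exact commit on which a merge is triggered is an assumption (here, every commit whose version number is a multiple of 10).

```python
def is_log_merge_version(version: int, interval: int = 10) -> bool:
    """Assume log files are merged when the commit version is a multiple of `interval`."""
    return version > 0 and version % interval == 0


def log_file_expired(age_days: float, validity_days: float = 30) -> bool:
    """A log file is treated as expired once it is older than the validity period."""
    return age_days > validity_days


merge_versions = [v for v in range(1, 31) if is_log_merge_version(v)]
print(merge_versions)
print(log_file_expired(12), log_file_expired(45))
```

Under these assumptions, a 12-day-old log file is kept and a 45-day-old one is removed the next time log files are merged.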

Does Delta support automatically running the optimize or vacuum command?

No. Delta is a library, not a runtime, and does not provide a mechanism for automatically running the optimize or vacuum command. However, you can configure a scheduled task to run the optimize or vacuum command periodically.
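
A scheduled task could generate the maintenance statements with a helper like the one below. This is only a sketch: the function name, the table name `events`, and the partition column `dt` are hypothetical, and the `OPTIMIZE ... WHERE ...` and `VACUUM ...` statement forms are assumed, so check the syntax supported by your Delta version before using them.

```python
from typing import List, Optional


def maintenance_statements(table: str,
                           partition_filter: Optional[str] = None,
                           vacuum: bool = True) -> List[str]:
    """Build the SQL statements a scheduled maintenance task would run.

    The statement syntax is an assumption; verify it against your
    Delta version's documentation.
    """
    optimize = f"OPTIMIZE {table}"
    if partition_filter:
        # Restrict compaction to partitions that are no longer being written.
        optimize += f" WHERE {partition_filter}"
    statements = [optimize]
    if vacuum:
        statements.append(f"VACUUM {table}")
    return statements


for statement in maintenance_statements("events", "dt = '2024-01-01'"):
    # In a real job, each statement would be submitted to Spark,
    # for example through an existing SparkSession's sql() method.
    print(statement)
```

The scheduler itself (for example, a cron job or a workflow tool) is outside the scope of this sketch.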