Hudi Common Issues with Spark and Hive Explained - E-MapReduce

This page explains how to resolve common issues when querying Apache Hudi tables on EMR using Spark and Hive.

Duplicate data when querying with Spark

Symptom: Spark returns duplicate records when querying a Hudi table.

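Cause: By default, spark.sql.hive.convertMetastoreParquet is true, so Spark reads the table's Parquet files with its built-in Parquet data source reader. Hudi does not support this read path. It bypasses Hudi's input format and merge logic, so several versions of the same record can be returned.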

Fix: Add the following configuration to your Spark query command:

spark.sql.hive.convertMetastoreParquet=false
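
As a minimal sketch, assuming a PySpark job with Hive support, the setting can be applied when the session is built. Only the configuration key comes from the fix above; the helper names, the application name, and the table name hudi_db.hudi_tbl are hypothetical.

```python
# Setting taken from the fix above; all other names are illustrative.
HUDI_QUERY_CONF = {"spark.sql.hive.convertMetastoreParquet": "false"}


def build_session(app_name="hudi-query"):
    """Create a Hive-enabled SparkSession with the Hudi query setting applied."""
    # Imported lazily so this module loads even where pyspark is not installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name).enableHiveSupport()
    for key, value in HUDI_QUERY_CONF.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


def preview_table(spark, table="hudi_db.hudi_tbl", limit=10):
    """Query a (hypothetical) Hudi table through the configured session."""
    return spark.sql(f"SELECT * FROM {table} LIMIT {limit}")
```

The same key and value can also be passed on the command line with --conf when launching spark-sql or spark-submit.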

Duplicate data when querying with Hive

Symptom: Hive returns duplicate records when querying a Hudi table.

Cause: Hive uses CombineHiveInputFormat by default. This input format cannot invoke a table's custom input format, so Hudi's merge logic is skipped and duplicate records are returned.

Fix: Run the following in your Hive session before the query:

set hive.input.format=org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat
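
A minimal sketch of how the setting might be prepended to a query before the script is handed to a Hive client. The helper with_hudi_input_format and the table name hudi_db.hudi_tbl are hypothetical; only the set statement comes from the fix above.

```python
HUDI_INPUT_FORMAT = "org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat"


def with_hudi_input_format(query):
    """Return a Hive script that sets the Hudi input format, then runs the query."""
    statements = [
        f"set hive.input.format={HUDI_INPUT_FORMAT}",
        query.strip().rstrip(";"),
    ]
    # Hive statements are separated by semicolons.
    return ";\n".join(statements) + ";"


# hudi_db.hudi_tbl is a hypothetical table name.
script = with_hudi_input_format("SELECT count(*) FROM hudi_db.hudi_tbl")
print(script)
```

The resulting script can then be passed to whichever Hive client you use, for example through the -e option of hive or beeline.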

Partition pruning does not take effect when querying with Spark

Symptom: Spark queries scan all partitions instead of pruning to the relevant ones.

Cause: If a partition field value contains a forward slash (/), the partition path has more directory levels than the table has partition fields. Spark then detects a different number of partition levels than actually exist, and the mismatch prevents partition pruning from working.

Fix: Set the following option when you write the Hudi table with the Spark DataFrame API:

hoodie.datasource.write.partitionpath.urlencode=true
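
To make the mismatch concrete, the sketch below uses Python's standard urllib to show how a slash in the partition path changes the number of directory levels, and how URL encoding collapses it back into one segment. It then builds a set of Hudi write options that includes the setting. The encoding call only illustrates the idea and is not Hudi's own code; the table name, field names, and sample value are hypothetical.

```python
from urllib.parse import quote

raw_value = "2021/01/01"  # hypothetical partition value containing slashes

# Without encoding, the value spans three directory levels in the path.
unencoded_levels = raw_value.split("/")

# With encoding, each "/" becomes "%2F", so the value is one path segment.
encoded_value = quote(raw_value, safe="")
encoded_levels = encoded_value.split("/")

print(len(unencoded_levels), len(encoded_levels), encoded_value)


def hudi_write_options(table, record_key, partition_field, precombine_field):
    """Build Hudi DataFrame write options with partition path URL encoding on."""
    return {
        "hoodie.table.name": table,
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.partitionpath.field": partition_field,
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.partitionpath.urlencode": "true",
    }


# All names below are hypothetical.
options = hudi_write_options("hudi_tbl", "id", "dt", "ts")
# With a DataFrame `df` and a target path `base_path`, the write would look like:
# df.write.format("hudi").options(**options).mode("append").save(base_path)
```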

Error "xxx is only supported with v2 tables" when running ALTER TABLE in Spark

Symptom: Running an ALTER TABLE statement on a Hudi table in Spark returns the error "xxx is only supported with v2 tables", where xxx is the name of the operation.

Cause: hoodie.schema.on.read.enable is not set to true. The Hudi-Spark schema evolution feature requires this configuration before you can run ALTER TABLE.

Fix: Run the following in the same Spark SQL session before the ALTER TABLE statement:

set hoodie.schema.on.read.enable=true
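
A minimal PySpark sketch of the order of operations, assuming an existing SparkSession. The setting is applied in the session first, and the ALTER TABLE statement runs afterwards. The helper names, the table hudi_db.hudi_tbl, and the column ext_col are hypothetical.

```python
SCHEMA_ON_READ = "set hoodie.schema.on.read.enable=true"


def add_column_statements(table, column, column_type):
    """Return the statements to run, in order, to add a column to a Hudi table."""
    return [
        SCHEMA_ON_READ,
        f"ALTER TABLE {table} ADD COLUMNS ({column} {column_type})",
    ]


def run_statements(spark, statements):
    """Execute each statement in the same Spark session."""
    for statement in statements:
        spark.sql(statement)


# Table and column names are hypothetical.
statements = add_column_statements("hudi_db.hudi_tbl", "ext_col", "string")
# run_statements(spark, statements)  # needs an active SparkSession named `spark`
```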

For supported syntax, see SparkSQL schema evolution and syntax description in the Apache Hudi documentation.