This page explains how to resolve common issues when querying Apache Hudi tables on EMR using Spark and Hive.
Duplicate data when querying with Spark
Symptom: Spark returns duplicate records when querying a Hudi table.
Cause: By default, Spark reads Parquet tables registered in the Hive metastore with its built-in Parquet reader instead of the table's own input format. This bypasses Hudi's merge logic, which causes duplicates.
Fix: Add the following configuration to your Spark query command:
spark.sql.hive.convertMetastoreParquet=false
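As a minimal sketch of where this setting can go, the following Python helper assembles a spark-sql command line that passes the configuration through --conf. The helper name spark_sql_command and the table name hudi_tbl are hypothetical and used only for illustration; the same --conf option can also be passed to spark-shell or spark-submit.

```python
import shlex


def spark_sql_command(query):
    """Build an argv list for spark-sql with the Hudi-related setting applied."""
    return [
        "spark-sql",
        "--conf", "spark.sql.hive.convertMetastoreParquet=false",
        "-e", query,
    ]


# hudi_tbl is a hypothetical table name.
cmd = spark_sql_command("SELECT * FROM hudi_tbl LIMIT 10")

# Print a shell-quoted form that can be pasted into a terminal.
print(shlex.join(cmd))
```

The list form can be handed to subprocess.run directly, which avoids shell quoting problems in the query text.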
Duplicate data when querying with Hive
Symptom: Hive returns duplicate records when querying a Hudi table.
Cause: Hive uses CombineHiveInputFormat (org.apache.hadoop.hive.ql.io.CombineHiveInputFormat) by default. This input format cannot invoke a table's custom input format, so Hudi's merge logic is skipped and duplicate records are returned.
Fix: Add the following to your Hive query command:
set hive.input.format = org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat
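The setting applies to the Hive session, so it has to run before the query in the same session. A minimal sketch, assuming the statements are sent together as one script (for example through hive -e or a script file); the helper name hive_script and the table name hudi_tbl are hypothetical:

```python
HUDI_INPUT_FORMAT = "org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat"


def hive_script(query):
    """Prefix a query with the session-level input format setting."""
    # Strip surrounding whitespace and any trailing semicolon so that
    # exactly one terminator is emitted per statement.
    body = query.strip().rstrip(";")
    return f"set hive.input.format={HUDI_INPUT_FORMAT};\n{body};\n"


# hudi_tbl is a hypothetical table name.
script = hive_script("SELECT * FROM hudi_tbl LIMIT 10;")
print(script)
```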
Partition pruning does not take effect when querying with Spark
Symptom: Spark queries scan all partitions instead of pruning to the relevant ones.
Cause: If the value of a partition field contains a forward slash (/), each slash is treated as a directory separator in the partition path, so Spark detects a different number of partition levels than actually exist. The mismatch prevents partition pruning from working.
Fix: Add the following to your Spark DataFrame API write command for the Hudi table:
hoodie.datasource.write.partitionpath.urlencode=true
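The level mismatch can be illustrated with a short Python sketch. It uses urllib.parse.quote only to mimic the effect of URL encoding on a path; it is not Hudi's own encoder, and the partition value and the helper partition_levels are hypothetical.

```python
from urllib.parse import quote


def partition_levels(partition_path):
    """Count the directory levels a partition path produces on storage."""
    return len(partition_path.split("/"))


# Hypothetical value of a single partition field that contains slashes.
value = "2021/01/01"

raw_path = value                      # written as-is: slashes become directories
encoded_path = quote(value, safe="")  # "/" is encoded as "%2F"

print(raw_path, partition_levels(raw_path))
print(encoded_path, partition_levels(encoded_path))

# Write option from the fix above. With PySpark it would be passed along with
# the other required Hudi write options, for example:
#   df.write.format("hudi").options(**hudi_options)
hudi_options = {"hoodie.datasource.write.partitionpath.urlencode": "true"}
```

With one partition field, the unencoded value yields three directory levels while the encoded value yields one, which matches the number of partition fields.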
Error "xxx is only supported with v2 tables" when running ALTER TABLE in Spark
Symptom: Running an ALTER TABLE statement on a Hudi table in Spark returns the error "xxx is only supported with v2 tables".
Cause: hoodie.schema.on.read.enable is not set to true. The Hudi-Spark schema evolution feature requires this configuration before you can run ALTER TABLE.
Fix: Run the following in the same Spark SQL session before your ALTER TABLE statement:
set hoodie.schema.on.read.enable=true
For supported syntax, see SparkSQL schema evolution and syntax description in the Apache Hudi documentation.
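The ordering matters: the set statement has to run before ALTER TABLE in the same session. A minimal sketch, assuming a callable that executes one SQL statement at a time (with PySpark this could be spark.sql); the helper name, the table hudi_tbl, and the column new_col are hypothetical:

```python
def alter_with_schema_evolution(run_sql, alter_statement):
    """Enable Hudi schema evolution, then run the ALTER TABLE statement."""
    run_sql("set hoodie.schema.on.read.enable=true")
    run_sql(alter_statement)


# Stand-in executor that records statements instead of running them.
executed = []
alter_with_schema_evolution(
    executed.append,
    "ALTER TABLE hudi_tbl ADD COLUMNS (new_col string)",
)

for statement in executed:
    print(statement)
```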