This topic provides answers to some frequently asked questions about Hudi.

What do I do if duplicate data is returned when I use Spark to query data in a Hudi table?

  • Cause: Hudi data cannot be read correctly over the native Data Source API of Spark. By default, Spark converts a Hive metastore Parquet table to its built-in Parquet data source, which bypasses the Hudi input format and reads all file versions. As a result, duplicate data is returned.
  • Solution: Add spark.sql.hive.convertMetastoreParquet=false to the command that is used to query data in the Hudi table.
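For example, the setting can be applied at the session level in spark-sql. This is a sketch; the table name hudi_table and the partition filter are hypothetical.

```sql
-- Disable Spark's built-in Parquet conversion so that the Hudi input
-- format, which filters out superseded file versions, is used instead.
SET spark.sql.hive.convertMetastoreParquet=false;

-- Hypothetical table and partition filter.
SELECT * FROM hudi_table WHERE dt = '2024-01-01';
```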

What do I do if duplicate data is returned when I use Hive to query data in a Hudi table?

  • Cause: By default, Hive uses HiveCombineInputFormat. However, this input format class cannot call an input format that is customized for a table, such as the input format registered for Hudi tables.
  • Solution: Add set hive.input.format=org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat; to the command that is used to query data in the Hudi table.
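For example, in a Hive session the setting can be applied before the query runs. This is a sketch; the table name hudi_table is hypothetical.

```sql
-- Use the Hudi-aware combine input format so that the input format
-- customized for the Hudi table is honored.
SET hive.input.format=org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat;

-- Hypothetical table name.
SELECT * FROM hudi_table;
```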

What do I do if partition pruning does not take effect when Spark is used to query data in a Hudi table?

  • Cause: If the value of a partition field contains a forward slash (/), the number of partition fields detected during the query is inconsistent with the actual number of partition levels. As a result, partition pruning does not take effect.
  • Solution: Add hoodie.datasource.write.partitionpath.urlencode=true to the command that is used to write data to the Hudi table over the Spark DataFrame API. This setting URL-encodes partition path values so that a forward slash in a value does not create an extra partition level.
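A minimal write sketch over the Spark DataFrame API, assuming an existing DataFrame df; the record key field, partition field, table name, and target path are hypothetical.

```scala
// Sketch: write df to a Hudi table with partition path values URL-encoded,
// so that a forward slash inside a value does not add a partition level.
df.write.format("hudi").
  option("hoodie.datasource.write.recordkey.field", "id").       // assumption
  option("hoodie.datasource.write.partitionpath.field", "dt").   // assumption
  option("hoodie.datasource.write.partitionpath.urlencode", "true").
  option("hoodie.table.name", "hudi_table").                     // hypothetical
  mode("append").
  save("/tmp/hudi_table")                                        // hypothetical
```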