
E-MapReduce:FAQ

Last Updated:Sep 19, 2023

This topic provides answers to some frequently asked questions about Hudi.

What do I do if duplicate data is returned when I use Spark to query data in a Hudi table?

  • Cause: Hudi tables cannot be correctly read by using the native Parquet data source of Spark. By default, Spark converts Hive Parquet tables to its built-in data source, which bypasses the input format that is customized for Hudi tables and therefore returns duplicate record versions.

  • Solution: Add spark.sql.hive.convertMetastoreParquet=false to the command that is used to query data in a Hudi table.
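A minimal sketch of this fix in a spark-sql session. The table name hudi_tbl and the dt filter are hypothetical placeholders; only the set statement comes from the solution above:

```sql
-- Disable Spark's native Parquet reader for Hive tables so that the
-- input format customized for Hudi is used and stale record versions
-- are filtered out.
set spark.sql.hive.convertMetastoreParquet=false;

-- Hypothetical table and partition filter; replace with your own.
SELECT * FROM hudi_tbl WHERE dt = '2023-09-19';
```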

What do I do if duplicate data is returned when I use Hive to query data in a Hudi table?

  • Cause: By default, Hive uses CombineHiveInputFormat (hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat). This input format class cannot be used to call an input format that is customized for a table.

  • Solution: Add set hive.input.format=org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat to the command that is used to query data in a Hudi table.
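For example, in a Hive session the setting is issued before the query. The table name hudi_tbl is a hypothetical placeholder; only the set statement comes from the solution above:

```sql
-- Switch Hive to Hudi's combine input format so that the input format
-- customized for the Hudi table is honored during the query.
set hive.input.format=org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat;

-- Hypothetical table name; replace with your own.
SELECT * FROM hudi_tbl;
```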

What do I do if partition pruning does not take effect when I use Spark to query data in a Hudi table?

  • Cause: If the value of a partition field contains a forward slash (/), the number of partition fields detected during the query is inconsistent with the actual number of partition levels. As a result, partition pruning does not take effect.

  • Solution: Add hoodie.datasource.write.partitionpath.urlencode=true to the command that is used to write data to a Hudi table by using the DataFrame API of Spark.
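A sketch of how this option might be supplied when the table is created through Spark SQL; with the DataFrame API, the same key is passed as a write option. The table name, columns, and primaryKey setting are hypothetical placeholders, and only the urlencode key comes from the solution above:

```sql
-- Hypothetical Hudi table. The key point is the urlencode option,
-- which URL-encodes partition path values so that a forward slash in
-- a value does not produce extra partition levels.
CREATE TABLE hudi_tbl (
  id BIGINT,
  name STRING,
  dt STRING
) USING hudi
TBLPROPERTIES (
  'primaryKey' = 'id',
  'hoodie.datasource.write.partitionpath.urlencode' = 'true'
)
PARTITIONED BY (dt);
```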

What do I do if the error message "xxx is only supported with v2 tables" appears when I execute the ALTER TABLE statement in Spark?

  • Cause: The hoodie.schema.on.read.enable configuration item for Hudi is not set to true when you use the Hudi-Spark schema evolution feature.

  • Solution: Execute set hoodie.schema.on.read.enable=true before you execute the ALTER TABLE statement on the Hudi table. For more information, see SparkSQL Schema Evolution and Syntax Description of Apache Hudi.
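For example, in a spark-sql session the setting is issued before the DDL statement. The table name hudi_tbl and the added column are hypothetical placeholders; only the set statement comes from the solution above:

```sql
-- Enable schema-on-read so that Hudi's schema evolution feature can
-- handle the subsequent ALTER TABLE statement.
set hoodie.schema.on.read.enable=true;

-- Hypothetical column addition on a hypothetical Hudi table.
ALTER TABLE hudi_tbl ADD COLUMNS (remark STRING);
```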