This topic describes how to use Hudi Metastore in E-MapReduce (EMR).

Background information

Each time Hudi performs an operation on data, an instant is added. When you query data, the metadata of each instant is read to obtain the valid partitions or files related to the instant. During the read process, the system spends a long time on partition listing and file listing operations because these operations cause heavy I/O workloads.

A data lake has unique metadata, such as instants and multiple versions of files, and its schema is different from the schema of a traditional table. Therefore, the EMR team introduced Hudi Metastore in the cloud to host the instant metadata of Hudi tables and designed a lifecycle management system for partitions and files. You can use Hudi Metastore to accelerate the listing of partitions and files.

Prerequisites

A cluster of EMR V3.45.0 or a later minor version, or a cluster of EMR V5.11.0 or a later minor version is created in the China (Hangzhou), China (Shanghai), or China (Beijing) region, and DLF Unified Metadata is selected for Metadata.

Parameters

To use Hudi Metastore, go to the Configure tab of the Hudi service page, click the hudi.default.conf tab, and then configure the parameters that are described in the following table.

Parameter                Description
hoodie.metastore.type    The implementation mode of Hudi metadata. Valid values:
                           • LOCAL: uses the native metadata tables in Hudi.
                           • METASTORE: uses the metadata tables of Hudi Metastore in EMR.
hoodie.metadata.enable   Specifies whether to use the native metadata tables in Hudi. Valid values:
                           • false: does not use the native metadata tables in Hudi.
                             Note: You can use the metadata tables of Hudi Metastore in EMR only if this parameter is set to false.
                           • true: uses the native metadata tables in Hudi.
You can set the parameters to different values based on your business requirements.
  • Do not use metadata tables
    hoodie.metastore.type=LOCAL
    hoodie.metadata.enable=false
  • Use the native metadata tables in Hudi
    hoodie.metastore.type=LOCAL
    hoodie.metadata.enable=true
  • Use the metadata tables of Hudi Metastore in EMR (default)
    hoodie.metastore.type=METASTORE
    hoodie.metadata.enable=false
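The three combinations above can be summarized programmatically. The following sketch is an illustration only, not part of Hudi or EMR; the function and variable names are assumptions. It maps each supported pair of values to its meaning and rejects unsupported pairs, such as METASTORE combined with hoodie.metadata.enable=true:

```python
# Illustrative helper (not a Hudi or EMR API): maps the supported
# hoodie.metastore.type / hoodie.metadata.enable combinations from the
# table above to a description, and rejects any other combination.

SUPPORTED_MODES = {
    ("LOCAL", "false"): "no metadata tables",
    ("LOCAL", "true"): "native metadata tables in Hudi",
    ("METASTORE", "false"): "metadata tables of Hudi Metastore in EMR (default)",
}

def describe_mode(conf: dict) -> str:
    """Return a description of the configured metadata mode, or raise
    ValueError for an unsupported combination."""
    key = (conf.get("hoodie.metastore.type", "LOCAL"),
           conf.get("hoodie.metadata.enable", "false"))
    if key not in SUPPORTED_MODES:
        raise ValueError(f"unsupported combination: {key}")
    return SUPPORTED_MODES[key]
```

For example, the default EMR configuration (METASTORE with hoodie.metadata.enable=false) is accepted, while METASTORE with hoodie.metadata.enable=true raises an error, which matches the note in the table above.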

Example

The following example describes how to use the metadata tables of Hudi Metastore in EMR and enable the acceleration feature of Hudi Metastore by using the Spark SQL CLI.

  1. Log on to your EMR cluster in SSH mode. For more information, see Log on to a cluster.
  2. Run the following command to open the Spark SQL CLI. In this example, Spark 3 is used.
    spark-sql --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
              --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
    If the output contains the following information, the Spark SQL CLI is opened:
    spark-sql>
  3. Run the following command to create a Hudi table:
    create table h0(
      id bigint,
      name string,
      price double
    ) using hudi
    tblproperties (
      primaryKey = 'id',
      preCombineField = 'id'
    ) location '/tmp/hudi_cases/h0';
  4. Run the following command to insert data into the table:
    insert into h0 select 1, 'a1', 10;
  5. Run the following command to exit the Spark SQL CLI:
    exit;
  6. Run the following command to view the hoodie.properties file in the .hoodie directory of the Hudi table:
    hdfs dfs -cat /tmp/hudi_cases/h0/.hoodie/hoodie.properties
    If the output contains hoodie.metastore.type=METASTORE and hoodie.metastore.table.id, Hudi Metastore is used.
    hoodie.metastore.catalog.id=
    hoodie.table.precombine.field=id
    hoodie.datasource.write.drop.partition.columns=false
    hoodie.table.type=COPY_ON_WRITE
    hoodie.archivelog.folder=archived
    hoodie.timeline.layout.version=1
    hoodie.table.version=5
    hoodie.metastore.type=METASTORE
    hoodie.table.recordkey.fields=id
    hoodie.datasource.write.partitionpath.urlencode=false
    hoodie.database.name=test_db
    hoodie.table.keygenerator.class=org.apache.hudi.keygen.NonpartitionedKeyGenerator
    hoodie.table.name=h0
    hoodie.datasource.write.hive_style_partitioning=true
    hoodie.metastore.table.id=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
    hoodie.table.checksum=3349919362
    hoodie.table.create.schema={"type"\:"record","name"\:"h0_record","namespace"\:"hoodie.h0","fields"\:[{"name"\:"_hoodie_commit_time","type"\:["string","null"]},{"name"\:"_hoodie_commit_seqno","type"\:["string","null"]},{"name"\:"_hoodie_record_key","type"\:["string","null"]},{"name"\:"_hoodie_partition_path","type"\:["string","null"]},{"name"\:"_hoodie_file_name","type"\:["string","null"]},{"name"\:"id","type"\:["long","null"]},{"name"\:"name","type"\:["string","null"]},{"name"\:"price","type"\:["double","null"]}]}
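Instead of scanning the full output by eye, the two relevant keys can also be checked programmatically. The following sketch is an illustrative helper, not part of Hudi; the function names are assumptions. It parses the key=value lines of a hoodie.properties dump and reports whether Hudi Metastore is in use, that is, whether hoodie.metastore.type is METASTORE and hoodie.metastore.table.id is non-empty:

```python
# Illustrative helper (not a Hudi API): parse a hoodie.properties dump,
# such as the output of the `hdfs dfs -cat` command above, and check the
# two keys that indicate Hudi Metastore usage.

def parse_properties(text: str) -> dict:
    """Parse key=value lines, ignoring blank lines and # comments.
    Only the first '=' on each line separates the key from the value."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props

def uses_hudi_metastore(text: str) -> bool:
    """Return True if the properties indicate that Hudi Metastore is used."""
    props = parse_properties(text)
    return (props.get("hoodie.metastore.type") == "METASTORE"
            and bool(props.get("hoodie.metastore.table.id")))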