
E-MapReduce: Hudi connector

Last Updated: Dec 06, 2023

Hudi is a data lake framework that allows you to update and delete data in Hadoop-compatible file systems. Hudi also allows you to consume changed data. E-MapReduce (EMR) Trino integrates the required Hudi JAR packages into an independent Hudi connector. You can use the EMR Hudi connector to query data in Copy on Write and Merge on Read tables.

Background information

For more information about EMR Hudi, see Hudi.

Prerequisites

A DataLake cluster or Hadoop cluster is created, and the Presto service is selected. For more information, see Create a cluster.

Limits

  • The Hudi connector is supported only by DataLake clusters of all versions and by Hadoop clusters of EMR V3.38.0 or later.

  • For Copy on Write tables, the Hudi connector can query only snapshots of the tables.

  • For Merge on Read tables, the Hudi connector supports snapshot queries and read-optimized queries only in specific scenarios. Exercise caution when you use the Hudi connector for these queries in a production environment.

  • Incremental queries are not supported.

Configure the Hudi connector

Modify the configurations of the Hudi connector. For more information, see Modify the configurations of a built-in connector.

Use Hive metadata

To view the default configuration of the Hudi connector, go to the Configure tab of the Trino service page in the EMR console and click the hudi.properties tab. Modify the following parameters based on your business requirements.

hive.metastore.uri

The Uniform Resource Identifier (URI) that is used to access the Hive metastore based on the Thrift protocol.

  • DataLake and custom clusters: Set this parameter to a value in the thrift://master-1-1.cluster-24****:9083 format.

  • Hadoop clusters: Set this parameter to a value in the thrift://emr-header-1.cluster-24****:9083 format.

hive.config.resources

Hadoop Distributed File System (HDFS) configuration files. Separate the names of configuration files with commas (,). You must make sure that the configuration files exist on all the hosts where Trino is running.

Important

You must configure this parameter if you want to access HDFS.

  • DataLake and custom clusters: The default value is /etc/emr/hadoop-conf/core-site.xml,/etc/emr/hadoop-conf/hdfs-site.xml.

  • Hadoop clusters: The default value is /etc/ecm/hadoop-conf/core-site.xml,/etc/ecm/hadoop-conf/hdfs-site.xml.

hive.hdfs.impersonation.enabled

Specifies whether to enable user impersonation. Valid values:

  • true: enables user impersonation. This is the default value.

  • false: disables user impersonation.
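
For reference, the following snippet shows what a hudi.properties file might look like for a DataLake cluster that uses the Hive metastore. This is a minimal sketch: the connector.name entry and the host name are assumptions, so keep the values that are already generated for your cluster and change only the parameters that you need.

    # Hypothetical hudi.properties for a DataLake cluster. Values are examples only.
    connector.name=hudi
    # Thrift URI of the Hive metastore on the master node of the cluster.
    hive.metastore.uri=thrift://master-1-1.cluster-24****:9083
    # HDFS configuration files. Required if the connector needs to access HDFS.
    hive.config.resources=/etc/emr/hadoop-conf/core-site.xml,/etc/emr/hadoop-conf/hdfs-site.xml
    # User impersonation is enabled by default.
    hive.hdfs.impersonation.enabled=true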

Use DLF unified metadata

If the Metadata parameter of your data cluster is set to DLF Unified Metadata, configure connectors such as Hive, Iceberg, and Hudi accordingly. In this case, data queries no longer depend on your data cluster, and you can configure the hive.metastore.uri parameter based on your business requirements. Trino can directly access Data Lake Formation (DLF) metadata within the same account.

The following parameters are used to configure a metadata storage center for data in data lakes.

hive.metastore

The type of the Hive metastore. This parameter is fixed to DLF.

dlf.catalog.region

The ID of the region in which Data Lake Formation (DLF) is activated. For more information, see Supported regions and endpoints.

Note Make sure that the value of this parameter matches the endpoint specified by the dlf.catalog.endpoint parameter.

dlf.catalog.endpoint

The endpoint of the DLF service. For more information, see Supported regions and endpoints. We recommend that you set the dlf.catalog.endpoint parameter to a VPC endpoint of DLF. For example, if you select the China (Hangzhou) region, set the dlf.catalog.endpoint parameter to dlf-vpc.cn-hangzhou.aliyuncs.com.

Note You can also use a public endpoint of DLF. If you select the China (Hangzhou) region, set the dlf.catalog.endpoint parameter to dlf.cn-hangzhou.aliyuncs.com.

dlf.catalog.akMode

The AccessKey mode of the DLF service. We recommend that you set this parameter to EMR_AUTO.

dlf.catalog.proxyMode

The proxy mode of the DLF service. We recommend that you set this parameter to DLF_ONLY.

dlf.catalog.uid

The ID of your Alibaba Cloud account. To obtain the ID of your Alibaba Cloud account, go to the Security Settings page.
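
For reference, the following snippet is a minimal sketch of a hudi.properties configuration that uses DLF unified metadata in the China (Hangzhou) region. The connector.name entry and the account ID are placeholders; the other values follow the recommendations in this section.

    # Hypothetical hudi.properties that uses DLF unified metadata. Values are examples only.
    connector.name=hudi
    # Use DLF as the metastore and point the connector to the VPC endpoint of the selected region.
    hive.metastore=DLF
    dlf.catalog.region=cn-hangzhou
    dlf.catalog.endpoint=dlf-vpc.cn-hangzhou.aliyuncs.com
    dlf.catalog.akMode=EMR_AUTO
    dlf.catalog.proxyMode=DLF_ONLY
    # Replace with the ID of your Alibaba Cloud account.
    dlf.catalog.uid=1234567890123456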

Example

Hudi tables serve as external tables of Hive. You can use the Hive connector to query data in Hudi tables. For more information about how to create a Hudi table and how to synchronize data from a Hudi table to a Hive table, see Integrate Hudi with Spark SQL and Basic usage.

Example of creating a table and querying data in the table:

  1. Log on to your cluster in SSH mode. For more information, see Log on to a cluster.

  2. Run the following command to start the Spark SQL CLI:

    spark-sql --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
    --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'

    If the output contains the following information, the Spark SQL CLI is started:

    spark-sql>
  3. Run the following command to create a test table named emr_test:

    create table if not exists emr_test(
      id bigint,
      name string,
      price double
    ) using hudi
    options (
      type = 'mor',
      primaryKey = 'id,name'
    );
  4. Run the following commands to insert test data into the emr_test table:

    insert into emr_test select 1, 'a2', 10;
    insert into emr_test select 1, 'a1', 10;
    insert into emr_test select 2, 'a1', 20;
    Note

    EMR Spark SQL automatically synchronizes the metadata of Hudi tables to the Hive metastore or DLF.

  5. Query data in the Trino client.

    1. Run the following command to open the Trino CLI:

      trino --server master-1-1:9090 --catalog hudi --schema default --user hadoop
    2. Run the following command to query data in the emr_test table:

      select * from emr_test;

      The following output is returned:

       _hoodie_commit_time | _hoodie_commit_seqno | _hoodie_record_key | _hoodie_partition_path |                            _hoodie_file_name                            | id | name | price
      ---------------------+----------------------+--------------------+------------------------+-------------------------------------------------------------------------+----+------+-------
       20211025145616      | 20211025145616_0_1   | id:1,name:a2       |                        | ac4ec1e6-528d-4189-bde6-d09e137f63f6-0_0-20-1604_20211025145616.parquet |  1 | a2   |  10.0
       20211025145629      | 20211025145629_0_1   | id:1,name:a1       |                        | ac4ec1e6-528d-4189-bde6-d09e137f63f6-0_0-48-3211_20211025145629.parquet |  1 | a1   |  10.0
       20211025145640      | 20211025145640_0_2   | id:2,name:a1       |                        | ac4ec1e6-528d-4189-bde6-d09e137f63f6-0_0-76-4818_20211025145640.parquet |  2 | a1   |  20.0
      (3 rows)
  6. Update data in the Spark SQL CLI.

    1. Run the following command to start the Spark SQL CLI:

      spark-sql --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
      --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'

      If the output contains the following information, the Spark SQL CLI is started:

      spark-sql>
    2. Run the following command to update the value of price for the data entry whose ID is 2:

      update emr_test set price = price + 20 where id = 2;
  7. After the update is complete, query data in the Trino client.

    1. Run the following command to open the Trino CLI:

      trino --server master-1-1:9090 --catalog hudi --schema default --user hadoop
    2. Run the following command to query data in the emr_test table:

      select * from emr_test;

      The following output is returned:

       _hoodie_commit_time | _hoodie_commit_seqno | _hoodie_record_key | _hoodie_partition_path |                            _hoodie_file_name                            | id | name | price
      ---------------------+----------------------+--------------------+------------------------+-------------------------------------------------------------------------+----+------+-------
       20211025145616      | 20211025145616_0_1   | id:1,name:a2       |                        | ac4ec1e6-528d-4189-bde6-d09e137f63f6-0_0-20-1604_20211025145616.parquet |  1 | a2   |  10.0
       20211025145629      | 20211025145629_0_1   | id:1,name:a1       |                        | ac4ec1e6-528d-4189-bde6-d09e137f63f6-0_0-48-3211_20211025145629.parquet |  1 | a1   |  10.0
       20211025145640      | 20211025145640_0_2   | id:2,name:a1       |                        | ac4ec1e6-528d-4189-bde6-d09e137f63f6-0_0-76-4818_20211025145640.parquet |  2 | a1   |  40.0
      (3 rows)
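
As mentioned in this topic, a Hudi table that is synchronized to the Hive metastore can also be queried by using the Hive connector. The following commands are a sketch under an assumption: the emr_test_ro table name assumes the default Hive sync naming for Merge on Read tables, and the table names in your metastore may differ.

    trino --server master-1-1:9090 --catalog hive --schema default --user hadoop

    -- List the tables that were synchronized to the Hive metastore.
    show tables;
    -- Query the read-optimized view of the Merge on Read table (name assumed).
    select * from emr_test_ro;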