E-MapReduce: Use a Trino cluster

Last Updated: Jun 27, 2023

After you create a Trino cluster in the E-MapReduce (EMR) console, you must configure the cluster before you can use it. This topic describes how to configure connectors and a metadata storage center for data in data lakes so that you can use the Trino cluster that you created.

Background information

To use the Trino service, you can create a DataLake cluster, custom cluster, or Hadoop cluster with the Trino service deployed, or you can create a Trino cluster in the EMR console. A Trino cluster has the following features:

  • Allocates exclusive resources to Trino so that other services are unlikely to affect Trino performance.

  • Supports auto scaling.

  • Supports the analysis of data in data lakes and real-time data warehousing.

  • Stores no data.

Note
  • Hudi and Iceberg do not run as processes and do not occupy cluster resources.

  • If you no longer use Hue, JindoData, or SmartData, you can stop these services.

To use a Trino cluster, you must first create a DataLake cluster, custom cluster, or Hadoop cluster, or use an existing cluster of one of these types, as a data cluster.

After you create a Trino cluster, you need to perform the following operations:

Configure connectors

This section describes how to configure query objects in the connectors that you want to use. In this example, a Hive connector is used.

  1. Go to the Services tab.

    1. Log on to the EMR console. In the left-side navigation pane, click EMR on ECS.

    2. In the top navigation bar, select the region where your cluster resides and select a resource group based on your business requirements.

    3. On the EMR on ECS page, find the desired cluster and click Services in the Actions column.

  2. On the Services tab, find the Trino service and click Configure.

  3. Modify parameters.

    1. On the Configure tab, click hive.properties.

    2. Change the value of the hive.metastore.uri parameter to the value that is configured for the same parameter of the Trino service deployed in your data cluster. An example is shown below.
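
      For example, the entry in hive.properties might look like the following. The hostname is a placeholder, and 9083 is the default Hive metastore port; use the actual value from your data cluster.

        hive.metastore.uri=thrift://master-1-1.c-f613970e8c****:9083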

  4. Save the configuration.

    1. Click Save.

    2. In the dialog box that appears, configure the parameters and click Save.

  5. Make the configuration take effect.

    1. Click Deploy Client Configuration.

    2. In the dialog box that appears, configure the parameters and click OK.

    3. In the Confirm message, click OK.

  6. Restart the Trino service. For more information, see Restart a service.

  7. Configure host information.

    Important

    If all the data that you want to query is stored in Object Storage Service (OSS), or if you configure the Location parameter when you execute the CREATE TABLE statement, you do not need to configure host information.

    Some Hive tables may be stored in the default directory of your data cluster. When you query data that is stored in your data cluster, you must configure the host information of the master node of the data cluster on each node of the Trino cluster. This ensures that data can be read from the tables during queries.

    • Method 1: Log on to the EMR console and add a cluster script or a bootstrap action to configure the host information. For more information, see Manually run scripts or Manage bootstrap actions. We recommend that you use this method.
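
      A minimal sketch of such a script is shown below. It appends the host entry of the master node of your data cluster to the /etc/hosts file on each node; the IP address and hostname are placeholders (see Method 2 for how to obtain them).

        #!/bin/bash
        # Append the host entry of the data cluster's master node.
        # Replace the IP address and hostname with your own values.
        echo "192.168.**.** emr-header-1.cluster-26****" >> /etc/hosts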

    • Method 2: Directly modify the hosts file. Perform the following steps:

      1. Obtain the internal IP address of the master node of your data cluster. To view the IP address, go to the Nodes tab of your data cluster in the EMR console, find the master node group, and click the add icon.

      2. Log on to your data cluster. For more information, see Log on to a cluster.

      3. Run the hostname command to obtain the hostname of the master node.

        For example, the hostname is in one of the following formats:

        • Hadoop cluster: emr-header-1.cluster-26****

        • Other types of clusters: master-1-1.c-f613970e8c****

      4. Log on to the Trino cluster. For more information, see Log on to a cluster.

      5. Run the following command to edit the hosts file:

        vim /etc/hosts
      6. Add the following content to the end of the hosts file:

        Add the internal IP address and hostname of the master node of your data cluster to the hosts file that is stored in the /etc/ directory of each node of the Trino cluster.

        • Hadoop cluster

          192.168.**.** emr-header-1.cluster-26****
        • Other types of clusters

          192.168.**.** master-1-1.c-f613970e8c****
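
        After you update the hosts file, you can verify that the hostname resolves correctly on each node of the Trino cluster. For example, run the following command with your own hostname:

          getent hosts emr-header-1.cluster-26****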

Configure a metadata storage center for data in data lakes

Note

A metadata storage center for data in data lakes can be automatically configured when you create an EMR cluster of V3.45.0 or a later minor version, or V5.11.0 or a later minor version.

If the Metadata parameter of your data cluster is set to DLF Unified Metadata, you must configure the connectors that you use, such as the Hive, Iceberg, and Hudi connectors. In this case, data queries no longer depend on your data cluster, because Trino can directly access Data Lake Formation (DLF) metadata within the same account. You can configure the hive.metastore.uri parameter based on your business requirements.

The following parameters are used to configure a metadata storage center for data in data lakes.

  • hive.metastore: the type of the Hive metastore. Set the value to DLF.

  • dlf.catalog.id: the ID of the DLF catalog. By default, this parameter is set to the ID of your Alibaba Cloud account.

  • dlf.catalog.region: the ID of the region in which DLF is activated. For more information, see Supported regions and endpoints.

    Note: Make sure that the value of this parameter matches the endpoint specified by the dlf.catalog.endpoint parameter.

  • dlf.catalog.endpoint: the endpoint of DLF. For more information, see Supported regions and endpoints. We recommend that you set this parameter to a VPC endpoint of DLF. For example, if you select the China (Hangzhou) region, set this parameter to dlf-vpc.cn-hangzhou.aliyuncs.com.

    Note: You can also use a public endpoint of DLF. For example, if you select the China (Hangzhou) region, set this parameter to dlf.cn-hangzhou.aliyuncs.com.

  • dlf.catalog.akMode: the AccessKey mode of the DLF service. We recommend that you set this parameter to EMR_AUTO.

  • dlf.catalog.proxyMode: the proxy mode of the DLF service. We recommend that you set this parameter to DLF_ONLY.

  • dlf.catalog.uid: the ID of your Alibaba Cloud account. To obtain the ID, go to the Security Settings page.
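
A minimal hive.properties sketch that combines these parameters might look like the following. The account ID is a placeholder, and the China (Hangzhou) region is used as an example; substitute your own values.

    hive.metastore=DLF
    dlf.catalog.id=1234xxxx
    dlf.catalog.region=cn-hangzhou
    dlf.catalog.endpoint=dlf-vpc.cn-hangzhou.aliyuncs.com
    dlf.catalog.akMode=EMR_AUTO
    dlf.catalog.proxyMode=DLF_ONLY
    dlf.catalog.uid=1234xxxx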

Example: Query data in a table

  1. Run commands to access Trino. For more information, see Log on to the Trino console by running commands.
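
    For reference, the open source Trino CLI is typically started with a command like the following. The server address, port, catalog, and schema here are placeholders; the exact command on your cluster may differ.

      trino --server <master-node-address>:<port> --catalog hive --schema default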

  2. Run the following command to query data in the test_hive table:

    select * from hive.default.test_hive;
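
You can also run standard Trino statements to verify that the connector is configured as expected. For example:

    show catalogs;
    show tables from hive.default;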