After you create a Presto cluster in the E-MapReduce (EMR) console, you must configure the cluster before you can use it. This topic describes how to configure connectors and a metadata storage center for data in data lakes so that you can use the Presto cluster that you created.

Background information

To use the Presto service, create a Hadoop cluster with the Presto service deployed or create a Presto cluster in the EMR console. A Presto cluster contains only the following required services: SmartData, Presto, Hudi, Iceberg, and Hue. The Presto cluster has the following features:
  • Allocates exclusive resources to Presto so that other services have minimal impact on the operation of Presto.
  • Supports auto scaling.
  • Supports analysis of data in data lakes and real-time data warehousing.
  • Does not store data.
Note
  • Hudi and Iceberg are libraries rather than running processes and do not occupy cluster resources.
  • If you do not need to use the Hue and SmartData services in the cluster, you can stop the services.

Before you use a Presto cluster, create a Hadoop cluster or use an existing Hadoop cluster as the data cluster.

After you create a Presto cluster, perform the operations described in the following sections.

Prerequisites

A Presto cluster is created in the EMR console. For more information, see Create a cluster.

Limits

  • Only EMR V3.38.0 and later minor versions support Presto clusters.
  • The Presto cluster must be deployed in the same virtual private cloud (VPC) as the Hadoop cluster.

Configure connectors

Configure the query objects for the connectors that you want to use. In this example, the Hive connector is used.

  1. Go to the Presto service page of the Presto cluster.
    1. Log on to the Alibaba Cloud EMR console.
    2. In the top navigation bar, select the region where your cluster resides and select a resource group based on your business requirements.
    3. Click the Cluster Management tab.
    4. On the Cluster Management page, find your cluster and click Details in the Actions column.
    5. In the left-side navigation pane of the page that appears, choose Cluster Service > Presto.
  2. Modify configuration items.
    1. On the Presto service page, click the Configure tab.
    2. In the Service Configuration section, click the hive.properties tab.
    3. Find the hive.metastore.uri parameter and change its value to the Uniform Resource Identifier (URI) of the accessible Hive metastore service in the Hadoop cluster.
      Note
      • The Hive connector provides the hive.config.resources parameter, which specifies the configuration files to load. A Presto cluster does not contain components such as Hadoop. Therefore, you can configure the hive.config.resources parameter to load the core-site, hdfs-site, and hbase-site configuration files. By default, these files are stored in the etc directory under the Presto root directory. On the Configure tab of the Presto service page in the EMR console, you can view or modify the content of these files. You can also add custom configuration items.
      • By default, Hadoop clusters provide the built-in core-site, hdfs-site, and hbase-site configuration files. You can modify the content of these files on the Configure tab of the corresponding service pages. If you have already modified these files on the Configure tab of one of the service pages, you do not need to configure them again on the Configure tab of the Presto service page.
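      The following snippet is a minimal sketch of what the relevant settings in hive.properties can look like. The hostname, the port 9083 (the typical default port of the Hive metastore), and the file paths are placeholders based on the preceding description; replace them with the actual values of your environment.
        hive.metastore.uri=thrift://emr-header-1.cluster-26****:9083
        hive.config.resources=<Presto root directory>/etc/core-site.xml,<Presto root directory>/etc/hdfs-site.xml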
  3. Save configurations.
    1. Click Save in the upper-right corner of the Service Configuration section.
    2. In the Confirm Changes dialog box, configure the parameters and click OK.
  4. Make the configurations take effect.
    1. Click Deploy Client Configuration in the upper-right corner of the Service Configuration section.
    2. In the Cluster Activities dialog box, configure the parameters and click OK.
    3. In the Confirm message, click OK.
  5. Restart the Presto service. For more information, see Restart a service.
  6. Configure the hosts file.
    Notice If all the data that you want to query is stored in Object Storage Service (OSS) buckets or the Location parameter is specified when you execute the CREATE TABLE statement, you do not need to configure the hosts file.

    By default, the storage locations of some Hive tables are paths that start with the hostname emr-header-1.cluster. To query data that is stored in the Hadoop cluster, configure the hosts file on each node of the Presto cluster so that the nodes can resolve this hostname and read data from these tables.

    • Method 1: Log on to the EMR console and add a cluster script or a bootstrap action to configure the hosts file. For more information, see Cluster scripts or Add a bootstrap action. We recommend that you use this method. A sample script is provided after this procedure.
    • Method 2: Directly modify the hosts file. Perform the following steps to modify the hosts file:
      1. Obtain the internal IP address of the master node of the Hadoop cluster. Log on to the EMR console. Find the Hadoop cluster on the Cluster Management page and click Details in the Actions column. In the Instance Info section of the Cluster Overview page, you can view the internal IP addresses of all nodes.
      2. Log on to the Hadoop cluster. For more information, see Log on to a cluster.
      3. Run the hostname command to obtain the hostname of the master node.

        The hostname is in a format that is similar to emr-header-1.cluster-26****.

      4. Log on to the Presto cluster. For more information, see Log on to a cluster.
      5. Run the following command to edit the hosts file:
        vim /etc/hosts
      6. Add the internal IP address and hostname of the master node of the Hadoop cluster to the end of the /etc/hosts file. Repeat this operation on each node of the Presto cluster. Example:
        192.168.**.** emr-header-1.cluster-26****
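    If you use Method 1, the cluster script or bootstrap action runs on the nodes of the Presto cluster and can add the mapping for you. The following script is a minimal sketch; the IP address and hostname are the placeholder values from the preceding steps and must be replaced with the actual values of your Hadoop cluster.
      #!/bin/bash
      # Append the mapping of the Hadoop master node to /etc/hosts if it is not already present.
      HOSTS_ENTRY="192.168.**.** emr-header-1.cluster-26****"
      grep -qF "$HOSTS_ENTRY" /etc/hosts || echo "$HOSTS_ENTRY" >> /etc/hosts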

Configure a metadata storage center for data in data lakes

If you select Data Lake Metadata for the Type parameter when you create the Hadoop cluster, you must perform additional configuration for connectors such as the Hive, Iceberg, and Hudi connectors to access data in data lakes.
The following parameters are used to configure a metadata storage center for data in data lakes:
  • hive.metastore: the type of the Hive metastore. This parameter is fixed to DLF.
  • dlf.catalog.region: the ID of the region in which Data Lake Formation (DLF) is activated. For more information, see Supported regions and endpoints.
    Note Make sure that the value of this parameter matches the endpoint specified by the dlf.catalog.endpoint parameter.
  • dlf.catalog.endpoint: the endpoint of the DLF service. For more information, see Supported regions and endpoints. We recommend that you set this parameter to a VPC endpoint of DLF. For example, if you select the China (Hangzhou) region, set this parameter to dlf-vpc.cn-hangzhou.aliyuncs.com.
    Note You can also use a public endpoint of DLF. If you select the China (Hangzhou) region, set this parameter to dlf.cn-hangzhou.aliyuncs.com.
  • dlf.catalog.akMode: the AccessKey mode of the DLF service. We recommend that you set this parameter to EMR_AUTO.
  • dlf.catalog.proxyMode: the proxy mode of the DLF service. We recommend that you set this parameter to DLF_ONLY.
  • dlf.catalog.uid: the ID of your Alibaba Cloud account. To obtain the ID of your Alibaba Cloud account, go to the Security Settings page.
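For example, the following snippet shows what these settings might look like for a cluster in the China (Hangzhou) region that uses a VPC endpoint. The account ID 19****89 is a placeholder; replace it with the ID of your Alibaba Cloud account.
  hive.metastore=DLF
  dlf.catalog.region=cn-hangzhou
  dlf.catalog.endpoint=dlf-vpc.cn-hangzhou.aliyuncs.com
  dlf.catalog.akMode=EMR_AUTO
  dlf.catalog.proxyMode=DLF_ONLY
  dlf.catalog.uid=19****89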

Example: Query data in a table

  1. Log on to the Presto cluster. For more information, see Log on to a cluster.
  2. Run the following command to start the Presto client and connect to the Presto service:
    presto --server emr-header-1:9090
  3. Run the following command to query data in the test_hive table:
    select * from hive.default.test_hive;
    The following output is returned:
     id
    ----
     3
     2
     1
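    In the query, the table is referenced in the catalog.schema.table format: hive is the catalog that corresponds to hive.properties, default is the database, and test_hive is the table. If you prefer to omit the prefix, the Presto client also accepts the --catalog and --schema options. The following sketch assumes the same cluster and table as the preceding example:
    presto --server emr-header-1:9090 --catalog hive --schema default
    Then, you can query the table without the prefix:
    select * from test_hive;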