MaxCompute: Build a data lakehouse by using MaxCompute and Hadoop

Last Updated: Jan 25, 2024

MaxCompute allows you to build a data lakehouse by using MaxCompute and Hadoop for unified management, storage, and analysis of large amounts of data. The data lakehouse provides an integrated data platform that can process structured and semi-structured data and support high-concurrency data analysis scenarios. This topic describes how to build a data lakehouse by using MaxCompute and Hadoop and how to manage data lakehouse projects.

Prerequisites

  • MaxCompute is activated and a MaxCompute project is created. For more information, see Activate MaxCompute and Create a MaxCompute project.

    Note
    • If MaxCompute is already activated, you can use it directly. If MaxCompute is not activated, we recommend that you select the Hive-compatible data type edition when you activate MaxCompute.

    • If you want to build a data lakehouse by using MaxCompute and Hadoop, we recommend that the virtual private cloud (VPC) in which Hadoop is deployed be in the same region as MaxCompute. This way, you are not charged for cross-region network connections.

  • Before you build a data lakehouse by using MaxCompute and Hadoop, make sure that the high availability (HA) feature is enabled for Hadoop. For more information, contact O&M engineers of the Hadoop cluster.
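
The following commands are a minimal sketch of how you might verify these prerequisites. They assume that the MaxCompute client (odpscmd) is configured on your machine and that you have shell access to the master node of the Hadoop cluster; property names and command options can vary by version, so confirm them with your MaxCompute and Hadoop administrators.

    # On a machine where odpscmd is configured: list the project properties and
    # check the data type flags (for example, odps.sql.hive.compatible).
    odpscmd -e "setproject;"

    # On the master node of the Hadoop cluster: if HA is enabled, dfs.nameservices
    # is set and both NameNodes report an active or standby state.
    hdfs getconf -confKey dfs.nameservices
    hdfs haadmin -getAllServiceState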

Limits

  • The data lakehouse is supported in the following regions: China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Shenzhen), China (Hong Kong), Singapore, and Germany (Frankfurt).

  • The VPC in which Hadoop is deployed must be in the same region as MaxCompute.

Procedure

Step 1: Authorize MaxCompute to access your cloud resources

If you want to build a data lakehouse by using MaxCompute and Hadoop, you must authorize MaxCompute to create an elastic network interface (ENI) in your VPC. After the ENI is created, MaxCompute can connect to your VPC. To complete the authorization, log on to the Resource Access Management (RAM) console with the Alibaba Cloud account to which the VPC belongs and complete the authorization on the Cloud Resource Access Authorization page.

Step 2: Create a data lakehouse in the DataWorks console

  1. Log on to the DataWorks console, and select a region in which the data lakehouse is supported.

    Note

    For more information about the regions in which the data lakehouse is supported, see Limits.

  2. In the left-side navigation pane of the DataWorks console, click Lake and Warehouse Integration (Data Lakehouse).

  3. On the Lake and Warehouse Integration (Data Lakehouse) page, click Start.

  4. On the Create Data Lakehouse page, configure the parameters. The following tables describe the parameters.

    Table 1. Create Data Warehouse

    • External Project Name: the custom name of the external project. The name must meet the following requirements:

      • The name can contain only letters, digits, and underscores (_), and must start with a letter.

      • The name must be 1 to 128 characters in length.

      Note

      For more information about external projects, see Project.

    • MaxCompute Project: Select a MaxCompute project from the drop-down list. If no MaxCompute project exists, you can click Create Project in MaxCompute Console to create a project. For more information, see Create a MaxCompute project.

      Note

      If you cannot select a project from the MaxCompute Project drop-down list, you must associate the project with a DataWorks workspace in the DataWorks console. For more information, see Associate a compute engine with a workspace.

    Table 2. Create Data Lake Connection

    • Heterogeneous Data Platform Type: Select one of the following options:

      • Alibaba Cloud E-MapReduce/Hadoop Cluster: Select this option if you want to build a data lakehouse by using MaxCompute and Hadoop.

      • Alibaba Cloud DLF + OSS: Select this option if you want to build a data lakehouse by using MaxCompute, DLF, and OSS.

    The following parameters apply if you select Alibaba Cloud E-MapReduce/Hadoop Cluster:

    • Network Connection: the network connection from MaxCompute to the VPC in which the external data source is deployed. The external data source can be an Alibaba Cloud EMR Hadoop cluster or a self-managed Hadoop cluster. Select an existing connection from the drop-down list, or establish a new connection. For more information, see Step c ("Establish a network connection between MaxCompute and the destination VPC") of "Access over a VPC (dedicated connection)" in Network connection process.

      Note

      • You are not charged for network connections during the public preview phase of the data lakehouse.

      • For more information about network connections, see Terms.

    • External Data Source: the external data source object, which stores the information that MaxCompute requires to connect to the data source, such as Uniform Resource Locators (URLs), port numbers, and user authentication information. Select an EMR Hadoop cluster or a self-managed Hadoop cluster from the drop-down list, or create one. For more information, see Step 3 of "Create an external data source" in Manage external data sources.

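    Before you fill in Table 3, you can optionally confirm that the Hadoop service ports are reachable from the VPC that the network connection targets. The following is a minimal sketch run from an ECS instance in that VPC; the host name is a placeholder, and the ports are the common defaults described in Table 3.

      # Run from an ECS instance in the same VPC as the Hadoop cluster.
      # Replace emr-header-1.cluster-20**** with your NameNode or HMS host.
      nc -zv emr-header-1.cluster-20**** 8020    # NameNode RPC port
      nc -zv emr-header-1.cluster-20**** 9083    # Hive Metastore Service (HMS) port
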
    Table 3. Create External Data Source

    • Select MaxCompute Project: Select the desired MaxCompute project from the drop-down list. You can view the name of the MaxCompute project on the Projects page.

    • External Data Source Name: the custom name of the external data source. The name must meet the following requirements:

      • The name can contain only lowercase letters, digits, and underscores (_).

      • The name must be less than 128 characters in length.

    • Network Connection Object: the network connection from MaxCompute to the VPC in which the E-MapReduce (EMR) Hadoop cluster or the self-managed Hadoop cluster is deployed. For more information, see Network connection process.

    • NameNode Address: the IP addresses and port numbers of the active and standby NameNode processes in the Hadoop cluster. In most cases, the port number is 8020. For more information, contact the Hadoop cluster administrator.

    • HMS Service Address: the IP addresses and port numbers of the active and standby Hive Metastore Service (HMS) processes in the Hadoop cluster. In most cases, the port number is 9083. For more information, contact the Hadoop cluster administrator.

    • Cluster Name: the name of the cluster. For an HA Hadoop cluster, the value of this parameter is the same as the nameservice name used by the NameNode processes. For a self-managed Hadoop cluster, you can obtain the cluster name from the dfs.nameservices parameter in the hdfs-site.xml file.

    • Authentication Type: MaxCompute uses account mappings to obtain metadata and data from Hadoop clusters. The mapped Hadoop accounts may be protected by an authentication mechanism, such as Kerberos authentication. In this case, the files that contain the authentication information of the mapped accounts are required. Select an authentication type based on your business requirements. For more information, contact the O&M engineers of the Hadoop cluster.

      • No Authentication: Select this option if the Kerberos authentication mechanism is disabled for the Hadoop cluster.

      • Kerberos Authentication: Select this option if the Kerberos authentication mechanism is enabled for the Hadoop cluster.

        • Configuration File: Click Upload KRB5.conf File to upload the krb5.conf file of the Hadoop cluster.

          Note

          If the Hadoop cluster runs a Linux operating system, the krb5.conf file is stored in the /etc directory on the master node of the Hadoop cluster, where the HDFS NameNode processes run.

        • hmsPrincipals: the identity of the HMS service. You can run the list_principals command on the Kerberos terminal of the Hadoop cluster to obtain the value of this parameter. Example value:

          hive/emr-header-1.cluster-20****@EMR.20****.COM,hive/emr-header-2.cluster-20****@EMR.20****.COM

          Note

          The value of this parameter is a comma-delimited string. One principal is mapped to one HMS IP address.

        • Add Account Mapping

          • Account: the Alibaba Cloud account that can access the Hadoop cluster by using MaxCompute.

          • Kerberos Account: the Kerberos-authenticated Hadoop user account that is allowed to access the HMS service.

          • Upload File: Click Upload Keytab File to upload the keytab file of the Kerberos-authenticated Hadoop user account. For more information about how to create a keytab file, see Create a keytab configuration file.

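    The following commands illustrate one way to collect the values described in Table 3 from the Hadoop cluster. They are a sketch only: configuration file paths, service IDs, and keytab paths depend on your Hadoop distribution, so treat every path below as a placeholder.

      # On the master node: NameNode hosts and the nameservice name (Cluster Name).
      hdfs getconf -namenodes
      hdfs getconf -confKey dfs.nameservices

      # HMS address: hive.metastore.uris usually points to port 9083
      # (the path to hive-site.xml may vary by distribution).
      grep -A 1 "hive.metastore.uris" /etc/hive/conf/hive-site.xml

      # On the Kerberos terminal (KDC): list the HMS principals, verify the keytab
      # that you plan to upload, and locate the krb5.conf file.
      kadmin.local -q "list_principals hive/*"
      klist -kt /path/to/your.keytab
      ls -l /etc/krb5.conf
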
    Table 4. Create Data Mapping

    • External Data Source Object: By default, this parameter is set to the value of External Data Source.

    • Destination Database: the database in the Hadoop cluster.

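    Before you create the mapping, you can confirm that the destination database exists on the Hadoop side. The following is a minimal sketch, assuming that the Hive CLI or Beeline is available on the cluster; the HiveServer2 address is a placeholder.

      # The Destination Database must be one of the databases known to HMS.
      hive -e "SHOW DATABASES;"
      # Or, with Beeline:
      # beeline -u "jdbc:hive2://<hiveserver2-host>:10000" -e "SHOW DATABASES;"
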
Step 3: Manage the data lakehouse in the DataWorks console

Use the data lakehouse

  1. In the left-side navigation pane of the DataWorks console, click Lake and Warehouse Integration (Data Lakehouse).

  2. On the Lake and Warehouse Integration (Data Lakehouse) page, find the external project that you want to use.

    • Use the data lakehouse.

      Click Use Data Lakehouse in the Actions column of the external project. You can then query the Hive tables in the mapped database from MaxCompute, as shown in the sketch after this list.

    • Update the external project.

      Click Project configuration in the Actions column of the external project. In the Project configuration dialog box, update the information about the external project.

      Note

      You can change the database name of the external data source that is mapped to the external project, and select another external data source. If you want to delete an external data source, go to the Manage External Data Sources tab in the MaxCompute console, find the external data source, and click Delete in the Actions column. You cannot update an external data source.

  3. Delete the external project.

    Click Delete in the Actions column of the external project.

    Note

    The external project is logically deleted and enters the silent state. The external project is completely deleted after 15 days. You cannot create an external project with the same name during this period of time.
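
After you click Use Data Lakehouse, the Hive tables in the mapped database can be queried from MaxCompute through the external project. The following is a minimal sketch that uses the MaxCompute client (odpscmd); the project and table names are placeholders. Depending on the data types involved, you may also need to set Hive-compatibility flags in the session.

    # Query a Hive table through the external project from the associated MaxCompute project.
    # my_mc_project, ext_hadoop_project, and sale_detail are placeholder names.
    odpscmd --project=my_mc_project -e "SELECT * FROM ext_hadoop_project.sale_detail LIMIT 10;"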

View the metadata of an external project in the data lakehouse

  1. In the left-side navigation pane of the DataWorks console, click Workspaces.

  2. On the Workspaces page, find the workspace that is mapped to your external project and choose Shortcuts > Data Map in the Actions column.

  3. On the Data Map page, enter the table name in the search box and click Search. You can also go to the All Data tab, select your external project from the Project drop-down list, enter the table name in the search box, and then click Search.

    Note
    • The Apply for Permission and View Lineage features of the table are unavailable.

    • Table metadata is updated on a T+1 basis. Changes that you make to tables in the external data source that is mapped to the external project, such as a Hive database, are synchronized to Data Map of DataWorks one day later. Metadata in MaxCompute is updated in real time.

References

For more information about how to build a data lakehouse that supports the Delta Lake or Apache Hudi storage mechanism based on Hadoop clusters, see Delta Lake or Apache Hudi storage mechanism based on Hadoop clusters.