MaxCompute provides a lakehouse solution that enables you to build a data management platform that combines data lakes and data warehouses. This solution integrates the flexibility and broad ecosystem compatibility of data lakes with the enterprise-grade deployment and management capabilities of data warehouses. This topic describes how to use MaxCompute and heterogeneous data platforms to implement the lakehouse solution. The lakehouse solution is in public preview.

Background information

In the lakehouse solution, MaxCompute serves as the data warehouse and is integrated with a data lake. The solution can be implemented in the following two scenarios:
  • The lakehouse solution is implemented by integrating MaxCompute with Data Lake Formation (DLF) and Object Storage Service (OSS). In this scenario, the metadata of the data lake is stored in DLF. MaxCompute can use the metadata management capabilities of DLF to efficiently process semi-structured data in OSS, including data in the Delta Lake, Apache Hudi, Avro, CSV, JSON, Parquet, and ORC formats.
  • The lakehouse solution is implemented by integrating MaxCompute with a Hadoop cluster. In this scenario, you can use a Hadoop cluster that is deployed in a data center, on virtual machines (VMs) in the cloud, or in Alibaba Cloud E-MapReduce (EMR). If MaxCompute is connected to the virtual private cloud (VPC) in which the Hadoop cluster is deployed, MaxCompute can directly access the Hive metastore and map its metadata to a MaxCompute external project.
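
After the metadata is mapped, tables in the data lake can be queried with standard MaxCompute SQL by qualifying the table name with the external project name. The following minimal sketch uses PyODPS, the MaxCompute SDK for Python; the credentials, endpoint, project, and table names are placeholders for illustration only.

    # Query a data lake table through a MaxCompute external project by using PyODPS.
    # All names and credentials below are placeholders.
    from odps import ODPS

    o = ODPS(
        "<your-access-key-id>",
        "<your-access-key-secret>",
        project="my_mc_project",              # hypothetical MaxCompute project
        endpoint="<your-maxcompute-endpoint>",
    )

    # Tables in the data lake are referenced as <external_project>.<table>,
    # so MaxCompute SQL can read them without copying the data.
    instance = o.execute_sql(
        "SELECT * FROM ext_lakehouse_project.sample_table LIMIT 10;"
    )
    with instance.open_reader() as reader:
        for record in reader:
            print(record)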

Limits

  • The lakehouse solution is supported in the following regions: China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Shenzhen), Singapore, and Germany (Frankfurt).
  • MaxCompute can be deployed in a different region from OSS and DLF, but OSS and DLF must be deployed in the same region.

Prerequisites

Before you implement the lakehouse solution, make sure that the following prerequisites are met:
  • MaxCompute is activated and a MaxCompute project is created. For more information, see Activate MaxCompute and Create a MaxCompute project.
    Note

    If MaxCompute is already activated, you can use it directly. If MaxCompute is not activated, we recommend that you select the Hive-compatible data type edition when you activate MaxCompute. A sketch of the related session-level flags is provided after this list.

    If you want to implement the lakehouse solution by integrating MaxCompute with a Hadoop cluster, we recommend that the VPC in which the Hadoop cluster is deployed be in the same region as MaxCompute. This way, you are not charged for cross-region network connections.

  • Before you implement the lakehouse solution by integrating MaxCompute with DLF and OSS, make sure that the following prerequisites are met:
    • DLF is activated. You can activate DLF on the buy page of DLF.
    • OSS is activated. For more information, see Activate OSS.
  • Before you implement the lakehouse solution by integrating MaxCompute with a Hadoop cluster, make sure that the high availability (HA) feature is enabled for the Hadoop cluster. For more information, contact O&M engineers of the Hadoop cluster.
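
If your MaxCompute project was not created with the Hive-compatible data type edition, comparable behavior can usually be enabled per statement with session-level flags. The following minimal sketch sets the commonly used odps.sql.type.system.odps2 and odps.sql.hive.compatible flags through PyODPS hints; treat the flag names as assumptions and confirm them against your project configuration before relying on them.

    # Run a statement with Hive-compatible settings enabled at the session level.
    # The flag names are assumptions; verify them for your environment.
    from odps import ODPS

    o = ODPS(
        "<your-access-key-id>",
        "<your-access-key-secret>",
        project="my_mc_project",              # hypothetical MaxCompute project
        endpoint="<your-maxcompute-endpoint>",
    )

    hints = {
        "odps.sql.type.system.odps2": "true",  # MaxCompute V2.0 data types
        "odps.sql.hive.compatible": "true",    # Hive-compatible behavior
    }

    # The hints apply only to this statement, not to the project configuration.
    o.execute_sql("SELECT * FROM sample_table LIMIT 10;", hints=hints)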

Step 1: Authorize MaxCompute to access your cloud resources

  • If you want to implement the lakehouse solution by integrating MaxCompute with a Hadoop cluster, complete authorization by using the following method:

    Authorize MaxCompute to create an elastic network interface (ENI) in your VPC. After the ENI is created, MaxCompute is connected to your VPC. You can use the Alibaba Cloud account to which the VPC belongs to log on to the Resource Access Management (RAM) console and complete authorization on the Cloud Resource Access Authorization page.

  • If you want to implement the lakehouse solution by integrating MaxCompute with DLF and OSS, complete authorization by using one of the following methods:
    The account that is used to create the MaxCompute project cannot access DLF without authorization.
    • One-click authorization: If you use the same account to create the MaxCompute project and deploy DLF, we recommend that you perform one-click authorization on the Cloud Resource Access Authorization page in the RAM console.
    • Custom authorization: You can use this method regardless of whether the same account is used to create the MaxCompute project and deploy DLF. For more information, see Authorize a RAM user to access DLF.

Step 2: Create a lakehouse in the DataWorks console

  1. Log on to the DataWorks console, and select a region in which the lakehouse solution is supported.
    Note For more information about the regions in which the lakehouse solution is supported, see Limits.
  2. In the left-side navigation pane of the DataWorks console, click Lake and Warehouse Integration (Data Lakehouse).
  3. On the Lake and Warehouse Integration (Data Lakehouse) page, click Start.
  4. On the Create Data Lakehouse page, configure the parameters. The following tables describe the parameters.
    Table 1. Create Data Warehouse
    • External Project Name: The custom name of the external project. The name must meet the following requirements:
      • The name can contain only letters, digits, and underscores (_), and must start with a letter.
      • The name must be 1 to 128 characters in length.
      Note For more information about external projects, see Project.
    • MaxCompute Project: Select a MaxCompute project from the drop-down list. If no MaxCompute project exists, you can click Create Project in MaxCompute Console to create a project. For more information, see Create a MaxCompute project.
    Table 2. Create Data Lake Connection
    • Heterogeneous Data Platform Type: The type of the heterogeneous data platform. Valid values:
      • Alibaba Cloud E-MapReduce/Hadoop Cluster: Select this option if you want to implement the lakehouse solution by integrating MaxCompute with a Hadoop cluster.
      • Alibaba Cloud DLF + OSS: Select this option if you want to implement the lakehouse solution by integrating MaxCompute with DLF and OSS.
    • Parameters for Alibaba Cloud E-MapReduce/Hadoop Cluster:
      • Network Connection: Select a connection from MaxCompute to the VPC in which the external data source is deployed from the drop-down list. The external data source can be an Alibaba Cloud EMR Hadoop cluster or a self-managed Hadoop cluster. You can also establish a new connection. For more information about this parameter, see Step 3 in VPC connection scheme.
        Note
        • You are not charged for network connections in the public preview phase of the lakehouse solution.
        • For more information about network connections, see Terms.
      • External Data Source: The external data source stores the information that is required for MaxCompute to connect to it, including Uniform Resource Locators (URLs), port numbers, and user authentication information. Select an EMR Hadoop cluster or a self-managed Hadoop cluster from the drop-down list. You can also create an EMR Hadoop cluster or a self-managed Hadoop cluster. For more information about this parameter, see Step 3 of "Create an external data source" in Manage external data sources.
    • Parameters for Alibaba Cloud DLF + OSS:
      • External Project Description: Optional. The description of the external project.
      • Region Where DLF Is Activated: The ID of the region where DLF resides. Valid values:
        • China (Hangzhou): cn-hangzhou
        • China (Shanghai): cn-shanghai
        • China (Beijing): cn-beijing
        • China (Shenzhen): cn-shenzhen
        • China (Zhangjiakou): cn-zhangjiakou
        • Singapore: ap-southeast-1
        • Germany (Frankfurt): eu-central-1
      • DLF Endpoint: The internal endpoint of DLF. Select an endpoint based on the region. Valid values:
        • China (Hangzhou): dlf-share.cn-hangzhou.aliyuncs.com
        • China (Shanghai): dlf-share.cn-shanghai.aliyuncs.com
        • China (Beijing): dlf-share.cn-beijing.aliyuncs.com
        • China (Zhangjiakou): dlf-share.cn-zhangjiakou.aliyuncs.com
        • China (Shenzhen): dlf-share.cn-shenzhen.aliyuncs.com
        • Singapore: dlf-share.ap-southeast-1.aliyuncs.com
        • Germany (Frankfurt): dlf-share.eu-central-1.aliyuncs.com
      • DLF Database Name: The name of the database in DLF. You can log on to the DLF console and choose Metadata > Databases in the left-side navigation pane to obtain the database name.
      • DLF RoleARN: Optional. The Alibaba Cloud Resource Name (ARN) of the RAM role. If you use the custom authorization method, you must configure this parameter. You can log on to the RAM console, choose Identities > Roles in the left-side navigation pane, and then click the name of the RAM role to obtain its ARN.
    Table 3. Create Data Mapping
    • External Data Source Object: By default, this parameter is set to the value of External Data Source.
    • Destination Database: The database in the Hadoop cluster.
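
After the external project is created, you can verify that the metadata mapping works by listing the mapped tables from MaxCompute. The following minimal sketch uses PyODPS; the external project name, credentials, and endpoint are placeholders.

    # List the tables that the external project maps from the data lake.
    # All names and credentials below are placeholders.
    from odps import ODPS

    o = ODPS(
        "<your-access-key-id>",
        "<your-access-key-secret>",
        project="my_mc_project",              # the MaxCompute project bound to the external project
        endpoint="<your-maxcompute-endpoint>",
    )

    for table in o.list_tables(project="ext_lakehouse_project"):
        print(table.name)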

Step 3: Manage the lakehouse in the DataWorks console

  • Use and manage the lakehouse.

    1. In the left-side navigation pane of the DataWorks console, click Lake and Warehouse Integration (Data Lakehouse).
    2. On the Lake and Warehouse Integration (Data Lakehouse) page, find the external project that you want to use.
      • Use the lakehouse.

        Click Use Data Lakehouse in the Actions column of the external project.

      • Update the external project.
        Click Project configuration in the Actions column of the external project. In the Project configuration dialog box, update the information about the external project.
        Note In the Project configuration dialog box, you can change the database name of the external data source that is mapped to the external project, and you can select another external data source. You cannot update an external data source itself. If you want to delete an external data source, go to the Manage External Data Sources tab in the MaxCompute console, find the external data source, and click Delete in the Actions column.
      • Delete the external project.
        Click Delete in the Actions column of the external project.
        Note The external project is logically deleted and enters the silent state. The external project is completely deleted after 15 days. You cannot create an external project with the same name during this period of time.
  • View the metadata of an external project.

    1. In the left-side navigation pane of the DataWorks console, click Workspaces.
    2. In the workspace list, find the workspace that is mapped to your external project and click Data Map in the Actions column.
    3. On the Data Map page, enter the table name in the search box and click Search. You can also go to the All Data tab, select your external project from the Project drop-down list, enter the table name in the search box, and then click Search.
      Note
      • The Apply for Permission and View Lineage features of the table are unavailable.
      • The metadata of tables is updated on a T+1 basis. This means that changes that you make to tables in the external project that is mapped to the external data source, such as a Hive database, are synchronized to Data Map in DataWorks on the following day. Metadata in MaxCompute is updated in real time.
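
Because the metadata in Data Map is synchronized on a T+1 basis, you can read the current schema of a mapped table directly from MaxCompute when you need up-to-date information. The following minimal sketch uses PyODPS; the external project and table names are placeholders.

    # Read the current schema of a mapped table directly, without waiting for
    # the T+1 synchronization to Data Map. All names below are placeholders.
    from odps import ODPS

    o = ODPS(
        "<your-access-key-id>",
        "<your-access-key-secret>",
        project="my_mc_project",
        endpoint="<your-maxcompute-endpoint>",
    )

    t = o.get_table("sample_table", project="ext_lakehouse_project")
    print(t.table_schema)   # column names and types as currently mapped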