MaxCompute provides a lakehouse solution that enables you to build a data management platform that combines data lakes and data warehouses. This solution integrates the flexibility and broad ecosystem compatibility of data lakes with the enterprise-class deployment of data warehouses. This topic describes how to use MaxCompute and heterogeneous data platforms to implement the lakehouse solution. The lakehouse solution is in public preview.
Background information
- The lakehouse solution is implemented by integrating MaxCompute with Data Lake Formation (DLF) and Object Storage Service (OSS). In this scenario, the metadata of the data lake is stored in DLF. MaxCompute can use the metadata management capability of DLF to efficiently process semi-structured data in OSS. The semi-structured data in OSS includes data in the Delta Lake, Apache Hudi, Avro, CSV, JSON, Parquet, and ORC formats.
- The lakehouse solution is implemented by integrating MaxCompute with a Hadoop cluster. In this scenario, you can use a Hadoop cluster that is deployed in a data center, on virtual machines (VMs) in the cloud, or in Alibaba Cloud E-MapReduce (EMR). If MaxCompute is connected to the virtual private cloud (VPC) in which the Hadoop cluster is deployed, MaxCompute can directly access Hive metastores and map their metadata to external projects of MaxCompute, as illustrated by the sketch that follows this list.
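After an external project is created (see Step 2), the tables that it maps from DLF or from a Hive metastore can be queried from a regular MaxCompute project by using the <external_project>.<table> reference. The following minimal sketch assumes that the PyODPS SDK is installed and uses hypothetical names: an external project named ext_lakehouse, a table named ods_orders in it, and placeholder credentials and endpoint. The Hive-compatibility hint is also an assumption; adjust it to match your project configuration.

```python
from odps import ODPS

# Minimal sketch (assumptions: PyODPS is installed, the external project
# "ext_lakehouse" is already mapped to a DLF database or a Hive metastore,
# and it contains a table named "ods_orders").
o = ODPS(
    '<AccessKey ID>',
    '<AccessKey Secret>',
    project='my_maxcompute_project',   # the regular MaxCompute project
    endpoint='<MaxCompute endpoint>',
)

# Tables in an external project are referenced as <external_project>.<table>.
sql = 'SELECT * FROM ext_lakehouse.ods_orders LIMIT 10;'

# Hive-compatible behavior is often enabled when reading lake data;
# this hint is an assumption -- adjust it to your project configuration.
instance = o.execute_sql(sql, hints={'odps.sql.hive.compatible': 'true'})

with instance.open_reader() as reader:
    for record in reader:
        print(record)
```

The same statement can also be run in the MaxCompute client or in a DataWorks SQL node; PyODPS is used here only to keep the sketch self-contained.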
Limits
- The lakehouse solution is supported in the following regions: China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Shenzhen), Singapore, and Germany (Frankfurt).
- MaxCompute can be deployed in a different region from OSS and DLF, but OSS and DLF must be deployed in the same region.
Prerequisites
- MaxCompute is activated and a MaxCompute project is created. For more information, see Activate MaxCompute and Create a MaxCompute project.
Note
- If MaxCompute is activated, you can use it directly. If MaxCompute is not activated, we recommend that you enable the Hive-compatible data type edition when you activate MaxCompute.
- If you want to implement the lakehouse solution by integrating MaxCompute with a Hadoop cluster, we recommend that the VPC in which the Hadoop cluster is deployed be in the same region as MaxCompute. This way, you are not charged for cross-region network connections.
- Before you implement the lakehouse solution by integrating MaxCompute with DLF and OSS, make sure that the following prerequisites are met:
- DLF is activated. You can activate DLF on the buy page of DLF.
- OSS is activated. For more information, see Activate OSS.
- Before you implement the lakehouse solution by integrating MaxCompute with a Hadoop cluster, make sure that the high availability (HA) feature is enabled for the Hadoop cluster. For more information, contact O&M engineers of the Hadoop cluster.
Step 1: Authorize MaxCompute to access your cloud resources
- If you want to implement the lakehouse solution by integrating MaxCompute with a Hadoop cluster, complete authorization by using the following method:
Authorize MaxCompute to create an elastic network interface (ENI) in your VPC. After the ENI is created, MaxCompute is connected to your VPC. You can use the Alibaba Cloud account to which the VPC belongs to log on to the Resource Access Management (RAM) console and complete authorization on the Cloud Resource Access Authorization page.
- If you want to implement the lakehouse solution by integrating MaxCompute with DLF and OSS, the account that is used to create the MaxCompute project cannot access DLF without authorization. Complete authorization by using one of the following methods:
- One-click authorization: If you use the same account to create the MaxCompute project and deploy DLF, we recommend that you perform one-click authorization on the Cloud Resource Access Authorization page in the RAM console.
- Custom authorization: You can use this method regardless of whether the same account is used to create the MaxCompute project and deploy DLF. For more information, see Authorize a RAM user to access DLF.
Step 2: Create a lakehouse in the DataWorks console
Step 3: Manage the lakehouse in the DataWorks console
- Manage the lakehouse.
- In the left-side navigation pane of the DataWorks console, click Lake and Warehouse Integration (Data Lakehouse).
- On the Lake and Warehouse Integration (Data Lakehouse) page, find the external project that you want to use.
- Use the lakehouse.
Click Use Data Lakehouse in the Actions column of the external project.
- Update the external project.
Click Project configuration in the Actions column of the external project. In the Project configuration dialog box, update the information about the external project.
Note
You can change the database name of the external data source that is mapped to the external project, and you can select another external data source. If you want to delete an external data source, go to the Manage External Data Sources tab in the MaxCompute console, find the external data source, and click Delete in the Actions column. You cannot update an external data source.
- Delete the external project.
Click Delete in the Actions column of the external project.
Note
The external project is logically deleted and enters the silent state. The external project is completely deleted after 15 days. You cannot create an external project with the same name during this period.
- View the metadata of an external project. A programmatic alternative is sketched after this procedure.
- In the left-side navigation pane of the DataWorks console, click Workspaces.
- In the workspace list, find the workspace that is mapped to your external project and click Data Map in the Actions column.
- On the Data Map page, enter the table name in the search box and click Search. You can also go to the All Data tab, select your external project from the Project drop-down list, enter the table name in the search box, and then click Search.
Note
- The Apply for Permission and View Lineage features of the table are unavailable.
- The metadata of tables in an external project is updated on a T+1 basis. Changes that you make to a table in the external data source that is mapped to the external project, such as a Hive database, are synchronized to Data Map of DataWorks on the next day. Metadata in MaxCompute is updated in real time.
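Besides Data Map, the metadata that an external project maps from the external data source can be inspected programmatically. The following minimal sketch again assumes the PyODPS SDK and the hypothetical names ext_lakehouse and ods_orders, with placeholder credentials and endpoint.

```python
from odps import ODPS

# Minimal sketch (assumptions: PyODPS is installed and "ext_lakehouse" is an
# external project that is attached to the current MaxCompute project).
o = ODPS(
    '<AccessKey ID>',
    '<AccessKey Secret>',
    project='my_maxcompute_project',
    endpoint='<MaxCompute endpoint>',
)

# List the tables that the external project maps from the external data source.
for table in o.list_tables(project='ext_lakehouse'):
    print(table.name)

# Inspect the column definitions of a single mapped table (hypothetical name).
t = o.get_table('ods_orders', project='ext_lakehouse')
print(t.schema)
```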