This topic describes how to build and manage a data lakehouse using MaxCompute, Data Lake Formation (DLF), and Object Storage Service (OSS). A data lakehouse integrates a data warehouse and a data lake to provide flexible and efficient data processing.
Usage notes
The data lakehouse feature is available only in the China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Shenzhen), China (Hong Kong), Singapore, and Germany (Frankfurt) regions.
MaxCompute, OSS, and DLF must be deployed in the same region.
Procedure
Activate services
On the DLF activation page, activate the DLF service.
Grant permissions for MaxCompute access
When you build a data lakehouse with MaxCompute, DLF, and OSS, the account that runs the MaxCompute project cannot access DLF or OSS until you grant it the required permissions. Two authorization methods are available:
One-click authorization: Use this method if the MaxCompute project and the DLF and OSS services are deployed by using the same account. Click Authorize DLF and OSS to grant the permissions in one step.
Custom authorization: Use this method regardless of whether the MaxCompute project and the DLF and OSS services are deployed by using the same account or different accounts. For more information, see Custom authorization. A policy sketch follows this list.
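For reference, the following sketch shows the general shape of the two policy documents that custom authorization involves: a trust policy that lets MaxCompute assume the RAM role, and a permission policy that grants the role access to DLF and OSS. The service principal odps.aliyuncs.com and the wildcard actions are assumptions for illustration only; take the exact documents from the Custom authorization topic.

```python
import json

# Trust policy that lets the MaxCompute service assume the RAM role.
# "odps.aliyuncs.com" is assumed to be the MaxCompute service principal;
# confirm it against the Custom authorization topic.
trust_policy = {
    "Version": "1",
    "Statement": [
        {
            "Action": "sts:AssumeRole",
            "Effect": "Allow",
            "Principal": {"Service": ["odps.aliyuncs.com"]},
        }
    ],
}

# Permission policy attached to the role. "dlf:*" and "oss:*" are
# deliberately broad placeholders; replace them with the exact actions
# and resources listed in the Custom authorization topic.
permission_policy = {
    "Version": "1",
    "Statement": [
        {"Action": ["dlf:*"], "Effect": "Allow", "Resource": ["*"]},
        {"Action": ["oss:*"], "Effect": "Allow", "Resource": ["*"]},
    ],
}

print(json.dumps(trust_policy, indent=2))
print(json.dumps(permission_policy, indent=2))
```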
Build a data lakehouse in DataWorks
Log on to the DataWorks console and select a region in the upper-left corner.
For more information about the supported regions, see Usage notes.
In the left navigation pane, choose Other Items > Lake and Warehouse Integration (Data Lakehouse).
On the Lake and Warehouse Integration (Data Lakehouse) page, click Start.
On the Create Data Warehouse page, follow the on-screen instructions.
The following table describes the parameters.
Create Data Warehouse:
Parameter
Description
External Project Name
A custom name for the external project. The name must follow these conventions:
The name must start with a letter and can contain only letters, underscores (_), and digits.
The name can be up to 128 characters in length.
For more information about the basic concepts of external projects, see Project concepts.
MaxCompute Project
Select a MaxCompute project.
If you do not have a MaxCompute project, see Create a MaxCompute project.
If the target project is not in the drop-down list, see Attach the target project in the DataWorks console.
Create Data Lake Connection
Parameter
Description
Heterogeneous Data Platform Type
The type of the external data platform. Valid values:
Alibaba Cloud E-MapReduce/Hadoop Cluster: Use MaxCompute and Hadoop to build a data lakehouse.
Alibaba Cloud DLF + OSS: Use MaxCompute, DLF, and OSS to build a data lakehouse.
For the scenario described in this topic, select Alibaba Cloud DLF + OSS.
External Project Description
Optional. The description of the external project.
Region Where DLF Is Activated
The region where the DLF service is activated. Select a region as needed. Valid values:
China (Hangzhou): cn-hangzhou
China (Shanghai): cn-shanghai
China (Beijing): cn-beijing
China (Zhangjiakou): cn-zhangjiakou
China (Shenzhen): cn-shenzhen
China (Hong Kong): cn-hongkong
Singapore: ap-southeast-1
Germany (Frankfurt): eu-central-1
DLF Endpoint
The internal endpoint of the DLF service. Select an endpoint based on your region. Valid values:
China (Hangzhou): dlf-share.cn-hangzhou.aliyuncs.com
China (Shanghai): dlf-share.cn-shanghai.aliyuncs.com
China (Beijing): dlf-share.cn-beijing.aliyuncs.com
China (Zhangjiakou): dlf-share.cn-zhangjiakou.aliyuncs.com
China (Shenzhen): dlf-share.cn-shenzhen.aliyuncs.com
China (Hong Kong): dlf-share.cn-hongkong.aliyuncs.com
Singapore: dlf-share.ap-southeast-1.aliyuncs.com
Germany (Frankfurt): dlf-share.eu-central-1.aliyuncs.com
DLF Database Name
The name of the destination DLF database to which you want to connect.
To obtain the database name, perform the following steps:
Log on to the Data Lake Formation (DLF) console and select a region in the upper-left corner.
In the navigation pane on the left, choose Metadata.
On the Metadata page, click the Table tab.
Obtain the DLF database name.
Currently, only databases in the default DLF Catalog are supported.
DLF RoleARN
Optional. The Alibaba Cloud Resource Name (ARN) of the RAM role. This parameter is required if you use the custom authorization method. A sketch of the ARN format follows the steps below.
To obtain the ARN, perform the following steps:
Log on to the Resource Access Management (RAM) console.
In the navigation pane on the left, choose Identities > Roles.
On the Roles page, click the target Role Name to open its details page.
In the Basic Information section, you can find the ARN.
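The ARN of a RAM role follows the fixed format acs:ram::<account_id>:role/<role_name>. The sketch below composes it from placeholder values; the account ID and role name are illustrative only.

```python
# Placeholder account ID and role name; the ARN format itself is fixed.
account_id = "123456789012****"
role_name = "maxcompute-dlf-access"

role_arn = f"acs:ram::{account_id}:role/{role_name}"
print(role_arn)  # acs:ram::123456789012****:role/maxcompute-dlf-access
```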
Manage the data lakehouse in DataWorks
Log on to the DataWorks console and select a region in the upper-left corner.
In the left navigation pane, choose .
On the Other Items > Lake and Warehouse Integration (Data Lakehouse) page, you can perform the following operations:
Select the target external project and click Use Data Lakehouse in the Actions column to start using the project; a query sketch follows this list.
Click Project configuration in the Actions column of the target external project. In the Project configuration dialog box, you can update the external project information.
You can update the database name of the external data source that is mapped to the MaxCompute external project, and you can select a different external data source. You cannot modify an existing external data source. To delete an external data source, go to the data source's page.
Click Delete in the Actions column of the target external project to delete it. The external project is logically deleted and enters a silent state; it is permanently deleted after 15 days, and you cannot create an external project with the same name during this period.
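After the external project is created, tables in the connected DLF database can be queried from MaxCompute by prefixing the table name with the external project name. The following sketch uses PyODPS, the MaxCompute Python SDK; the credentials, project names, endpoint, and table name are placeholders for your own values.

```python
from odps import ODPS

# Connect to the MaxCompute project that the external project is attached to.
# All names and the endpoint below are placeholders; replace them with your own.
o = ODPS(
    access_id="<your-access-key-id>",
    secret_access_key="<your-access-key-secret>",
    project="my_maxcompute_project",
    endpoint="http://service.cn-hangzhou.maxcompute.aliyun.com/api",
)

# Query a table in the DLF database through the external project.
# The table is referenced as <external_project_name>.<table_name>.
with o.execute_sql(
    "SELECT * FROM ext_dlf_project.sample_table LIMIT 10;"
).open_reader() as reader:
    for record in reader:
        print(record)
```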
View the metadata of a data lakehouse external project
Log on to the DataWorks console and select a region in the upper-left corner.
In the left navigation pane, click Workspace.
On the Workspaces page, find the target workspace and choose Shortcuts > Data Map in the Actions column.
Select the workspace that is attached to the external project.
On the Data Map page, enter the name of a table in the external project in the search box, or click the search icon in the left navigation pane and find the table on the Directory List tab on the right.
Table metadata in Data Map is updated on the next day (T+1). If you modify the table schema at the mapping source, such as in Hive, the changes are synchronized to DataWorks Data Map on the next day. Metadata in the MaxCompute engine is updated in real time.
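Because the MaxCompute engine reflects schema changes immediately, you can verify the current schema of an external table directly instead of waiting for Data Map to refresh. A minimal PyODPS sketch, with placeholder names and assuming a recent PyODPS version that exposes table_schema:

```python
from odps import ODPS

# Placeholder credentials and names; replace them with your own values.
o = ODPS(
    access_id="<your-access-key-id>",
    secret_access_key="<your-access-key-secret>",
    project="my_maxcompute_project",
    endpoint="http://service.cn-hangzhou.maxcompute.aliyun.com/api",
)

# Read the table metadata from the external project; the engine returns
# the current schema even if Data Map has not refreshed yet (T+1).
table = o.get_table("sample_table", project="ext_dlf_project")
for column in table.table_schema.columns:
    print(column.name, column.type)
```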
References
For more information about data lakehouse solutions that use DLF and RDS or Flink and OSS to support Delta Lake or Hudi storage, see Using DLF and RDS or Flink and OSS to support Delta Lake or Hudi storage mechanisms.