Build a MaxCompute data lakehouse by using DataWorks and DLF -

Limitations

The data lakehouse solution is supported only in the following regions: China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Shenzhen), China (Hong Kong), Singapore, and Germany (Frankfurt).
MaxCompute, OSS, ApsaraDB RDS, and DLF must be deployed in the same region.

Ingest MySQL data into a data lake

For details on data ingestion in DLF, see Quick start.

Step 1: Create a data lake metadatabase

Log on to the DLF console. In the top navigation bar, select a region. In the left-side navigation pane, choose MetaData > Metadata Management and create a metadatabase. For more information, see Databases, tables, and functions.

Step 2: Import data to OSS with DataWorks

Prepare the source data.

Log on to the ApsaraDB RDS console, select a region, and then click Instances in the left-side navigation pane.
Select the target ApsaraDB RDS instance and log on to the database.

Create a table in the ApsaraDB RDS database and insert some test data. For more information, see Use DMS to log on to an ApsaraDB RDS for MySQL instance. For example, you can create a table named rds_mc with the following SQL statements:

CREATE TABLE `rds_mc` (
  `id` varchar(32) ,
  `name` varchar(32) ,
    PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
INSERT INTO `rds_mc`(`id` ,`name` ) VALUES(1,"Alice");
INSERT INTO `rds_mc`(`id` ,`name` ) VALUES(2,"zhangsan");
INSERT INTO `rds_mc`(`id` ,`name` ) VALUES(3,"zhaosi");
INSERT INTO `rds_mc`(`id` ,`name` ) VALUES(4,"wangwu");
INSERT INTO `rds_mc`(`id` ,`name` ) VALUES(5,"55555");
INSERT INTO `rds_mc`(`id` ,`name` ) VALUES(8,"6666");
SELECT * FROM `rds_mc`;

准备RDS MySQL数据源。

在DataWorks中配置MySQL数据源，具体操作请参见Add a MySQL data source。
准备OSS数据源。

在DataWorks中配置OSS数据源，具体操作请参见Add an OSS data source。

创建并执行数据同步任务。

在DataWorks的数据开发模块中创建离线同步任务，详情请参见Codeless UI configuration。关键配置如下：

网络与资源配置。 image..png

参数		说明
来源	数据来源	MySQL。
来源	数据源名称	已创建的MySQL数据源。
资源组	我的资源组	已创建的数据集成独享资源组。
去向	数据去向	OSS。
去向	数据源名称	已创建的OSS数据源。

配置任务。

在Configure tasks页签填写表和文件名。

参数

说明

表

RDS数据库中已创建的表名。

文件名（含路径）

格式：<OSS中已创建的文件目录名称>/<待导出至OSS的数据文件>。

例如doc-test-01/datalake/anti.csv。

单击离线同步任务配置页面左上角的图标保存配置，然后单击图标运行任务。
DataWorks中任务运行成功后，可在OSS数据源配置路径下查看数据是否导入成功。

Step 3: Discover metadata with DLF

In the DLF console, use metadata discovery to ingest the data. For more information, see Metadata discovery.

Step 4: View data lake metadata

In the DLF console, click MetaData > Metadata Management, go to the target database, and view the table information on the Table List tab.

Important

If the serialization method for the table after metadata discovery is org.apache.hadoop.hive.serde2.OpenCSVSerde, MaxCompute may misinterpret the field types as an opencsv-specific string type instead of a standard string type. This can cause query failures. To resolve this issue, you must manually change the data type of all affected fields to string in DLF.

Authorization

Building a data lakehouse with MaxCompute, DLF, and OSS requires authorization because the account used to create the MaxCompute project lacks default access to DLF and OSS. Grant permissions using one of the following methods:

One-click authorization: Recommended if you use the same account for the MaxCompute project, DLF, and OSS. Click Authorize DLF and OSS to grant the required permissions.
Custom authorization: This method can be used with either the same or different accounts for the MaxCompute project, DLF, or OSS. For more information, see Custom authorization.

Create an external project in MaxCompute

Create an external project in the DataWorks console.

登录DataWorks控制台，选择地域为华东2（上海）。
在DataWorks控制台页面左侧导航栏，选择More > 数据湖集成（湖仓一体）。
在Lake and Warehouse Integration (Data Lakehouse)页面，单击Start。

在Create Data Lakehouse页面，按照界面指引进行操作。参数示例如下所示。

表 1. 创建数据仓库

参数	说明
External Project Name	ext_dlf_delta
MaxCompute Project	ms_proj1

表 2. 创建外部数据湖连接

参数		说明
Heterogeneous Data Platform Type	选择阿里云DLF+OSS数据湖连接	无
阿里云DLF+OSS数据湖连接	External Project Description	无
	DLF所在区	cn-shanghai
	DLF Endpoint	dlf-share.cn-shanghai.aliyuncs.com
	DLF数据库名称	datalake
	DLF RoleARN	无

单击Create后单击Preview。

如果能预览DLF库中表的信息，则表示操作成功。

Query external project data with MaxCompute

In the DataWorks ad hoc query page, query the table data in the external project.

Note

For more information about ad hoc queries in DataWorks, see Use ad hoc queries to run SQL statements.

Sample command:
```
select * from ext_dlf_delta.rds_mc;
```
Result: The command returns the data records from the id and name columns of the rds_mc table. This confirms that the data lake metadata was synchronized successfully.