All Products
Search
Document Center

:Build a MaxCompute data lakehouse by using DataWorks and DLF

Last Updated:Jun 21, 2026

Because the data ingestion feature in DLF is no longer updated, this topic shows you how to use DataWorks data integration to ingest data from an ApsaraDB RDS for MySQL instance. You will also learn how to create an external project in MaxCompute to query table data managed by DLF.

Limitations

  • The data lakehouse solution is supported only in the following regions: China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Shenzhen), China (Hong Kong), Singapore, and Germany (Frankfurt).

  • MaxCompute, OSS, ApsaraDB RDS, and DLF must be deployed in the same region.

Ingest MySQL data into a data lake

For details on data ingestion in DLF, see Quick start.

Step 1: Create a data lake metadatabase

Log on to the DLF console. In the top navigation bar, select a region. In the left-side navigation pane, choose MetaData > Metadata Management and create a metadatabase. For more information, see Databases, tables, and functions.

Step 2: Import data to OSS with DataWorks

  1. Prepare the source data.

    1. Log on to the ApsaraDB RDS console, select a region, and then click Instances in the left-side navigation pane.

    2. Select the target ApsaraDB RDS instance and log on to the database.

    3. Create a table in the ApsaraDB RDS database and insert some test data. For more information, see Use DMS to log on to an ApsaraDB RDS for MySQL instance. For example, you can create a table named rds_mc with the following SQL statements:

      CREATE TABLE `rds_mc` (
        `id` varchar(32) ,
        `name` varchar(32) ,
          PRIMARY KEY (`id`)
      ) ENGINE=InnoDB DEFAULT CHARSET=utf8;
      INSERT INTO `rds_mc`(`id` ,`name` ) VALUES(1,"Alice");
      INSERT INTO `rds_mc`(`id` ,`name` ) VALUES(2,"zhangsan");
      INSERT INTO `rds_mc`(`id` ,`name` ) VALUES(3,"zhaosi");
      INSERT INTO `rds_mc`(`id` ,`name` ) VALUES(4,"wangwu");
      INSERT INTO `rds_mc`(`id` ,`name` ) VALUES(5,"55555");
      INSERT INTO `rds_mc`(`id` ,`name` ) VALUES(8,"6666");
      SELECT * FROM `rds_mc`;
  2. 准备RDS MySQL数据源。

    在DataWorks中配置MySQL数据源,具体操作请参见Add a MySQL data source

  3. 准备OSS数据源。

    在DataWorks中配置OSS数据源,具体操作请参见Add an OSS data source

  4. 创建并执行数据同步任务。

    在DataWorks的数据开发模块中创建离线同步任务,详情请参见Codeless UI configuration。关键配置如下:

    1. 网络与资源配置。image..png

      参数

      说明

      来源

      数据来源

      MySQL。

      数据源名称

      已创建的MySQL数据源。

      资源组

      我的资源组

      已创建的数据集成独享资源组。

      去向

      数据去向

      OSS。

      数据源名称

      已创建的OSS数据源。

    2. 配置任务。

      Configure tasks页签填写表和文件名。

      参数

      说明

      RDS数据库中已创建的表名。

      文件名(含路径)

      格式:<OSS中已创建的文件目录名称>/<待导出至OSS的数据文件>。

      例如doc-test-01/datalake/anti.csv

    3. 单击离线同步任务配置页面左上角的image..png图标保存配置,然后单击image..png图标运行任务。

    4. DataWorks中任务运行成功后,可在OSS数据源配置路径下查看数据是否导入成功。

Step 3: Discover metadata with DLF

In the DLF console, use metadata discovery to ingest the data. For more information, see Metadata discovery.

Step 4: View data lake metadata

In the DLF console, click MetaData > Metadata Management, go to the target database, and view the table information on the Table List tab.

Important

If the serialization method for the table after metadata discovery is org.apache.hadoop.hive.serde2.OpenCSVSerde, MaxCompute may misinterpret the field types as an opencsv-specific string type instead of a standard string type. This can cause query failures. To resolve this issue, you must manually change the data type of all affected fields to string in DLF.

Authorization

Building a data lakehouse with MaxCompute, DLF, and OSS requires authorization because the account used to create the MaxCompute project lacks default access to DLF and OSS. Grant permissions using one of the following methods:

  • One-click authorization: Recommended if you use the same account for the MaxCompute project, DLF, and OSS. Click Authorize DLF and OSS to grant the required permissions.

  • Custom authorization: This method can be used with either the same or different accounts for the MaxCompute project, DLF, or OSS. For more information, see Custom authorization.

Create an external project in MaxCompute

Create an external project in the DataWorks console.

  1. 登录DataWorks控制台,选择地域为华东2(上海)。

  2. 在DataWorks控制台页面左侧导航栏,选择More > 数据湖集成(湖仓一体)

  3. Lake and Warehouse Integration (Data Lakehouse)页面,单击Start

  4. Create Data Lakehouse页面,按照界面指引进行操作。参数示例如下所示。

    表 1. 创建数据仓库

    参数

    说明

    External Project Name

    ext_dlf_delta

    MaxCompute Project

    ms_proj1

    表 2. 创建外部数据湖连接

    参数

    说明

    Heterogeneous Data Platform Type

    选择阿里云DLF+OSS数据湖连接

    阿里云DLF+OSS数据湖连接

    External Project Description

    DLF所在区

    cn-shanghai

    DLF Endpoint

    dlf-share.cn-shanghai.aliyuncs.com

    DLF数据库名称

    datalake

    DLF RoleARN

  5. 单击Create后单击Preview

    如果能预览DLF库中表的信息,则表示操作成功。

Query external project data with MaxCompute

In the DataWorks ad hoc query page, query the table data in the external project.

Note

For more information about ad hoc queries in DataWorks, see Use ad hoc queries to run SQL statements.

  • Sample command:

    select * from ext_dlf_delta.rds_mc;
  • Result: The command returns the data records from the id and name columns of the rds_mc table. This confirms that the data lake metadata was synchronized successfully.