
MaxCompute: Implement a MaxCompute data lakehouse solution by using DLF to ingest data into a data lake

Last Updated: Jun 28, 2024

The data ingestion feature of Data Lake Formation (DLF) is no longer updated. This topic describes how to use DataWorks Data Integration to ingest MySQL data into a data lake, create an external project for MaxCompute, and query table data in DLF from the external project.

Limits

  • Data lakehouse solutions are supported only in the following regions: China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Shenzhen), China (Hong Kong), Singapore, and Germany (Frankfurt).

  • MaxCompute, Object Storage Service (OSS), and ApsaraDB RDS for MySQL must be deployed in the same region as DLF.

Import MySQL data into a data lake

For more information about how to use DLF to ingest data into a data lake, see Getting Started.

Step 1: Create a metadatabase of a data lake

Log on to the DLF console. In the top navigation bar, select a region. In the left-side navigation pane, choose Metadata > Metadata. On the Metadata page, create a metadatabase. For more information, see Create a metadatabase.

Step 2: Use DataWorks Data Integration to ingest data to OSS

  1. Prepare the data that you want to ingest.

    1. Log on to the ApsaraDB RDS console. In the top navigation bar, select a region. In the left-side navigation pane, click Instances.

    2. On the Instances page, find the desired ApsaraDB RDS for MySQL instance and log on to the desired database.

    3. Create a table in the database and insert a small amount of test data into the table. For more information, see Use DMS to log on to an ApsaraDB RDS for MySQL instance. For example, you can execute the following statements to create a table named rds_mc and insert test data into it:

      CREATE TABLE `rds_mc` (
        `id` varchar(32),
        `name` varchar(32),
        PRIMARY KEY (`id`)
      ) ENGINE=InnoDB DEFAULT CHARSET=utf8;

      INSERT INTO `rds_mc` (`id`, `name`) VALUES ('1', 'Bob');
      INSERT INTO `rds_mc` (`id`, `name`) VALUES ('2', 'zhangsan');
      INSERT INTO `rds_mc` (`id`, `name`) VALUES ('3', 'zhaosi');
      INSERT INTO `rds_mc` (`id`, `name`) VALUES ('4', 'wangwu');
      INSERT INTO `rds_mc` (`id`, `name`) VALUES ('5', '55555');
      INSERT INTO `rds_mc` (`id`, `name`) VALUES ('8', '6666');

      SELECT * FROM `rds_mc`;
  2. Prepare an ApsaraDB RDS for MySQL data source.

    Configure an ApsaraDB RDS for MySQL data source in DataWorks. For more information, see Add a MySQL data source.

  3. Prepare an OSS data source.

    Configure an OSS data source in DataWorks. For more information, see Add an OSS data source.

  4. Create and run a data synchronization task.

    On the DataStudio page in the DataWorks console, create a batch synchronization task. For more information, see Configure a batch synchronization task by using the codeless UI. The following content describes the key parameters that you must configure.

    1. Configure network connections and resource groups.

      Source

        • Source: Select MySQL.

        • Data Source Name: Select the name of the ApsaraDB RDS for MySQL data source that you created.

      Resource Group

        • Resource Group: Select the exclusive resource group for Data Integration.

      Destination

        • Destination: Select OSS.

        • Data Source Name: Select the OSS data source that you created.

    2. Configure a task.

      In the Configure tasks step, specify the Table and File name (including path) parameters.

      • Table: Select the name of the table that is created in the ApsaraDB RDS database.

      • File name (including path): The file name must be in the <File directory created in OSS>/<Name of the data file to be exported to OSS> format. Example: doc-test-01/datalake/anti.csv.

    3. Click the save icon in the upper-left corner of the configuration page of the batch synchronization task to save the configuration, and then click the run icon to run the task.

    4. After the task is successfully run in DataWorks, you can check whether the data is imported to the OSS directory.

Step 3: Use DLF to extract metadata and ingest the metadata into a data lake

Log on to the DLF console and create an extraction task to extract metadata and ingest the metadata into the data lake. For more information, see Metadata discovery.

Step 4: View the metadata in the data lake

Log on to the DLF console. In the left-side navigation pane, choose Metadata > Metadata. On the Metadata page, click the Database tab. On the Database tab, find the desired database and click Table in the Actions column. On the Table tab, view information about the desired table.

Important

If the serialization method of the table whose metadata is extracted is org.apache.hadoop.hive.serde2.OpenCSVSerde, note that OpenCSV supports only the STRING data type. If MaxCompute identifies metadata fields as other data types, queries against the table fail and an error is reported. To resolve this issue, you must manually change the data type of the identified metadata fields to STRING.
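
After you correct the field types and create the external project that is described later in this topic, you can check the schema that MaxCompute sees before you run queries. The following is a minimal sketch, assuming the ext_dlf_delta external project and the rds_mc table from the examples in this topic, and assuming that the DESC statement can be used on tables in external projects in the same way as on regular MaxCompute tables:

  -- Hedged check: all columns of a CSV-backed table that uses OpenCSVSerde
  -- should be reported as STRING. The project and table names are the
  -- example names used in this topic.
  DESC ext_dlf_delta.rds_mc;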

Grant access permissions

To build a data lakehouse by using MaxCompute, DLF, and OSS, you must first authorize MaxCompute to access DLF. The account that is used to create the MaxCompute project cannot access DLF without authorization. You can use one of the following methods to complete the authorization:

  • One-click authorization: If you use the same account to create the MaxCompute project and deploy DLF, we recommend that you perform one-click authorization on the Cloud Resource Access Authorization page in the Resource Access Management (RAM) console.

  • Custom authorization: You can use this method regardless of whether the same account is used to create the MaxCompute project and deploy DLF. For more information, see Authorize a RAM user to access DLF.

Create an external project for MaxCompute

Create an external project in the DataWorks console.

  1. Log on to the DataWorks console. In the top navigation bar, select the China (Shanghai) region.

  2. In the left-side navigation pane of the DataWorks console, choose More > Lake and Warehouse Integration (Data Lakehouse).

  3. On the Lake and Warehouse Integration (Data Lakehouse) page, click Start.

  4. On the Create Data Lakehouse page, configure the parameters. The following tables describe the parameters.

    Table 1. Parameters in the Create Data Warehouse step (example values)

      • External Project Name: ext_dlf_delta

      • MaxCompute Project: ms_proj1

    Table 2. Parameters in the Create Data Lake Connection step (example values)

      • Heterogeneous Data Platform Type: Select Alibaba Cloud DLF + OSS from the drop-down list.

      • External Project Description: None

      • Region Where DLF Is Activated: cn-shanghai

      • DLF Endpoint: dlf-share.cn-shanghai.aliyuncs.com

      • DLF Database Name: datalake

      • DLF RoleARN: None

  5. Click Create. On the page that appears, click Preview.

    If you can preview the table information in the DLF database, the operation is successful.
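
You can also verify the result from SQL. The following is a minimal sketch, assuming the external project name ext_dlf_delta that is used in this topic, and assuming that the SHOW TABLES statement accepts an external project name:

  -- Hedged verification: the output should include the rds_mc table whose
  -- metadata was discovered by DLF. ext_dlf_delta is the example name used
  -- in this topic.
  SHOW TABLES IN ext_dlf_delta;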

Query data from the external project

In the Ad Hoc Query pane of the DataStudio page in the DataWorks console, query table data from the external project.

Note

For more information about the ad hoc query feature provided by DataWorks, see Use an ad hoc query node to execute SQL statements (Optional).

  • Sample statement:

    SELECT * FROM ext_dlf_delta.rds_mc;
  • If the query returns the data that was imported into the data lake, the external project is configured correctly.
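
If you plan to run repeated analytics on the lake data, you can also materialize it into a regular MaxCompute table. The following is a minimal sketch, not part of the original procedure: it assumes the ext_dlf_delta.rds_mc table from this topic, a hypothetical target table named ods_rds_mc, and that the lake columns are STRING as described in the note about OpenCSVSerde:

  -- Hedged sketch: copy the lake table into a MaxCompute table.
  -- ods_rds_mc is a hypothetical name; CAST is used because the lake columns
  -- are STRING after metadata discovery.
  CREATE TABLE IF NOT EXISTS ods_rds_mc AS
  SELECT CAST(id AS BIGINT) AS id, name
  FROM ext_dlf_delta.rds_mc;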