The data ingestion feature of Data Lake Formation (DLF) is no longer updated. This topic describes how to use DataWorks Data Integration to ingest MySQL data into a data lake, create an external project for MaxCompute, and query table data in DLF from the external project.
Limits
Data lakehouse solutions are supported only in the following regions: China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Shenzhen), China (Hong Kong), Singapore, and Germany (Frankfurt).
MaxCompute, Object Storage Service (OSS), and ApsaraDB RDS for MySQL must be deployed in the same region as DLF.
Import MySQL data into a data lake
For more information about how to use DLF to ingest data into a data lake, see Getting Started.
Step 1: Create a metadatabase of a data lake
Log on to the DLF console. In the top navigation bar, select a region. In the left-side navigation pane, choose Metadata > Metadata. On the Metadata page, create a metadatabase. For more information, see Create a metadatabase.
Step 2: Use DataWorks Data Integration to ingest data to OSS
Prepare the data that you want to ingest.
Log on to the ApsaraDB RDS console. In the top navigation bar, select a region. In the left-side navigation pane, click Instances.
On the Instances page, find the desired ApsaraDB RDS for MySQL instance and log on to the desired database.
Create a table in the database and insert a small amount of test data into the table. For more information, see Use DMS to log on to an ApsaraDB RDS for MySQL instance. For example, you can execute the following statements to create a table named rds_mc and insert test data:
CREATE TABLE `rds_mc` (
  `id` varchar(32),
  `name` varchar(32),
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
INSERT INTO `rds_mc`(`id`, `name`) VALUES(1,"Bob");
INSERT INTO `rds_mc`(`id`, `name`) VALUES(2,"zhangsan");
INSERT INTO `rds_mc`(`id`, `name`) VALUES(3,"zhaosi");
INSERT INTO `rds_mc`(`id`, `name`) VALUES(4,"wangwu");
INSERT INTO `rds_mc`(`id`, `name`) VALUES(5,"55555");
INSERT INTO `rds_mc`(`id`, `name`) VALUES(8,"6666");
SELECT * FROM `rds_mc`;
Prepare an ApsaraDB RDS for MySQL data source.
Configure an ApsaraDB RDS for MySQL data source in DataWorks. For more information, see Add a MySQL data source.
Prepare an OSS data source.
Configure an OSS data source in DataWorks. For more information, see Add an OSS data source.
Create and run a data synchronization task.
On the DataStudio page in the DataWorks console, create a batch synchronization task. For more information, see Configure a batch synchronization task by using the codeless UI. The following content describes the key parameters that you must configure.
Configure network connections and resource groups.
Source
Source: Select MySQL.
Data Source Name: Select the name of the ApsaraDB RDS for MySQL data source that you created.
Resource Group
Resource Group: Select the exclusive resource group for Data Integration.
Destination
Destination: Select OSS.
Data Source Name: Select the name of the OSS data source that you created.
Configure a task.
In the Configure tasks step, specify the Table and File name (including path) parameters.
Table: Select the name of the table that you created in the ApsaraDB RDS database.
File name (including path): The value must be in the <File directory created in OSS>/<Name of the data file to be exported to OSS> format. Example: doc-test-01/datalake/anti.csv
Click the save icon in the upper-left corner of the configuration page of the batch synchronization task to save the configuration, and then click the run icon to run the task.
After the task is successfully run in DataWorks, you can check whether the data is imported to the OSS directory.
Step 3: Use DLF to extract metadata and ingest the metadata into a data lake
Log on to the DLF console and create an extraction task to extract metadata and ingest the metadata into the data lake. For more information, see Metadata discovery.
Step 4: View the metadata in the data lake
Log on to the DLF console. In the left-side navigation pane, choose Metadata > Metadata. On the Metadata page, click the Database tab. On the Database tab, find the desired database and click Table in the Actions column. On the Table tab, view information about the desired table.
If the serialization method of the table whose metadata is extracted is org.apache.hadoop.hive.serde2.OpenCSVSerde, note that OpenCSV supports only fields of the STRING data type. If MaxCompute identifies metadata fields as other data types, queries against the table fail and an error is reported. To resolve this issue, you must manually change the data type of the affected fields to STRING.
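For example, assuming the id and name columns of the rds_mc table were identified as non-STRING types, the following Hive-style DDL sketch shows how they could be changed back to STRING from an engine that can modify the table schema stored in DLF; you can also edit the column data types directly in the DLF console.
-- Hypothetical example: change the identified column types of rds_mc back to STRING.
ALTER TABLE rds_mc CHANGE COLUMN id id STRING;
ALTER TABLE rds_mc CHANGE COLUMN name name STRING;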
Grant access permissions
If you want to build a data lakehouse by using MaxCompute, DLF, and OSS, note that the account that is used to create the MaxCompute project cannot access DLF without authorization. You can use one of the following methods to authorize MaxCompute to access DLF:
One-click authorization: If you use the same account to create the MaxCompute project and deploy DLF, we recommend that you perform one-click authorization on the Cloud Resource Access Authorization page in the Resource Access Management (RAM) console.
Custom authorization: You can use this method regardless of whether the same account is used to create the MaxCompute project and deploy DLF. For more information, see Authorize a RAM user to access DLF.
Create an external project for MaxCompute
Create an external project in the DataWorks console.
Log on to the DataWorks console. In the top navigation bar, select the China (Shanghai) region.
In the left-side navigation pane of the DataWorks console, choose Lake and Warehouse Integration (Data Lakehouse).
On the Lake and Warehouse Integration (Data Lakehouse) page, click Start.
On the Create Data Lakehouse page, configure the parameters. The following tables describe the parameters.
Table 1. Parameters in the Create Data Warehouse step (values used in this example)
External Project Name: ext_dlf_delta
MaxCompute Project: ms_proj1
Table 2. Parameters in the Create Data Lake Connection step (values used in this example)
Heterogeneous Data Platform Type: Select Alibaba Cloud DLF + OSS from the drop-down list. Value in this example: Alibaba Cloud DLF + OSS.
External Project Description: None
Region Where DLF Is Activated: cn-shanghai
DLF Endpoint: dlf-share.cn-shanghai.aliyuncs.com
DLF Database Name: datalake
DLF RoleARN: None
Click Create. On the page that appears, click Preview.
If you can preview the table information in the DLF database, the operation is successful.
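You can also confirm the mapping from SQL by describing the synchronized table from the external project. This is a minimal sketch that assumes the external project and table names used in this topic (ext_dlf_delta and rds_mc):
-- Show the schema of the table that the external project maps from the DLF database.
desc ext_dlf_delta.rds_mc;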
Query data from the external project
In the Ad Hoc Query pane of the DataStudio page in the DataWorks console, query table data from the external project.
For more information about the ad hoc query feature provided by DataWorks, see Use an ad hoc query node to execute SQL statements (Optional).
Sample statement:
select * from ext_dlf_delta.rds_mc;
The following figure shows the returned result.
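Based on the test rows that you inserted into the rds_mc table earlier in this topic, the returned result should contain records similar to the following (illustrative; the display format depends on the query editor):
id | name
1  | Bob
2  | zhangsan
3  | zhaosi
4  | wangwu
5  | 55555
8  | 6666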