
Data Lake Formation: End-to-end data lake formation and analysis practice based on Delta Lake

Last Updated: Apr 24, 2025

When building and applying a data lake, enterprises typically go through several stages, such as data ingestion, data lake storage and management, and data lake exploration and analysis. This topic describes an end-to-end data ingestion and analysis practice based on Alibaba Cloud Data Lake Formation (DLF).

Background information

As the data era continues to develop, data volumes are growing explosively and data takes increasingly diverse forms. The traditional data warehouse model increasingly shows limitations such as high cost, slow response, and limited supported formats. As a result, data lakes have emerged, offering lower costs, richer data forms, and more flexible analysis and computation.

As a centralized data storage repository, a data lake supports a variety of data types, including structured, semi-structured, and unstructured data. Data sources include database data, binlog incremental data, log data, and historical data from existing data warehouses. A data lake can centrally store and manage data from these different sources and formats in cost-effective storage such as Object Storage Service (OSS) and provide a unified data catalog for external use. It supports various computation and analysis methods, effectively solving the data silo problem faced by enterprises while significantly reducing the cost of storing and using data.

Enterprise-level data lake architecture

Data lake storage and format

Data lake storage primarily uses OSS as its medium, which offers advantages such as low cost, high stability, and high scalability.

In data lakes, we can use data lake storage formats that support ACID transactions, such as Delta Lake, Hudi, and Iceberg. These formats carry their own metadata management capabilities and support operations such as Update and Delete, solving the problem of real-time data updates in big data scenarios in a unified batch and streaming manner.
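
For a concrete sense of what these formats enable, the following Spark SQL sketch performs row-level changes on a Delta Lake table. The table names (lake_db.employees and lake_db.employees_changes) are hypothetical placeholders, and the statements assume a Spark session with Delta Lake support enabled.

-- Row-level update and delete on a Delta Lake table (Spark SQL syntax)
UPDATE lake_db.employees SET last_name = 'Smith' WHERE emp_no = 10001;
DELETE FROM lake_db.employees WHERE emp_no = 10002;

-- Upsert a batch or stream of change records from a staging table
MERGE INTO lake_db.employees AS t
USING lake_db.employees_changes AS s
ON t.emp_no = s.emp_no
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;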

Data lake formation and management

1. Data ingestion

The raw data of enterprises resides in various databases or storage systems, such as relational databases like MySQL, log systems like SLS, NoSQL stores like HBase, and message queues like Kafka. Most of this online storage serves online transactional business and is not suitable for analytical workloads. Therefore, data needs to be synchronized, in a non-intrusive manner, to cost-effective object storage that is better suited to computation and analysis.

Common data synchronization methods include batch synchronization with tools such as DataX and Sqoop. In scenarios with high real-time requirements, streaming synchronization links such as Kafka + Spark Streaming or Flink are used. Many cloud vendors also provide one-stop data ingestion solutions, such as Alibaba Cloud DLF data ingestion, to help customers ingest data faster and at lower cost.
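
As an illustration of such a streaming link, the Flink SQL sketch below declares a MySQL change-data-capture source with the open-source MySQL CDC connector and reads its change stream. The connection values are placeholders, and it assumes the flink-sql-connector-mysql-cdc dependency is available; DLF data ingestion packages this kind of pipeline as a managed service.

-- Illustrative Flink SQL: capture MySQL changes as a stream (placeholder values)
CREATE TABLE employees_source (
  emp_no INT,
  first_name STRING,
  last_name STRING,
  hire_date DATE,
  PRIMARY KEY (emp_no) NOT ENFORCED
) WITH (
  'connector' = 'mysql-cdc',
  'hostname' = 'rds-host',
  'port' = '3306',
  'username' = 'dlf_admin',
  'password' = '<password>',
  'database-name' = 'dlf-demo',
  'table-name' = 'employees'
);

-- Full snapshot first, then continuous binlog changes; in practice the result
-- is written to a lake-format table instead of being queried interactively
SELECT * FROM employees_source;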

2. Unified metadata service

Object storage by itself does not carry the semantics needed for big data analysis. It must be combined with a metadata service, such as Hive Metastore Service, to provide meta information to the various upper-layer analysis engines.
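
With a unified catalog in place, every attached engine sees the same databases and tables. As a rough illustration, an engine such as Spark SQL that is configured to use the lake catalog can browse the metadata directly; the names below are the ones used later in this walkthrough.

-- Browse the shared catalog from any attached engine (Spark SQL syntax)
SHOW DATABASES;
SHOW TABLES IN dlf_demo_delta;
DESCRIBE TABLE dlf_demo_delta.employees;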

Data lake computation and analysis

Compared to data warehouses, data lakes connect with various compute engines in a more open manner, such as traditional open-source big data compute engines like Hive, Spark, Presto, and Flink. They also support big data engines developed by cloud vendors, such as Alibaba Cloud MaxCompute and Hologres. Between data lake storage and compute engines, data lake acceleration services are generally provided to improve computation and analysis performance while reducing bandwidth costs and pressure.
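
For example, once a compute engine is attached to the lake catalog, it can run analytical queries directly over the lake tables. The sketch below uses Spark SQL syntax and the table name from the later walkthrough as a placeholder.

-- Example analytical query over a lake table (Spark SQL syntax)
SELECT gender, COUNT(*) AS employee_count
FROM dlf_demo_delta.employees
GROUP BY gender;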

Flowchart

Data lake formation and analysis link

The main data link is as follows:

Step 1: Service activation and data preparation

1. Service activation

Ensure that cloud product services such as DLF, OSS, DDI, RDS, and DTS are activated. Note that DLF, RDS, and DDI instances must be in the same region.

2. Data preparation

Prepare RDS data: create a database named dlf-demo in RDS. In the account center, create an account that can read the dlf-demo database, such as dlf_admin.
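
The database and account are created in the RDS console in this walkthrough. If you prefer to do this in SQL, a rough equivalent looks like the following; the password is a placeholder, and write privileges are included only so that the later update, delete, and insert statements can also be run with this account.

-- Rough SQL equivalent of the console steps (MySQL syntax, placeholder password)
CREATE DATABASE `dlf-demo`;
CREATE USER 'dlf_admin'@'%' IDENTIFIED BY '<your_password>';
GRANT SELECT, INSERT, UPDATE, DELETE ON `dlf-demo`.* TO 'dlf_admin'@'%';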

Log on to the database through DMS and run the following statements to create the employees table and insert a small amount of data.

CREATE TABLE `employees` (
  `emp_no` int(11) NOT NULL,
  `birth_date` date NOT NULL,
  `first_name` varchar(14) NOT NULL,
  `last_name` varchar(16) NOT NULL,
  `gender` enum('M','F') NOT NULL,
  `hire_date` date NOT NULL,
  `create_time` DATETIME NOT NULL,
  `update_time` DATETIME NOT NULL,
  PRIMARY KEY (`emp_no`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

INSERT INTO `employees` VALUES (10001,'1953-09-02','Georgi','Facello','M','1986-06-26', now(), now());
INSERT INTO `employees` VALUES (10002,'1964-06-02','Bezalel','Simmel','F','1985-11-21', now(), now());

Step 2: Data ingestion

1. Create a data source

a. Go to the DLF console and click the menu "Data Ingestion" -> "Data Source Management".

b. Click "Create Data Source". Fill in the connection name, select the RDS instance used in data preparation, fill in the account password, and click "Connection Test" to verify network connectivity and account availability.

c. Click "Next" and then "Confirm" to complete the data source creation.

2. Create a metadatabase

a. Create a new OSS bucket named dlf-demo.

b. In the left-side navigation pane, choose "Metadata Management" -> "Metadatabase" and click "Create Metadatabase". Fill in the name, create a directory named delta-test, and select it.

3. Create an ingestion task

a. Click the menu "Data Ingestion" -> "Ingestion Task Management", and then click "Create Ingestion Task".

b. Select "Relational Database Real-time Ingestion", fill in the data source, target data lake, task configuration, and other information. Save it.

c. Configure the data source: select the newly created "dlf_demo" connection, use the table path "dlf_demo/employees", choose to create a new DTS subscription, and fill in its name.

d. Return to the task management page and click "Run" on the newly created ingestion task. The task first enters the "Initializing" status and then the "Running" status.

e. Click "Details" to enter the task details page, where you can see the relevant database table information.

This ingestion task performs both full and incremental ingestion. After about 3 to 5 minutes, the full data is imported, and the task then automatically switches to the real-time listening state. Any subsequent data changes are automatically applied to the Delta Lake table.

Step 3: Data lake exploration and analysis

The DLF product provides lightweight data preview and exploration functions. Click the menu "Data Exploration" -> "SQL Query" to enter the data query page.

a. In the metadatabase list, find "dlf_demo_delta". Expand it and you can see that the employees table has been created automatically. Double-click the table name, and a SQL query statement for the table appears in the SQL editor on the right (a sketch of this statement is shown after step c). Click "Run" to get the query result.

b. Return to the DMS console and run the following update, delete, and insert SQL statements.

UPDATE `employees` SET `first_name` = 'dlf-demo', `update_time` = now() WHERE `emp_no` = 10001;
DELETE FROM `employees` WHERE `emp_no` = 10002;
INSERT INTO `employees` VALUES (10011,'1953-11-07','dlf-insert','Sluis','F','1990-01-22', now(), now());

c. After about 1 to 3 minutes, run the previous select statement again in DLF Data Exploration. You will see that all of the data changes have been synchronized to the data lake.
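
For reference, the select statement used in steps a and c is a plain scan of the lake table. It is generated when you double-click the table name and looks roughly like the following; the exact statement produced by the console may differ.

-- Sketch of the generated query over the lake table
SELECT * FROM dlf_demo_delta.employees LIMIT 10;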