Ingest data from ApsaraDB RDS into Object Storage Service (OSS) in Delta Lake format, then query and verify real-time data synchronization through Data Lake Formation (DLF).
Prerequisites
Before you begin, make sure that you have:
Activated the following services: DLF, OSS, Databricks DataInsight (DDI), RDS, and Data Transmission Service (DTS)
An ApsaraDB RDS for MySQL instance
Data Management (DMS) access to run SQL statements against RDS
The DLF, RDS, and DDI instances must be in the same region.
Step 1: Prepare source data in RDS
Create a database named dlf-demo in your RDS instance.
Create a user account (for example, dlf_admin) with read access to the dlf-demo database.
Log on to the database through DMS and run the following SQL statement to create the employees table:

CREATE TABLE `employees` (
  `emp_no` int(11) NOT NULL,
  `birth_date` date NOT NULL,
  `first_name` varchar(14) NOT NULL,
  `last_name` varchar(16) NOT NULL,
  `gender` enum('M','F') NOT NULL,
  `hire_date` date NOT NULL,
  `create_time` DATETIME NOT NULL,
  `update_time` DATETIME NOT NULL,
  PRIMARY KEY (`emp_no`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

Insert sample data:

INSERT INTO `employees` VALUES (10001,'1953-09-02','Georgi','Facello','M','1986-06-26', now(), now());
INSERT INTO `employees` VALUES (10002,'1964-06-02','Bezalel','Simmel','F','1985-11-21', now(), now());
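Before configuring ingestion, it is worth confirming that the sample rows landed in the source table. A minimal check you can run in DMS (the column list here is just an illustrative subset):

```sql
-- Verify the two sample rows exist in the source table
SELECT emp_no, first_name, last_name, hire_date
FROM employees
ORDER BY emp_no;
-- Expected: 10001 (Georgi Facello) and 10002 (Bezalel Simmel)
```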
Step 2: Create a data source
Go to the DLF console.
In the left-side navigation pane, choose Data Ingestion > Data Source Management.
Click Create Data Source.
Enter the connection name, select the RDS instance created in Step 1, and enter the account credentials.
Click Connection Test to verify network connectivity and account availability.
Click Next, then Confirm to complete the data source creation.
Step 3: Create a metadatabase
Create a bucket named dlf-demo in OSS.
In the DLF console's left-side navigation pane, choose Metadata Management > Metadatabase.
Click Create Metadatabase.
Enter a name for the metadatabase, create a directory named delta-test in the OSS bucket, and select it as the storage location.
Step 4: Create and run an ingestion task
In the left-side navigation pane, choose Data Ingestion > Ingestion Task Management.
Click Create Ingestion Task.
Select Relational Database Real-time Ingestion, fill in the data source, target data lake, task configuration, and other required information, and then save the task.
Configure the data source: select the newly created dlf_demo connection, use the table path dlf_demo/employees, choose to create a new DTS subscription, and enter a name for it.
On the task management page, click Run on the newly created ingestion task. The task transitions from Initializing to Running.
Click Details to view the task details and associated database table information.
This ingestion task performs both full and incremental ingestion. The full data import takes approximately 3 to 5 minutes. After the full import completes, the task automatically enters a real-time listening state and propagates subsequent data changes to the Delta Lake tables.
Step 5: Query data in the data lake
In the left-side navigation pane, choose Data Exploration > SQL Query.
In the metadatabase tree, expand dlf_demo_delta. The employees table appears, automatically created by the ingestion task.
Double-click the table name. A SELECT query for the table populates the SQL editor.
Click Run to execute the query and view the results.
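The query that double-clicking generates is a plain SELECT against the Delta Lake table; you can also type it yourself. A minimal sketch, assuming the metadatabase and table names from the steps above (the LIMIT clause is an optional illustration to keep the result set small):

```sql
-- Query the Delta Lake table created by the ingestion task
SELECT * FROM dlf_demo_delta.employees LIMIT 100;
```

At this point the results should match the two rows inserted into the RDS source table in Step 1.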
Step 6: Verify real-time data synchronization
Return to the DMS console and run the following SQL statements to modify the source data:
UPDATE `employees` SET `first_name` = 'dlf-demo', `update_time` = now() WHERE `emp_no` = 10001;
DELETE FROM `employees` WHERE `emp_no` = 10002;
INSERT INTO `employees` VALUES (10011,'1953-11-07','dlf-insert','Sluis','F','1990-01-22', now(), now());

After approximately 1 to 3 minutes, run the SELECT query from Step 5 again on the DLF Data Exploration > SQL Query page. The results show that all data changes have been synchronized to the data lake.
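Each of the three changes can be checked individually once synchronization completes. A sketch of a verification query to run on the Data Exploration page, with the expected state derived from the statements above:

```sql
-- 10001 should show the updated first_name, 10002 should be gone,
-- and 10011 should be present
SELECT emp_no, first_name, last_name
FROM dlf_demo_delta.employees
ORDER BY emp_no;
-- Expected: (10001, 'dlf-demo', 'Facello') and (10011, 'dlf-insert', 'Sluis');
-- no row for 10002
```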
Clean up resources
After completing the tutorial, delete the following resources to avoid ongoing charges:
In the DLF console, stop and delete the ingestion task.
Cancel the DTS subscription created for the ingestion task.
Delete the dlf-demo OSS bucket, or remove the delta-test directory if the bucket is shared.
Drop the dlf-demo database in RDS, or delete the RDS instance if it was created specifically for this tutorial.
Next steps
Use DDI (Spark) to run advanced analytics against the Delta Lake tables in OSS.
Connect additional compute engines such as Presto, Flink, MaxCompute, or Hologres to the DLF catalog.
Set up additional ingestion tasks for other data sources to expand your data lake.