When OLTP databases grow large, teams often shard data across multiple databases and tables for throughput—but this makes unified analysis difficult. Data Lake Formation (DLF) integrates with Realtime Compute for Apache Flink (built on the Ververica Platform (VVP)) and Flink Change Data Capture (CDC) to consolidate that data in a lake in real time. DLF then provides centralized metadata management, so you can query the same data from multiple analytics engines—such as E-MapReduce (EMR), MaxCompute, and Hologres—without moving it again.
This tutorial walks you through an end-to-end pipeline: capture changes from a MySQL database with Flink CDC, write them to an Apache Hudi result table, sync the table metadata to a DLF catalog, and query the data lake with Flink SQL—all without writing any Java or Scala code.
How it works
Realtime Compute for Apache Flink reads change events from a MySQL source using the mysql-cdc connector. Flink writes those changes to an Apache Hudi result table and simultaneously syncs the table schema to a DLF catalog. Once metadata is registered in DLF, you can query the Hudi table directly from the Flink SQL console or from connected compute engines. DLF also handles data lake lifecycle management and lake format optimization, keeping data accessible and cost-efficient over time.
Prerequisites
Before you begin, make sure you have:
Realtime Compute for Apache Flink activated, with a fully managed Flink workspace created. See Get started with Realtime Compute for Apache Flink.
DLF activated. If not yet activated, go to the Data Lake Formation product page and click Activate Now.
An ApsaraDB RDS for MySQL instance in the same region and VPC as the Realtime Compute for Apache Flink workspace, running MySQL 5.7 or later. For setup instructions, see Create an ApsaraDB RDS for MySQL instance. Skip this item if you use a different source database.
Step 1: Prepare MySQL data
Log in to the ApsaraDB RDS for MySQL instance. See Connect to an ApsaraDB RDS for MySQL instance.
Run the following SQL to create a test database and table, then insert sample data.
CREATE DATABASE testdb;

CREATE TABLE testdb.student (
  `id` bigint(20) NOT NULL,
  `name` varchar(256) DEFAULT NULL,
  `age` bigint(20) DEFAULT NULL,
  PRIMARY KEY (`id`)
);

INSERT INTO testdb.student VALUES (1,'name1',10);
INSERT INTO testdb.student VALUES (2,'name2',20);
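Before moving on, it can help to confirm that the sample rows exist and that the instance is configured for change capture. The statements below are a minimal sketch to run in the same MySQL session. They assume the standard MySQL system variables: the mysql-cdc connector reads the binary log, so `log_bin` is expected to be `ON` and `binlog_format` to be `ROW`.

```sql
-- Confirm that the two sample rows were inserted
SELECT * FROM testdb.student ORDER BY id;

-- mysql-cdc reads change events from the binary log:
-- binary logging should be enabled and row-based
SHOW VARIABLES LIKE 'log_bin';
SHOW VARIABLES LIKE 'binlog_format';
```

If either variable reports a different value, adjust the instance parameters before you start the Flink job.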
Step 2: Create a DLF catalog in Flink
Log in to the Realtime Compute for Apache Flink console.
On the Fully Managed Flink tab, click Console in the Actions column of the target workspace.
In the left-side navigation pane, click Catalogs, then click Create Catalog.
On the Create Catalog page, select DLF and click Next.
Enter the catalog configuration, then click Confirm. For parameter details, see Manage DLF catalogs.

After the catalog is created, it appears as dlf under Catalogs. This is the default data catalog for DLF.

Step 3: Create a Flink data lake job
Create the source and result tables
In the left-side navigation pane, click Development > Scripts.
In the SQL editing area, enter the following SQL and click Run.
-- Create a source table that reads change events from MySQL
CREATE TABLE IF NOT EXISTS student_source (
  -- BIGINT matches the bigint(20) columns created in Step 1
  id BIGINT,
  name VARCHAR(256),
  age BIGINT,
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'mysql-cdc',
  -- Replace with the endpoint of your ApsaraDB RDS for MySQL instance
  'hostname' = 'rm-xxxxxxxx.mysql.rds.aliyuncs.com',
  'port' = '3306',
  'username' = '<RDS username>',
  'password' = '<RDS password>',
  'database-name' = '<RDS database>',
  -- Set to the name of the source table created in Step 1
  'table-name' = 'student'
);

-- Create the target database in the DLF catalog
-- Replace 'dlf' with your DLF catalog name if different
CREATE DATABASE IF NOT EXISTS dlf.dlf_testdb;

-- Create the Apache Hudi result table in the DLF catalog
CREATE TABLE IF NOT EXISTS dlf.dlf_testdb.student_hudi (
  id BIGINT PRIMARY KEY NOT ENFORCED,
  name STRING,
  age BIGINT
) WITH (
  'connector' = 'hudi'
);

Placeholders:
rm-xxxxxxxx.mysql.rds.aliyuncs.com: Endpoint of the ApsaraDB RDS for MySQL instance. Example: rm-bp1xxxxxxxx.mysql.rds.aliyuncs.com
<RDS username>: MySQL user with read access to the source table. Example: admin
<RDS password>: Password for the MySQL user.
<RDS database>: Name of the source database. Example: testdb

After the tables are created, both appear under Catalogs.
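As an alternative to checking the Catalogs pane, the tables can be inspected from the SQL editor. This is a minimal sketch using standard Flink SQL statements. It assumes the catalog is named dlf and that the source table was created in the workspace's current catalog and database. Whether the editor runs several statements in one pass is an assumption; if it does not, run them one at a time.

```sql
-- List the catalogs visible to the workspace; the DLF catalog should be included
SHOW CATALOGS;

-- Show the column names and types of the CDC source table
DESCRIBE student_source;

-- Show the column names and types of the Hudi result table registered in DLF
DESCRIBE dlf.dlf_testdb.student_hudi;
```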

For the full list of mysql-cdc connector parameters, see MySQL source connector. For Hudi result table parameters, see Hudi connector (to be retired).
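For the sharded scenario described in the introduction, the mysql-cdc connector accepts regular expressions in 'database-name' and 'table-name', so one source table can read from several shards. The sketch below is illustrative only: the table name student_sharded_source and the shard naming pattern (testdb_0, testdb_1, ... and student_0, student_1, ...) are hypothetical and not part of this tutorial's setup. It also assumes that id values are unique across all shards; otherwise rows from different shards overwrite each other in the result table.

```sql
-- Hypothetical source table that merges sharded databases and tables
CREATE TABLE IF NOT EXISTS student_sharded_source (
  id BIGINT,
  name VARCHAR(256),
  age BIGINT,
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'mysql-cdc',
  'hostname' = 'rm-xxxxxxxx.mysql.rds.aliyuncs.com',
  'port' = '3306',
  'username' = '<RDS username>',
  'password' = '<RDS password>',
  -- Regular expressions: match every shard that follows the assumed naming pattern
  'database-name' = 'testdb_[0-9]+',
  'table-name' = 'student_[0-9]+'
);
```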
Create and deploy the streaming job
In the left-side navigation pane, click Development > ETL.
Click New, select Blank Stream Draft in the New Draft dialog box, and click Next.
Enter the draft configuration and click Create.
In the SQL editing area, enter the following INSERT statement.
-- Stream changes from the MySQL source into the Hudi result table
INSERT INTO dlf.dlf_testdb.student_hudi
SELECT * FROM student_source /*+ OPTIONS('server-id'='123456') */;

In the upper-right corner of the SQL editing area, click Deploy.
In the Deploy draft dialog box, fill in the required fields and click Confirm.
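The 'server-id' hint identifies the job to MySQL as a binlog client, so each job that reads from the same instance should use a value that no other client uses. If the source runs with a parallelism greater than 1, the connector also accepts a range, and the range should contain at least as many IDs as the source parallelism. The following is a sketch of the same INSERT statement with a range; the values 5401-5404 are illustrative and assume a parallelism of at most 4.

```sql
-- Same job as above, with a server-id range for a parallel source.
-- '5401-5404' is an example; choose IDs that no other binlog client uses.
INSERT INTO dlf.dlf_testdb.student_hudi
SELECT * FROM student_source /*+ OPTIONS('server-id'='5401-5404') */;
```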
Start the job
In the left-side navigation pane, click O&M > Deployments.
In the Actions column of the target job, click Start.
Select Initial Mode and click Start.
When the job status changes to RUNNING, Flink is actively capturing changes from MySQL and writing them to the Hudi result table. For startup parameter details, see Start a deployment.
Step 4: Verify the data lake
In the left-side navigation pane, click Development > Scripts.
Run the following query to confirm the initial rows were written to the data lake.
SELECT * FROM dlf.dlf_testdb.student_hudi;

The result shows the two rows inserted in Step 1.
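The query above only confirms the initial snapshot. To check that ongoing changes also reach the lake, you can modify the source table and query the result table again. The statements below are a sketch of such a check. The row values are made up for illustration, and the first three statements run against MySQL while the last runs in the Flink SQL editor. Hudi commits data when a Flink checkpoint completes, so the changes may take up to a checkpoint interval to become visible.

```sql
-- Run in MySQL: apply one insert, one update, and one delete to the source table
INSERT INTO testdb.student VALUES (3,'name3',30);
UPDATE testdb.student SET age = 21 WHERE id = 2;
DELETE FROM testdb.student WHERE id = 1;

-- Run in the Flink SQL editor after the next checkpoint completes:
-- the result table should reflect the three changes above
SELECT * FROM dlf.dlf_testdb.student_hudi;
```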

If you have an EMR cluster with DLF metadata enabled, you can also query the Hudi table through the EMR cluster. See Integrate Hudi with Spark SQL.
What's next
To read from and write to Hudi tables in DLF using Flink in an EMR DataFlow cluster, see Use a Dataflow cluster to read data from and write data to Hudi tables based on DLF.