Data Lake Formation: Use Flink and DLF to import data to data lakes and analyze the data

Last Updated: Mar 26, 2026

When OLTP databases grow large, teams often shard data across multiple databases and tables to sustain throughput, but sharding makes unified analysis difficult. Data Lake Formation (DLF) integrates with Realtime Compute for Apache Flink, which is built on the Ververica Platform (VVP), and with Flink Change Data Capture (CDC) to consolidate that data in a data lake in real time. DLF then provides centralized metadata management, so you can query the same data from multiple analytics engines, such as E-MapReduce (EMR), MaxCompute, and Hologres, without moving it again.

This tutorial walks you through an end-to-end pipeline: capture changes from a MySQL database with Flink CDC, write them to an Apache Hudi result table, sync the table metadata to a DLF catalog, and query the data lake with Flink SQL. None of the steps requires Java or Scala code.

How it works

Architecture diagram showing MySQL → Flink CDC → Apache Hudi result table → DLF catalog → analytics engines

Realtime Compute for Apache Flink reads change events from a MySQL source using the mysql-cdc connector. Flink writes those changes to an Apache Hudi result table and simultaneously syncs the table schema to a DLF catalog. Once metadata is registered in DLF, you can query the Hudi table directly from the Flink SQL console or from connected compute engines. DLF also handles data lake lifecycle management and lake format optimization, keeping data accessible and cost-efficient over time.

Prerequisites

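Before you begin, make sure you have the resources that the steps below rely on: an ApsaraDB RDS for MySQL instance that you can log in to, a Realtime Compute for Apache Flink workspace that can connect to that instance, and access to Data Lake Formation (DLF) for creating a DLF catalog.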

Step 1: Prepare MySQL data

  1. Log in to the ApsaraDB RDS for MySQL instance. See Connect to an ApsaraDB RDS for MySQL instance.

  2. Run the following SQL to create a test database and table, then insert sample data.

    CREATE DATABASE testdb;
    CREATE TABLE testdb.student (
      `id` bigint(20) NOT NULL,
      `name` varchar(256) DEFAULT NULL,
      `age` bigint(20) DEFAULT NULL,
      PRIMARY KEY (`id`)
    );
    
    INSERT INTO testdb.student VALUES (1,'name1',10);
    INSERT INTO testdb.student VALUES (2,'name2',20);
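Before moving on, it can help to confirm that the sample rows exist and that binary logging is configured in a way that change capture can read. The following statements are a minimal sketch to run in the same MySQL session. They assume that the mysql-cdc connector relies on row-based binary logs, which is how Debezium-based MySQL CDC generally works; the values named in the comments reflect that assumption rather than anything stated in this tutorial.

```sql
-- Confirm that the two sample rows from the INSERT statements are present.
SELECT id, name, age FROM testdb.student ORDER BY id;

-- Check the binary log settings that change capture typically depends on.
-- Assumption: log_bin should be ON and binlog_format should be ROW.
SHOW VARIABLES LIKE 'log_bin';
SHOW VARIABLES LIKE 'binlog_format';
```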

Step 2: Create a DLF catalog in Flink

  1. Log in to the Realtime Compute for Apache Flink console.

  2. On the Fully Managed Flink tab, click Console in the Actions column of the target workspace.

  3. In the left-side navigation pane, click Catalogs, then click Create Catalog.

  4. On the Create Catalog page, select DLF and click Next.

  5. Enter the catalog configuration, then click Confirm. For parameter details, see Manage DLF catalogs.

    Create Catalog configuration page

After the catalog is created, it appears as dlf under Catalogs. This is the default data catalog for DLF.

Catalogs pane showing the dlf catalog
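You can also check the catalog from the SQL editor instead of the Catalogs pane. The following is a minimal sketch using standard Flink SQL statements. It assumes that the catalog is named dlf, as described above, and that the script editor in your workspace accepts these statements.

```sql
-- List every catalog registered in the workspace; dlf should be among them.
SHOW CATALOGS;

-- Switch to the DLF catalog and list the databases it already contains.
USE CATALOG dlf;
SHOW DATABASES;
```

Note that USE CATALOG changes the current catalog for the session in which it runs. If your editor keeps session state between runs, switch back to your previous catalog before creating tables that are meant to live outside dlf.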

Step 3: Create a Flink data lake job

Create the source and result tables

  1. In the left-side navigation pane, click Development > Scripts.

  2. In the SQL editing area, enter the following SQL and click Run.

    -- Create a source table that reads change events from MySQL.
    -- Column types mirror testdb.student from Step 1 (MySQL bigint maps to BIGINT).
    CREATE TABLE IF NOT EXISTS student_source (
      id BIGINT,
      name VARCHAR(256),
      age BIGINT,
      PRIMARY KEY (id) NOT ENFORCED
    )
    WITH (
      'connector' = 'mysql-cdc',
      -- Replace with the endpoint of your ApsaraDB RDS for MySQL instance
      'hostname' = 'rm-xxxxxxxx.mysql.rds.aliyuncs.com',
      'port' = '3306',
      'username' = '<RDS username>',
      'password' = '<RDS password>',
      'database-name' = '<RDS database>',
      -- Set to the name of the source table created in Step 1
      'table-name' = 'student'
    );
    
    -- Create the target database in the DLF catalog
    -- Replace 'dlf' with your DLF catalog name if different
    CREATE DATABASE IF NOT EXISTS dlf.dlf_testdb;
    
    -- Create the Apache Hudi result table in the DLF catalog
    CREATE TABLE IF NOT EXISTS dlf.dlf_testdb.student_hudi (
      id    BIGINT PRIMARY KEY NOT ENFORCED,
      name  STRING,
      age   BIGINT
    ) WITH (
      'connector' = 'hudi'
    );

    | Placeholder | Description | Example |
    | --- | --- | --- |
    | rm-xxxxxxxx.mysql.rds.aliyuncs.com | Endpoint of the ApsaraDB RDS for MySQL instance | rm-bp1xxxxxxxx.mysql.rds.aliyuncs.com |
    | <RDS username> | MySQL user with read access to the source table | admin |
    | <RDS password> | Password for the MySQL user | |
    | <RDS database> | Name of the source database | testdb |

    After the tables are created, both appear under Catalogs.

    Catalogs pane showing the newly created source and result tables

    For the full list of mysql-cdc connector parameters, see MySQL source connector. For Hudi result table parameters, see Hudi connector (to be retired).
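To double-check the two definitions before building the job, you can inspect their schemas from the script editor. This is a minimal sketch using the standard Flink SQL DESCRIBE statement, assuming the table and catalog names used in the statements above.

```sql
-- Show the columns and types of the CDC source table.
DESCRIBE student_source;

-- Show the columns and types of the Hudi result table registered in DLF.
DESCRIBE dlf.dlf_testdb.student_hudi;
```

The columns of the two tables should line up in order and type, because an INSERT ... SELECT * statement maps columns by position rather than by name.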

Create and deploy the streaming job

  1. In the left-side navigation pane, click Development > ETL.

  2. Click New, select Blank Stream Draft in the New Draft dialog box, and click Next.

  3. Enter the draft configuration and click Create.

  4. In the SQL editing area, enter the following INSERT statement.

    -- Stream changes from the MySQL source into the Hudi result table
    INSERT INTO dlf.dlf_testdb.student_hudi
    SELECT * FROM student_source /*+ OPTIONS('server-id'='123456') */;

  5. In the upper-right corner of the SQL editing area, click Deploy. In the Deploy draft dialog box, fill in the required fields and click Confirm.
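The OPTIONS hint in the INSERT statement overrides a connector option for that query only. Here it sets the server-id that the source uses when it connects to MySQL as a binlog client. The variation below is a sketch, not a required change. It assumes that your connector version accepts a server-id range, as the open source Flink CDC MySQL connector does, and the range 5401-5404 is an arbitrary illustrative value.

```sql
-- Same pipeline, with explicit columns so that a later change to the source
-- schema does not silently shift column positions.
-- Assumption: 'server-id' accepts a range; 5401-5404 is an illustrative value
-- that must not collide with other binlog clients of the same MySQL instance.
INSERT INTO dlf.dlf_testdb.student_hudi
SELECT id, name, age
FROM student_source /*+ OPTIONS('server-id'='5401-5404') */;
```

If you use a range, it should contain at least as many IDs as the source parallelism of the job, so that each parallel reader gets its own ID.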

Start the job

  1. In the left-side navigation pane, click O&M > Deployments.

  2. In the Actions column of the target job, click Start.

  3. Select Initial Mode and click Start.

When the job status changes to RUNNING, Flink is actively capturing changes from MySQL and writing them to the Hudi result table. For startup parameter details, see Start a deployment.

Step 4: Verify the data lake

  1. In the left-side navigation pane, click Development > Scripts.

  2. Run the following query to confirm the initial rows were written to the data lake.

    SELECT * FROM dlf.dlf_testdb.student_hudi;

    The result shows the two rows inserted in Step 1.

    Query results showing the two initial rows
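To see change capture beyond the initial snapshot, you can modify the source table and run the query again. The following is a minimal sketch, assuming the job from Step 3 is still running. The row values are illustrative, and the delay before changes appear depends on the checkpoint interval of the job, because the Hudi writer typically commits data when a checkpoint completes.

```sql
-- Run in MySQL: insert, update, and delete a row in the source table.
INSERT INTO testdb.student VALUES (3, 'name3', 30);
UPDATE testdb.student SET age = 21 WHERE id = 2;
DELETE FROM testdb.student WHERE id = 1;

-- Run in the Flink script editor after the next checkpoint completes.
-- The result should reflect the three changes above.
SELECT * FROM dlf.dlf_testdb.student_hudi;
```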

If you have an EMR cluster with DLF metadata enabled, you can also query the Hudi table through the EMR cluster. See Integrate Hudi with Spark SQL.

What's next

To read from and write to Hudi tables in DLF using Flink in an EMR DataFlow cluster, see Use a Dataflow cluster to read data from and write data to Hudi tables based on DLF.