
Data Lake Formation: End-to-end data lake formation and analysis practice based on Delta Lake

Last Updated: Apr 24, 2025

When building and applying a data lake, enterprises typically go through several stages, such as data ingestion, data lake storage and management, and data lake exploration and analysis. This topic describes an end-to-end data ingestion and analysis practice based on Alibaba Cloud Data Lake Formation (DLF).

Background information

As the data era continues to develop, data volumes are growing explosively and data takes increasingly diverse forms. The traditional data warehouse model increasingly shows limitations such as high cost, slow response, and limited supported formats. As a result, data lakes have emerged, offering lower costs, richer data forms, and more flexible analysis and computation.

As a centralized data storage repository, a data lake supports a variety of data types, including structured, semi-structured, and unstructured data. Data sources include database data, binlog incremental data, log data, and historical data from existing data warehouses. A data lake can centrally store and manage data from these different sources and formats in cost-effective storage such as Object Storage Service (OSS) and provide a unified data catalog for external use. It supports various computation and analysis methods, effectively solving the data silo problem faced by enterprises while significantly reducing the cost of storing and using data.

Enterprise-level data lake architecture

Data lake storage and format

Data lake storage primarily uses OSS as its medium, which offers advantages such as low cost, high stability, and high scalability.

In data lakes, we can use data lake storage formats that support ACID transactions, such as Delta Lake, Hudi, and Iceberg. These formats carry their own metadata management capabilities and support operations such as Update and Delete, solving the problem of real-time data updates in big data scenarios in a unified batch and streaming manner.
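
For a concrete sense of what these formats enable, the following Spark SQL sketch performs row-level changes on a Delta Lake table. The table names (lake_db.employees and lake_db.employees_changes) are hypothetical placeholders, and the statements assume a Spark session with Delta Lake support enabled.

-- Row-level update and delete on a Delta Lake table (Spark SQL syntax)
UPDATE lake_db.employees SET last_name = 'Smith' WHERE emp_no = 10001;
DELETE FROM lake_db.employees WHERE emp_no = 10002;

-- Upsert a batch or stream of change records from a staging table
MERGE INTO lake_db.employees AS t
USING lake_db.employees_changes AS s
ON t.emp_no = s.emp_no
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;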

Data lake formation and management

1. Data ingestion

The raw data of enterprises resides in various databases or storage systems, such as relational databases like MySQL, log systems like SLS, NoSQL stores like HBase, and message queues like Kafka. Most of this online storage serves online transactional business and is not suitable for analytical workloads. Therefore, data needs to be synchronized, in a non-intrusive manner, to cost-effective object storage that is better suited to computation and analysis.

Common data synchronization methods include batch synchronization with tools such as DataX and Sqoop. In scenarios with high real-time requirements, streaming synchronization links such as Kafka + Spark Streaming or Flink are used. Many cloud vendors also provide one-stop data ingestion solutions, such as Alibaba Cloud DLF data ingestion, to help customers ingest data faster and at lower cost.
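
As an illustration of such a streaming link, the Flink SQL sketch below declares a MySQL change-data-capture source with the open-source MySQL CDC connector and reads its change stream. The connection values are placeholders, and it assumes the flink-sql-connector-mysql-cdc dependency is available; DLF data ingestion packages this kind of pipeline as a managed service.

-- Illustrative Flink SQL: capture MySQL changes as a stream (placeholder values)
CREATE TABLE employees_source (
  emp_no INT,
  first_name STRING,
  last_name STRING,
  hire_date DATE,
  PRIMARY KEY (emp_no) NOT ENFORCED
) WITH (
  'connector' = 'mysql-cdc',
  'hostname' = 'rds-host',
  'port' = '3306',
  'username' = 'dlf_admin',
  'password' = '<password>',
  'database-name' = 'dlf-demo',
  'table-name' = 'employees'
);

-- Full snapshot first, then continuous binlog changes; in practice the result
-- is written to a lake-format table instead of being queried interactively
SELECT * FROM employees_source;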

2. Unified metadata service

Object storage by itself does not carry the semantics needed for big data analysis. It must be combined with a metadata service, such as Hive Metastore Service, to provide meta information to the various upper-layer analysis engines.
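
With a unified catalog in place, every attached engine sees the same databases and tables. As a rough illustration, an engine such as Spark SQL that is configured to use the lake catalog can browse the metadata directly; the names below are the ones used later in this walkthrough.

-- Browse the shared catalog from any attached engine (Spark SQL syntax)
SHOW DATABASES;
SHOW TABLES IN dlf_demo_delta;
DESCRIBE TABLE dlf_demo_delta.employees;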

Data lake computation and analysis

Compared to data warehouses, data lakes connect with various compute engines in a more open manner, such as traditional open-source big data compute engines like Hive, Spark, Presto, and Flink. They also support big data engines developed by cloud vendors, such as Alibaba Cloud MaxCompute and Hologres. Between data lake storage and compute engines, data lake acceleration services are generally provided to improve computation and analysis performance while reducing bandwidth costs and pressure.
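
For example, once a compute engine is attached to the lake catalog, it can run analytical queries directly over the lake tables. The sketch below uses Spark SQL syntax and the table name from the later walkthrough as a placeholder.

-- Example analytical query over a lake table (Spark SQL syntax)
SELECT gender, COUNT(*) AS employee_count
FROM dlf_demo_delta.employees
GROUP BY gender;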

Flowchart

Data lake formation and analysis link

The main data link is as follows:

Step 1: Service activation and data preparation

1. Service activation

Ensure that cloud product services such as DLF, OSS, DDI, RDS, and DTS are activated. Note that DLF, RDS, and DDI instances must be in the same region.

2. Data preparation

Prepare RDS data: create a database named dlf-demo in RDS. In the account center, create an account that can read the dlf-demo database, such as dlf_admin.
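
The database and account are created in the RDS console in this walkthrough. If you prefer to do this in SQL, a rough equivalent looks like the following; the password is a placeholder, and write privileges are included only so that the later update, delete, and insert statements can also be run with this account.

-- Rough SQL equivalent of the console steps (MySQL syntax, placeholder password)
CREATE DATABASE `dlf-demo`;
CREATE USER 'dlf_admin'@'%' IDENTIFIED BY '<your_password>';
GRANT SELECT, INSERT, UPDATE, DELETE ON `dlf-demo`.* TO 'dlf_admin'@'%';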

Log on to the database through DMS and run the following statements to create the employees table and insert a small amount of data.

CREATE TABLE `employees` (
  `emp_no` int(11) NOT NULL,
  `birth_date` date NOT NULL,
  `first_name` varchar(14) NOT NULL,
  `last_name` varchar(16) NOT NULL,
  `gender` enum('M','F') NOT NULL,
  `hire_date` date NOT NULL,
  `create_time` DATETIME NOT NULL,
  `update_time` DATETIME NOT NULL,
  PRIMARY KEY (`emp_no`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

INSERT INTO `employees` VALUES (10001,'1953-09-02','Georgi','Facello','M','1986-06-26', now(), now());
INSERT INTO `employees` VALUES (10002,'1964-06-02','Bezalel','Simmel','F','1985-11-21', now(), now());

Step 2: Data ingestion

1. Create a data source

a. Go to the DLF console and click the menu "Data Ingestion" -> "Data Source Management".

b. Click "Create Data Source". Fill in the connection name, select the RDS instance used in data preparation, fill in the account password, and click "Connection Test" to verify network connectivity and account availability.

c. Click "Next" and then "Confirm" to complete the data source creation.

2. Create a metadatabase

a. Create a new OSS bucket named dlf-demo.

b. In the left-side navigation pane, choose "Metadata Management" -> "Metadatabase" and click "Create Metadatabase". Fill in the name, create a directory named delta-test, and select it.

3. Create an ingestion task

a. Click the menu "Data Ingestion" -> "Ingestion Task Management", and then click "Create Ingestion Task".

b. Select "Relational Database Real-time Ingestion", fill in the data source, target data lake, task configuration, and other information. Save it.

c. Configure the data source: select the newly created "dlf_demo" connection, use the table path "dlf_demo/employees", choose to create a new DTS subscription, and fill in its name.

d. Return to the task management page and click "Run" on the newly created ingestion task. The task first enters the "Initializing" status and then the "Running" status.

e. Click "Details" to enter the task details page, where you can see the relevant database table information.

This ingestion task performs both full and incremental ingestion. After about 3 to 5 minutes, the full data is imported, and the task then automatically switches to the real-time listening state. Any subsequent data changes are automatically applied to the Delta Lake table.

Step 3: Data lake exploration and analysis

The DLF product provides lightweight data preview and exploration functions. Click the menu "Data Exploration" -> "SQL Query" to enter the data query page.

a. In the metadatabase list, find "dlf_demo_delta". Expand it and you can see that the employees table has been created automatically. Double-click the table name, and a SQL query statement for the table appears in the SQL editor on the right (a sketch of this statement is shown after step c). Click "Run" to get the query result.

b. Return to the DMS console and run the following update, delete, and insert SQL statements.

UPDATE `employees` SET `first_name` = 'dlf-demo', `update_time` = now() WHERE `emp_no` = 10001;
DELETE FROM `employees` WHERE `emp_no` = 10002;
INSERT INTO `employees` VALUES (10011,'1953-11-07','dlf-insert','Sluis','F','1990-01-22', now(), now());

c. After about 1 to 3 minutes, run the previous select statement again in DLF Data Exploration. You will see that all of the data changes have been synchronized to the data lake.
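
For reference, the select statement used in steps a and c is a plain scan of the lake table. It is generated when you double-click the table name and looks roughly like the following; the exact statement produced by the console may differ.

-- Sketch of the generated query over the lake table
SELECT * FROM dlf_demo_delta.employees LIMIT 10;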