All-in-one Lake Migration of Multiple Data Sources

This article briefly introduces the Alibaba Cloud Data Lake Formation (DLF) service and explains how it addresses the challenges of migrating data from heterogeneous data sources into a data lake.

Background

A data lake is a centralized storage repository that supports various data types, including structured, semi-structured, and unstructured data. Its data sources include database data, incremental binlog data, log data, and historical data in existing data warehouses. A data lake stores and manages all of this data centrally in cost-effective object storage services such as Object Storage Service (OSS), and it provides a unified way for external systems to analyze the data. This helps enterprises eliminate "data islands" and reduce data storage costs.

Because a data lake draws on such a variety of data sources, enterprises find it difficult to migrate data from heterogeneous sources into centralized data lake storage when building a data lake. A complete, all-in-one lake migration solution is needed to solve this problem. It must cover the following aspects:

  • Support unified lake migration of data from heterogeneous data sources

A simple, unified solution makes it easy to migrate data to the lake. You can migrate data from heterogeneous data sources through simple page configurations.

  • Ensure the timeliness of lake migration

For data from sources such as logs and binlogs, lake migration must complete with minute-level latency to meet the timeliness requirements of real-time interactive analysis.

  • Support real-time data source changes

For data sources such as databases and Tablestore tunnels, the source data may change frequently through update and delete operations and through changes to the field structure of the schema. File formats that handle these changes well are therefore needed.

Alibaba Cloud has launched the Data Lake Formation (DLF) service, which provides a complete lake migration solution.

Overall Solution

The following figure shows the technical lake migration solution of DLF:

[Figure: technical lake migration solution of DLF]

The lake migration solution consists of four parts: the lake migration template, the lake migration engine, the file format, and data lake storage.

Lake Migration Template

The lake migration template defines common migration methods. Currently, there are five templates: the RDS full template, the DTS incremental template, the Tablestore template, the SLS template, and the file format conversion template.

You select the template that matches your data source, fill in the source-related parameters to create the template, and then submit it to the lake migration engine to run.

Lake Migration Engine

The lake migration engine uses Spark Streaming SQL and the EMR Spark engine, both developed by the Alibaba Cloud E-MapReduce (EMR) team. Streaming SQL is based on Spark Structured Streaming and provides a complete set of Streaming SQL syntax, which reduces the development cost of real-time computing. A real-time incremental template is translated into Streaming SQL and submitted to the Spark cluster to run. The Merge Into syntax is also extended in Streaming SQL to support update and delete operations. Full templates, such as the Relational Database Service (RDS) full template, are translated directly into Spark SQL to run.
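As a rough illustration of what translating a template into SQL could look like, the following Python sketch renders a hypothetical full-load template into one Spark SQL statement. The class `FullTemplate`, its fields (`source_table`, `target_table`, `columns`), and the function `render_full_template` are assumptions made for this example. They are not DLF's actual template fields or API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FullTemplate:
    """Hypothetical parameters a user might fill in for a full-load template."""
    source_table: str
    target_table: str
    columns: List[str]


def render_full_template(template: FullTemplate) -> str:
    """Render the template into a single Spark SQL statement (illustrative only)."""
    if not template.columns:
        raise ValueError("at least one column is required")
    column_list = ", ".join(template.columns)
    return (
        f"INSERT OVERWRITE TABLE {template.target_table} "
        f"SELECT {column_list} FROM {template.source_table}"
    )


if __name__ == "__main__":
    tpl = FullTemplate("rds_orders", "lake_orders", ["id", "amount"])
    print(render_full_template(tpl))
```

The point of the sketch is the separation of concerns: the user supplies only parameters, and the engine owns the SQL that is generated from them.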

File Format

DLF supports several file formats, including Delta Lake, Parquet, and JSON, as well as formats such as Hudi. Delta Lake and Hudi provide good support for update and delete operations and for schema merging, which effectively addresses real-time changes in the data source.
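To make the idea of schema merging concrete, here is a minimal Python sketch that combines an existing table schema with the schema of incoming data. It only illustrates the concept. It is not how Delta Lake or Hudi implement schema evolution, and the function name `merge_schema` and the use of plain type strings are assumptions.

```python
from typing import Dict


def merge_schema(current: Dict[str, str], incoming: Dict[str, str]) -> Dict[str, str]:
    """Return the union of two schemas, keeping the existing column order.

    New columns from `incoming` are appended. A column that appears in both
    schemas with different types raises ValueError.
    """
    merged = dict(current)
    for name, dtype in incoming.items():
        if name in merged and merged[name] != dtype:
            raise ValueError(
                f"type conflict on column {name!r}: {merged[name]} vs {dtype}"
            )
        merged.setdefault(name, dtype)
    return merged


if __name__ == "__main__":
    table_schema = {"id": "bigint", "name": "string"}
    batch_schema = {"id": "bigint", "email": "string"}
    print(merge_schema(table_schema, batch_schema))
```

With this behavior, a column added at the source appears as a new column in the lake table instead of causing the migration to fail.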

Data Lake Storage

Data in the data lake is stored in OSS, which can hold large amounts of data and offers advantages in reliability and price.

The all-in-one lake migration solution addresses each of the problems described above:

  • Support unified lake migration of data from heterogeneous data sources

Template configuration provides a unified and simple way to migrate data to the lake.

  • Ensure the timeliness of lake migration

Streaming SQL, developed in-house, completes lake migration with minute-level latency, which meets the timeliness requirement.

  • Support real-time data source changes

File formats such as Delta Lake handle data changes caused by update and delete operations in real time.

Real-time Lake Migration

As big data continues to develop, users demand ever more timely data, which makes real-time lake migration critical. Currently, Alibaba Cloud supports real-time lake migration from Data Transmission Service (DTS), Tablestore, and Log Service (SLS).

Real-time Lake Migration of DTS Incremental Data

Alibaba Cloud offers DTS, a highly reliable data transmission service that supports subscribing to and consuming incremental data from different types of databases. DLF can migrate subscribed DTS data to the lake in real time. You can use an existing subscription channel or have one created automatically, which greatly reduces configuration costs.

Technically, historical data in the lake is updated and deleted by applying the incremental data, so data changes become visible within minutes. To implement this, the MERGE INTO syntax in Streaming SQL is extended to connect to the corresponding Delta Lake APIs.

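-- Apply parsed binlog records to the Delta Lake table, matching on the primary key.
MERGE INTO delta_tbl AS target
USING (
  SELECT recordType, pk, ...
  FROM {{binlog_parser_subquery}}
) AS source
ON target.pk = source.pk
-- An UPDATE record overwrites the matching row.
WHEN MATCHED AND source.recordType = 'UPDATE' THEN
  UPDATE SET *
-- A DELETE record removes the matching row.
WHEN MATCHED AND source.recordType = 'DELETE' THEN
  DELETE
-- A record with no matching key is inserted as a new row.
WHEN NOT MATCHED THEN
  INSERT *
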
The data lake-based solution has advantages over the traditional approach of migrating binlogs to a data warehouse. A traditional data warehouse usually maintains two tables to store changing data such as database data: an incremental table that stores the daily log of database changes, and a full table that stores all merged historical data. The full table and the incremental table are then merged by primary key. The data lake-based solution is simpler and more timely.

Real-time Lake Migration of Tablestore Data

Alibaba Cloud provides Tablestore, a multi-model NoSQL database service that stores large amounts of structured data and supports fast query and analysis. Tablestore also provides tunnels through which data changes can be consumed in real time. DLF supports lake migration from Tablestore full, incremental, and full-incremental tunnels. A full tunnel contains all historical data, an incremental tunnel contains incremental changes, and a full-incremental tunnel contains both the full historical data and the incremental changes.
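
The difference between the three types can be sketched as follows. The enum `TunnelType` and the function `records_for` are hypothetical names used only for illustration, not identifiers from the Tablestore SDK, and the snapshot and change log are plain in-memory lists.

```python
from enum import Enum
from typing import Iterable, Iterator


class TunnelType(Enum):
    """Hypothetical labels for the three types described above."""
    FULL = "full"
    INCREMENTAL = "incremental"
    FULL_AND_INCREMENTAL = "full_and_incremental"


def records_for(
    tunnel_type: TunnelType,
    snapshot: Iterable[object],
    changes: Iterable[object],
) -> Iterator[object]:
    """Yield the records a consumer would see for the given type."""
    if tunnel_type in (TunnelType.FULL, TunnelType.FULL_AND_INCREMENTAL):
        yield from snapshot   # all historical data
    if tunnel_type in (TunnelType.INCREMENTAL, TunnelType.FULL_AND_INCREMENTAL):
        yield from changes    # incremental changes only


if __name__ == "__main__":
    history = ["row1", "row2"]
    log = ["update row1"]
    for kind in TunnelType:
        print(kind.value, list(records_for(kind, history, log)))
```

In the full-incremental case the sketch emits the historical data first and the changes afterwards, which is an assumed ordering chosen so that changes are applied on top of the full data.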

Real-time Lake Migration of SLS Logs

Alibaba Cloud provides SLS, an all-in-one service for storing user log data. Archiving SLS log data to a data lake in real time lets you analyze and process it further to explore the full value of the data. To migrate SLS logs to the data lake in real time, use the lake migration template and fill in a small amount of information, such as the project and Logstore.

Summary

The lake migration solution reduces the migration cost of data from heterogeneous data sources. It also satisfies the timeliness requirements of data sources such as SLS and DTS and supports real-time data source changes. You can uniformly store data from different data sources in the OSS-based centralized data lake storage. Thus, you can avoid the data island problem and build a solid foundation for unified data analysis.

In the future, the all-in-one lake migration solution from Alibaba Cloud will continue to improve its features and support more types of data sources. Lake migration templates will gain more capabilities, including custom extract-transform-load (ETL) processing to improve flexibility. Continued performance optimization will also deliver better timeliness and stability.

Alibaba EMR
