DLF + DDI Best Practices for One-Stop Data Lake Formation and Analysis

Preface

This article was written jointly by the Alibaba Cloud Data Lake Formation (DLF) Team and the Databricks DataInsight (DDI) Team. It aims to give readers have a deeper understanding of Alibaba Cloud Data Lake Formation and Databricks DataInsight to build a one-stop cloud data lake solution.

Authors

Chen Xinwei (Xikang), Technical Expert of Alibaba Cloud Computing Platform
Feng Jialiang (Jialiang), Technical R&D of Alibaba Cloud Computing Platform

Background

Since the development in the data era, the amount of data is growing explosively, and the data forms are becoming more diverse. The traditional data warehouse model has increasingly prominent problems, such as high costs, slow response, and few supported formats. As a result, a data lake with a lower cost of ownership, richer data formats, and more flexible analysis and computing capabilities was created.

As a centralized data warehouse, data lakes support a variety of data types, including structured, semi-structured, and unstructured data. The data sources include database data, incremental binlog data, log data, and inventory data on existing data warehouses. Data lakes can manage the centralized storage of these data from different sources and formats in the cost-effective storage, Object Storage Service (OSS), and provide a unified data catalog to support multiple computing and analysis methods. This effectively solves the problem of data silos faced by enterprises and reduces the cost of storing and using data for enterprises significantly.

Data Lake Architecture and Key Technologies

The enterprise-level data lake architecture is pictured below:

Data Lake Storage and Format

Data lake storage mainly uses Object Storage Service as the main medium, which has the advantages of low costs, high stability, and high scalability.

We can use data lake storage formats that support ACID (atomicity, consistency, isolation, durability), such as Delta Lake, Hudi, and Iceberg. These data lake formats have their own data meta management capabilities and can support operations, such as Update and Delete. They solve the problem of real-time data updates in big data scenarios in a batch-stream integrated manner. In the current solution, we mainly introduce the core capabilities and use scenarios of Delta Lake.

Core Competencies of Delta Lake

Delta Lake is a unified data management system that brings data reliability and fast analysis to data lakes on the cloud. Delta Lake runs on top of an existing data lake and is fully compatible with Apache Spark's API. With Delta Lake, you can speed up the import of high-quality data into your data lake. Teams can also use this data quickly on cloud services, securely and in a scalable way.

Atomicity, Consistency, Isolation, and Durability Transactional: Delta Lake provides atomicity, consistency, isolation, and durability transactional between multiple write operations. Each write operation is a transaction operation. The write operations recorded in the transaction log have a sequence. Transaction logs track write operations at the file level and use optimistic locks for concurrency control. This is very suitable for data lakes because it does not try to modify multiple write operations of the same file very often. When a conflict occurs, Delta Lake throws a concurrent modification exception, which is thrown for users to process and retry their jobs. Delta Lake also provides the highest level of isolation (serializable fencing), allowing engineers to write data to a directory or table continuously while customers read data from the same directory or table continuously. When reading data, they will see the latest snapshot of the data.
Schema Management: Delta Lake verifies whether the schema of the DataFrame being written is compatible with the schema of the table automatically. If a column exists in the table but does not exist in the DataFrame, it is set to null. If an additional column in the DataFrame is not in the table, the operation throws an exception. Delta Lake can explicitly add new columns in Data Definition Language (DDL) and update schemas automatically.
Scalable Metadata Processing: Delta Lake stores the metadata of tables or directories in transaction logs instead of the metadata Metastore. This allows Delta Lake to list files in large directories for a fixed period of time. This method is efficient when reading data.
Data Version Control and Time Travel: Delta Lake allows users to read historical version snapshots of tables or directories. When a file is modified during writing, Delta Lake creates a new version of the file and retains the old version. When users want to read an older version of a table or directory, they can provide a timestamp or version number to Apache Spark's read API. Delta Lake builds a full snapshot of that timestamp or version based on the information in the transaction log. This is very convenient for users to reproduce experiments and reports. If necessary, they can also restore the table to the old version.
Integrated Batch Stream: In addition to batch writing, Delta Lake can be used as an efficient streaming sink for structured streams of Apache Spark. Combined with atomicity, consistency, isolation, durability transactions, and scalable metadata processing, an efficient streaming sink supports a large number of near real-time analytics use cases without maintaining complex streaming and batch pipelines.
Record Updates and Deletes: Delta Lake will support Data Management Language (DML) commands for merging, updating, and deleting. This makes it easy for engineers to insert and delete records in the data lake and simplify their change data capture and General Data Protection Regulation (GDPR) use cases. Since Delta Lake tracks and modifies data at the file level, it is much more efficient than reading and overwriting entire partitions or tables.

Data Lake Formation and Management

1. Data Ingestion

The raw data of an enterprise exists in a variety of databases or storage systems, such as the relational database MySQL, the log system SLS, the NoSQL database HBase, and the message database Kafka. Most of the online storage is for online transactional businesses and is not suitable for online analysis scenarios. Therefore, data needs to be synchronized in a non-intrusive manner to Object Storage Service that is lower in cost and more suitable for computing and analysis.

Commonly used data synchronization methods include batch synchronization based on data synchronization tools, such as DataX and Sqoop. At the same time, in scenarios with high real-time requirements, streaming synchronization links, such as Kafka + Spark Streaming / Flink, are used together. Currently, many cloud vendors provide a one-stop solution to enter the lake to help customers realize data entering the lake in a faster and lower-cost way, such as Alibaba Cloud DLF data entering the lake.

2. Unified Metadata Service

Object Storage Service (OSS) does not have semantics for big data analysis. It needs to combine metadata services, such as Hive Metastore Service to provide metadata information for various upper-layer analysis engines. The design goal of the data lake metadata service is to build a unified metadata view of different storage systems, formats, and different computing engines in an environment of big data engine and storage diversity. It should have unified permissions and metadata. It should also be compatible and expand the open-source big data ecosystem metadata service, support the automatic acquisition of metadata, and achieve the purpose of managing multiple uses at one time so that it can be compatible with the open-source ecosystem. It is also easy to use.

Data Lake Computing and Analysis

Compared with data warehouses, data lakes connect to a variety of different computing engines in a more open way, such as the traditional open-source big data computing engines Hive, Spark, Presto, and Flink. They also support big data engines developed by cloud service providers, such as Alibaba Cloud MaxCompute and Hologres. Between data lake storage and computing engines, data lake acceleration services are generally provided to improve the performance of computing and analysis, while reducing the cost and pressure of bandwidth.

Databricks DataInsight – Spark Data Computing and Analytics Engine for Business Edition

As a fully managed Spark analytics engine on Alibaba Cloud, DataBricks DataInsight helps users calculate and analyze data in a data lake.

SaaS Fully-Managed Spark: It does not need to pay attention to underlying resources. It reduces O&M costs and focuses on analytics services.
Complete Spark Technology Stack Integration: It boasts one-stop integration of the Spark engine and Delta Lake and is 100% compatible with the open-source Spark community edition. It uses Databricks for commercial support, the fastest experience of the latest version of Spark features.
Total Cost Reduction: The commercial versions of Spark and Delta Lake have significant performance advantages. At the same time, based on the compute-storage separation architecture, storage relies on Alibaba Cloud Object Storage Service (OSS) and is accelerated by the Alibaba Cloud JindoFS cache layer. This can effectively reduce the overall cost of cluster usage.
High-Quality Support and SLA Guarantee: Alibaba Cloud and Databricks provide technical support covering the full stack of Spark, commercial SLA guarantee, and 24/7 Databricks expert support services.

Databricks DataInsight + DLF Data Lake Formation and Flow Batch Integration Analysis Practice

When an enterprise builds and uses a data lake, the entire process involves data ingestion, data lake storage and management, and data lake exploration and analytics. This article mainly introduces how to build a one-stop data entry into the lake based on Alibaba Cloud Data Lake Formation (DLF) and Databricks DataInsight (DDI).

Stream Processing Scenarios:

Real-Time Scenario maintenance updates two Delta tables:

delta_aggregates_func table: RDS data enters the lake in real-time.
Table of delta_aggregates_metrics: Industrial metric data is collected to Kafka through the IoT platform and entered into the lake in real-time through the Spark Structured Streaming.

Batch Processing Scenarios:

Two DELTAs are generated in real-time scenarios as data sources for data analysis, and the execution of Spark jobs are scheduled through Databrick DataInsight.

Prerequisites

1. Activate the Service

Make sure that cloud services, such as DLF, OSS, Kafka, DDI, RDS, and DTS, are activated. Note: DLF, RDS, Kafka, and DDI instances must be in the same region.

2. Prepare RDS Data

RDS Data Preparation: Create the dlfdb database in RDS. Create a user account that can read engine_funcs databases in the account center, such as dlf_admin.

Log on to the database through DMS, run the following statement to create the engine_funcs table, and insert a small amount of data.

CREATE TABLE `engine_funcs` (  `emp_no` int(11) NOT NULL,
  `engine_serial_number` varchar(20) NOT NULL,
  `engine_serial_name` varchar(20) NOT NULL,
  `target_engine_serial_number` varchar(20) NOT NULL,
  `target_engine_serial_name` varchar(20) NOT NULL,
  `operator` varchar(16) NOT NULL,
  `create_time` DATETIME NOT NULL,
  `update_time` DATETIME NOT NULL,
  PRIMARY KEY (`emp_no`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8

INSERT INTO `engine_funcs` VALUES (10001,'1107108133','temperature','1107108144','temperature','/', now(), now());
INSERT INTO `engine_funcs` VALUES (10002,'1107108155','temperature','1107108133','temperature','/', now(), now());
INSERT INTO `engine_funcs` VALUES (10003,'1107108155','runTime','1107108166','speed','/', now(), now());
INSERT INTO `engine_funcs` VALUES (10004,'1107108177','pressure','1107108155','electricity','/', now(), now());
INSERT INTO `engine_funcs` VALUES (10005,'1107108188','flow' ,'1107108111','runTime','/', now(), now());

Real-Time Entry of RDS Data into the Lake

1. Create a Data Source

Go to the DLF console → https://dlf.console.aliyun.com/cn-hangzhou/home and choose Data Into Lake → Data Source Management
Click Create Data Source, enter the connection name, select the RDS instance used in data preparation, enter the account password, and click "Connection Test" to verify network connectivity and account availability

Click Next and OK to create the data source

2. Create a Metadatabase

Create a bucket in OSS named databricks-data-source

On the left-side navigation pane, choose "metadata management" → "metabase" and click "new metabase." Enter a name, create a new directory dlf/, and choose the directory.

3. Create a Task to Enter the Lake

Choose "data into the lake" → "into the lake task management" and click "new into the lake task"
Select Relational Database to enter the lake in real-time. Then, enter the data source, target data lake, task configuration, and other information shown in the following figure. Then, save it.
Configure the data source, select the newly created "dlf" connection, use the table path "dlf/engine_funcs," select the new DTS subscription, and enter a name:

Go back to the Task Management page and click "Run" to create a new task for entering the lake. You will see the task enter the "Initializing" state and then enter the "Running" state.
Click "Details" to enter the task details page. You can see the corresponding database table information:

This data entry task belongs to the full and incremental entry task. After about 3-5 minutes, the full data will be imported and enter the real-time monitoring states automatically. If the data is updated, the data is updated to Delta Lake automatically.

Data Lake Exploration and Analytics

DLF Data Query Exploration

DLF products provide lightweight data preview and exploration functions. Choose "Data Exploration" → "SQL Query" to enter the data query page.

In the metadatabase table, find "fjl_dlf" and expand it to see that the engine_funcs_delta table has been created automatically. Double-click the table name, and the SQL statement for querying the table appears in the SQL edit box on the right. Then, click "run" to obtain the data query result.

Go back to the DMS console and run the DELETE and INSERT SQL statements below.

DELETE  FROM 'engine_funcs' where 'emp_no'=10001;
UPDATE 'engine_funcs' SET 'operator'='+', 'update_time'=NOW ()  WHERE 'emp_no'=10002;
INSERT  INTO 'engine_funcs' VALUES  (20001, '1107108199', 'speed', '1107108122', 'runTime', '*', now (), now ());

After about 1-3 minutes, execute the select statement in DLF data exploration again. All data updates have been synchronized to the data lake.

Create a Databricks DataInsight (DDI) Cluster

After the cluster is created, click "Details" to go to the details page and add the current access machine IP address whitelist.

Click Notebook to go to the Interactive Analytics page to query the engine_funcs_delta table data synchronized to Delta Lake.

IoT Platform Collects Data from Cloud Kafka and Writes It to Delta Lake in Real-Time

1. Introducing `spark-sql-kafka` tripartite Dependencies

% spark. conf

spark. jars. packages org. apache. spark :spark-sql-kafka -0-10_2.12:3.0.1

2. Using UDF Functions to Define Merge Rules for Writing Stream Data to Delta Lake

The format of the test data sent to Kafka is listed below:

{"sn": "1107108111","temperature": "12" ,"speed":"1115", "runTime":"160","pressure":"210","electricity":"380","flow":"740","dia":"330"}
{"sn": "1107108122","temperature": "13" ,"speed":"1015", "runTime":"150","pressure":"220","electricity":"390","flow":"787","dia":"340"}
{"sn": "1107108133","temperature": "14" ,"speed":"1215", "runTime":"140","pressure":"230","electricity":"377","flow":"777","dia":"345"}
{"sn": "1107108144","temperature": "15" ,"speed":"1315", "runTime":"145","pressure":"240","electricity":"367","flow":"730","dia":"430"}
{"sn": "1107108155","temperature": "16" ,"speed":"1415", "runTime":"155","pressure":"250","electricity":"383","flow":"750","dia":"345"}
{"sn": "1107108166","temperature": "10" ,"speed":"1515", "runTime":"145","pressure":"260","electricity":"350","flow":"734","dia":"365"}
{"sn": "1107108177","temperature": "12" ,"speed":"1115", "runTime":"160","pressure":"210","electricity":"377","flow":"733","dia":"330"}
{"sn": "1107108188","temperature": "13" ,"speed":"1015", "runTime":"150","pressure":"220","electricity":"381","flow":"737","dia":"340"}
{"sn": "1107108199","temperature": "14" ,"speed":"1215", "runTime":"140","pressure":"230","electricity":"378","flow":"747","dia":"345"}

%spark
import org.apache.spark.sql._
import io.delta.tables._
def upsertToDelta(microBatchOutputDF: DataFrame, batchId: Long) {
  microBatchOutputDF.createOrReplaceTempView("dataStream")
  // Execute the column-to-row operation for streaming data to the DF;
    val df=microBatchOutputDF.sparkSession.sql(s"""
                select `sn`,
                 stack(7, 'temperature', `temperature`, 'speed', `speed`, 'runTime', `runTime`, 'pressure', `pressure`, 'electricity', `electricity`, 'flow', `flow` , 'dia', `dia`) as (`name`, `value` )
                 from  dataStream  
    """)
    df.createOrReplaceTempView("updates")
    // Update dynamic data in real time and merge the result into the table.
    val mergedf=df.sparkSession.sql(s"""
    MERGE INTO delta_aggregates_metrics t
    USING updates s
    ON  s.sn = t.sn and s.name=t.name
    WHEN MATCHED THEN UPDATE SET 
    t.value = s.value,
    t.update_time=current_timestamp()
    WHEN NOT MATCHED THEN INSERT 
    (t.sn,t.name,t.value ,t.create_time,t.update_time)
   values (s.sn,s.name,s.value,current_timestamp(),current_timestamp())
  """)
}

3. Writing to Delta Lake Using Spark Structured Streaming Real-Time Streams

%spark
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.Trigger

def getquery(checkpoint_dir:String,servers:String,topic:String ){
    var streamingInputDF =  
  spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", servers)
    .option("subscribe", topic)     
    .option("startingOffsets", "latest")  
    .option("minPartitions", "10")  
    .option("failOnDataLoss", "true")
    .load()
var streamingSelectDF = 
  streamingInputDF
   .select(
    get_json_object(($"value").cast("string"), "$.sn").alias("sn"),
   get_json_object(($"value").cast("string"), "$.temperature").alias("temperature"),
   get_json_object(($"value").cast("string"), "$.speed").alias("speed"),
   get_json_object(($"value").cast("string"), "$.runTime").alias("runTime"),
   get_json_object(($"value").cast("string"), "$.electricity").alias("electricity"),
   get_json_object(($"value").cast("string"), "$.flow").alias("flow"),
   get_json_object(($"value").cast("string"), "$.dia").alias("dia"),
   get_json_object(($"value").cast("string"), "$.pressure").alias("pressure")
   )
val query = streamingSelectDF
      .writeStream
      .format("delta")
      .option("checkpointLocation", checkpoint_dir)
      .trigger(Trigger.ProcessingTime("5 seconds")) // Execute stream processing interval
      .foreachBatch(upsertToDelta _) // Reference the upsertToDelta function.
      .outputMode("update")
      .start()
}

4. Execute Program

% spark
val my_checkpoint_dir="oss://databricks-data-source/checkpoint/ck"
val servers="***.***.***.***:9092" 
val topic="your-topic"
getquery(my_checkpoint_dir,servers,topic)

5. Start Kafka and Send Test Data to Production

Query data is written and updated in real-time

Query engine_funcs_delta data synchronized from MySQL to the lake in real-time

% spark
val rds_dataV=spark. table ("fjl_dlf.engine_funcs_delta")
rds_dataV. show ()

Batch Jobs

You need to join the Value parameter in the corresponding delta_aggregates_metrics to the engine_funcs_delta table in combination with the business.

%spark
// Read the updated delta_aggregates_metrics data table in real time.
val aggregateDF=spark.table("log_data_warehouse_dlf.delta_aggregates_metrics")
// Read the engine_funcs_delta function table that is updated in real time.
val rds_dataV=spark.table("fjl_dlf.engine_funcs_delta").drop("create_time","update_time")
// rds_dataV.show()
val aggregateSDF= aggregateDF.withColumnRenamed("value","esn_value").withColumnRenamed("name","engine_serial_name").withColumnRenamed("sn","engine_serial_number")
// aggregateSDF.show()
val aggregateTDF=aggregateDF.withColumnRenamed("value","tesn_value").withColumnRenamed("name","target_engine_serial_name").withColumnRenamed("sn","target_engine_serial_number").drop("create_time","update_time")
// aggregateTDF.show()
// Join the Value parameter in the corresponding delta_aggregates_metrics to the engine_funcs_delta table;
val  resdf=rds_dataV.join(aggregateSDF,Seq("engine_serial_name","engine_serial_number"),"left").join(aggregateTDF,Seq("target_engine_serial_number","target_engine_serial_name"),"left")
            .selectExpr("engine_serial_number","engine_serial_name","esn_value","target_engine_serial_number","target_engine_serial_name","tesn_value","operator","create_time","update_time")

// Data display
resdf.show(false)
// Write the result to the Delta table.
resdf.write.format("delta")
    .mode("append")
    .saveAsTable("log_data_warehouse_dlf.delta_result")

Performance Optimization: OPTIMIZE & Z-Ordering

In stream processing scenarios, a large number of small files are generated. The existence of a large number of small files seriously affects the read performance of the data system. Delta Lake provides the OPTIMIZE command to merge and compress small files. In addition, for Ad-Hoc query scenarios, we use the Z-Ordering mechanism provided by Delta Lake to improve the query performance because it involves queries on data in multiple dimensions of a single table. This improves the performance of reading tables significantly. Delta Lake provides the Auto Optimize option, but it sacrifices a small amount of write performance and increases the latency for writing data to the delta table. On the contrary, executing the OPTIMIZE command does not affect the write performance because Delta Lake supports MVCC and simultaneous write operations of OPTIMIZE. Therefore, we adopt the scheme of triggering the execution of OPTIMIZE periodically, merging small files through OPTIMIZE every hour, and executing VACCUM to clean up stale data files:

OPTIMIZE log_data_warehouse_dlf. delta_result ZORDER by engine_serial_number;

VACUUM log_data_warehouse_dlf. delta_result RETAIN 1 HOURS;

References

I. Introduction to the Basics and Performance of Delta Lake:

Article: Delta-Lakes-Fundamentals-and-Performance.pdf

II. Delta Lake Optimization:

Article:
https://databricks.com/blog/2018/07/31/processing-petabytes-of-data-in-seconds-with-databricks-delta.html
Documentation:
https://help.aliyun.com/document_detail/148379.html (Article in Chinese)

Click the link below for more detailed information about Databricks DataInsight:
https://databricks.com/product/alibaba

Click the link below for more detailed information about Data Lake Analytics:
https://www.alibabacloud.com/product/data-lake-analytics