Stream-Batch Unified SQL Practice with Tablestore and Spark

Background

The e-commerce model is a business operation and profit model built on network and big-data technology. Data analysis and visualization are among the most important parts of e-commerce operations, and the e-commerce dashboard (large screen) is where the two come together. The dashboard covers both full-order aggregation and real-time order aggregation: full-order aggregation provides a panoramic, comprehensive view of the data, while real-time aggregation displays live operational metrics. This article combines Tablestore with Spark's unified stream-batch storage and computation to build such a dashboard and complete the analysis and visualization of e-commerce data. The finished result is shown below.

Architecture Design

In this dashboard walkthrough, a client inserts raw order data into Tablestore in real time. For real-time stream computing, Spark Structured Streaming counts the number of orders and the order amount within each window, writes the aggregated results back to Tablestore, and the results are then displayed on a DataV dashboard. For offline batch computing, Spark SQL aggregates the raw order data into an overall total and per-user totals; these results are also written back to Tablestore and displayed on the DataV dashboard. The architecture of the whole scenario is shown in the figure below.

Preparation

1. Create an Alibaba Cloud E-MapReduce Hadoop cluster. For details, see Creating a Cluster.
2. Download the latest E-MapReduce SDK package, named in the format `emr-datasources_shaded_*.jar`, which contains the Tablestore-related Spark batch and stream Source and Sink implementations.

Data source description

The data source is a simple raw order table, OrderSource. The table has two primary-key columns, UserId (user ID) and OrderId (order ID), and two attribute columns, price (order price) and timestamp (order time). Sample data is shown in the figure below.

Detailed Walkthrough of the Stream and Batch SQL Process

Create the Data Source Table

1. Log in to the EMR Header node and run the following command to start the SQL client used for batch and stream SQL computation, where emr-datasources_shaded_*.jar is the latest EMR SDK package downloaded during preparation.

2. Create the external table order_source for the raw order data (the Source table, mapped to the Tablestore table OrderSource), which will be used by the subsequent stream and batch SQL.
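The two steps above can be sketched as follows. This is a sketch under assumptions: the exact spark-sql resource flags and the Tablestore data source option names may vary with the EMR SDK version, and the endpoint, instance name, and credentials are placeholders you must replace with your own.

```shell
# Step 1: launch the Spark SQL client with the EMR SDK package.
# The yarn/resource flags are illustrative; tune them for your cluster.
spark-sql --jars emr-datasources_shaded_*.jar \
          --master yarn --num-executors 2 --executor-memory 2g
```

```sql
-- Step 2: create the Source external table order_source, mapped to the
-- Tablestore table OrderSource. The catalog JSON declares the two
-- primary keys (UserId, OrderId) and two attribute columns (price, timestamp).
CREATE TABLE order_source
USING tablestore
OPTIONS (
  endpoint = "http://<instance>.<region>.ots.aliyuncs.com",
  access.key.id = "<accessKeyId>",
  access.key.secret = "<accessKeySecret>",
  instance.name = "<instance>",
  table.name = "OrderSource",
  catalog = '{"columns": {"UserId": {"type": "string"}, "OrderId": {"type": "string"}, "price": {"type": "double"}, "timestamp": {"type": "long"}}}'
);
```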

Real-Time Stream Computing

Real-time stream computing counts the number of orders and the order amount within each window and writes the aggregated results back to Tablestore. First create the stream-computing Sink external table order_stream_sink (mapped to the Tablestore table OrderStreamSink), then run the stream-computing SQL for real-time aggregation; the aggregated results are written back to the Tablestore destination table in real time.

The parameters in the Sink table have the same meanings as those in the Source table; only the content of the catalog field differs. The Sink table has four columns: begin (window start time, primary-key column, in the format 2019-11-27 14:54:00), end (window end time, primary-key column), count (number of orders), and totalPrice (total order amount).
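The Sink table and the windowed streaming aggregation can be sketched like this. This is a sketch under assumptions: it uses the EMR connector's CREATE SCAN / CREATE STREAM syntax, the connection options are placeholders, and the checkpoint path, job name, and 30-second window size are illustrative choices rather than values from the original.

```sql
-- Sink external table, mapped to the Tablestore table OrderStreamSink,
-- with window bounds as primary keys and the aggregates as attributes.
CREATE TABLE order_stream_sink
USING tablestore
OPTIONS (
  endpoint = "http://<instance>.<region>.ots.aliyuncs.com",
  access.key.id = "<accessKeyId>",
  access.key.secret = "<accessKeySecret>",
  instance.name = "<instance>",
  table.name = "OrderStreamSink",
  catalog = '{"columns": {"begin": {"type": "string"}, "end": {"type": "string"}, "count": {"type": "long"}, "totalPrice": {"type": "double"}}}'
);

-- Expose the source table as a stream view.
CREATE SCAN order_stream_view ON order_source USING STREAM;

-- Windowed aggregation: count orders and sum the amount per 30-second
-- window, writing each window's result back to the sink in real time.
CREATE STREAM order_stream_job
OPTIONS (checkpointLocation = '/tmp/spark/cp/order_job', outputMode = 'update')
INSERT INTO order_stream_sink
SELECT
  CAST(window.start AS STRING) AS begin,
  CAST(window.end AS STRING) AS end,
  count(*) AS count,
  sum(price) AS totalPrice
FROM order_stream_view
GROUP BY window(to_timestamp(timestamp / 1000), '30 seconds');
```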

After the stream SQL starts, the aggregated results are produced in real time and stored in the OrderStreamSink table; a sample is shown in the figure below. Through the direct connection between Tablestore and DataV, the results can easily be drawn onto the DataV dashboard.

Offline Batch Computing

Offline batch computing aggregates the raw order data into an overall total and per-user totals. First create two Sink external tables to store the historical overall aggregate and the per-user aggregate, then run the batch SQL directly against the source table order_source to obtain the aggregated results.
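The two Sink tables can be sketched as follows. Same hedges as before: connection options are placeholders, and the Tablestore table names OrderBatchSink and OrderTotalSink are assumed here. Since every Tablestore table needs at least one primary-key column, UserId serves as the key of the per-user sink, and the overall-total sink is assumed to keep its result under a single fixed key.

```sql
-- Per-user aggregate sink: UserId is the primary key, with the order
-- count and total amount as attribute columns.
CREATE TABLE order_batch_sink
USING tablestore
OPTIONS (
  endpoint = "http://<instance>.<region>.ots.aliyuncs.com",
  access.key.id = "<accessKeyId>",
  access.key.secret = "<accessKeySecret>",
  instance.name = "<instance>",
  table.name = "OrderBatchSink",
  catalog = '{"columns": {"UserId": {"type": "string"}, "count": {"type": "long"}, "totalPrice": {"type": "double"}}}'
);

-- Overall-total sink: a single fixed-key row holding the historical
-- order count and total amount.
CREATE TABLE order_total_sink
USING tablestore
OPTIONS (
  endpoint = "http://<instance>.<region>.ots.aliyuncs.com",
  access.key.id = "<accessKeyId>",
  access.key.secret = "<accessKeySecret>",
  instance.name = "<instance>",
  table.name = "OrderTotalSink",
  catalog = '{"columns": {"count": {"type": "long"}, "totalPrice": {"type": "double"}}}'
);
```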

Run the following batch SQL to update the per-user aggregation results.

```sql
INSERT INTO order_batch_sink
SELECT UserId, count(*) AS count, sum(price) AS totalPrice
FROM order_source
GROUP BY UserId;
```

```
spark-sql> INSERT INTO order_batch_sink SELECT UserId, count(*) AS count, sum(price) AS totalPrice FROM order_source GROUP BY UserId;
Time taken: 5.107 seconds
```

Run the following batch SQL to update the overall aggregation results.

```sql
INSERT INTO order_total_sink
SELECT count(*) AS count, sum(price) AS totalPrice
FROM order_source;
```

```
spark-sql> INSERT INTO order_total_sink SELECT count(*) AS count, sum(price) AS totalPrice FROM order_source;
Time taken: 4.272 seconds
```
