This topic describes how to use the extract, transform, and load (ETL) feature to analyze real-time orders.

Scenarios

You can use the ETL feature to efficiently integrate large amounts of real-time data. The feature supports drag-and-drop operations and minimizes hand-coding effort, which facilitates data analysis, decision making, reporting, and real-time data computing. The ETL feature is suitable for the following data processing scenarios:
  • Centralized management of multi-region or heterogeneous data in real time: To facilitate centralized and efficient management and decision-making, you can store heterogeneous data or data from different regions in the same database in real time.
  • Real-time reporting: To improve the efficiency of reporting and facilitate digital transformation, you can build a real-time reporting system. The system is suitable for various real-time analysis scenarios.
  • Real-time computing: You can clean the streaming data generated on the business side in real time to extract feature values and tags. Typical scenarios include online business computing models (such as profiling, risk control, and recommendations) or real-time big screens.

Background information

The ETL feature can be used to merge transaction data and business data, and ship the data that meets specified filter conditions to a data warehouse in real time. Transaction data includes the order number, customer ID, product or commodity code, transaction amount, and transaction time. Business data includes the product code, product unit price, and product name. For example, you can filter the real-time transaction information of orders whose total price exceeds USD 3,000. After the transaction data is shipped to the data warehouse, you can analyze the data across multiple dimensions, such as the product and customer dimensions. In addition, you can use tools to create a large screen and gain insights into dynamic data based on your business requirements.
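Conceptually, the ETL task in this topic joins the transaction data with the business data and keeps only the orders above the price threshold. The following statement is only an illustrative sketch of that logic; the Flink SQL that DTS actually generates is produced by the components that you configure in the steps below and may differ.

-- Illustrative sketch only: join transaction data with business data
-- and keep the orders whose total price exceeds 3000.00.
SELECT o.order_id,
       o.user_id,
       o.product_id,
       o.total_price,
       o.order_date,
       p.product_name,
       p.product_price
FROM   test_orders o
JOIN   product p ON o.product_id = p.product_id
WHERE  o.total_price > 3000.00;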

Implementation process

Process of configuring an ETL task

Warning To ensure that you can successfully configure and run an ETL task, read and follow the Prerequisites and Precautions before you configure the task.
Step Description
Before you begin Store transaction data and business data in source tables and create destination tables based on your business requirements.
Note In this example, the real-time transaction table, the business table, and the destination table are stored in ApsaraDB RDS for MySQL instances.
Step 1: Configure the source database Set the table that stores real-time transaction data as a stream table and set the table that stores business data as a dimension table.
Step 2: Configure the JOIN component Join the dimension table and the stream table into a wide table.
Step 3: Configure the Table Record Filter component Specify filter conditions to filter data in the wide table. For example, you can filter orders that exceed USD 3,000 in price.
Step 4: Configure the destination database Load the processed data to the destination table in real time.
Step 5: Precheck and start the task Precheck and start the ETL task.

Before you begin

Before you configure an ETL task, you must store real-time transaction data in a stream table and business data in a dimension table in the source ApsaraDB RDS for MySQL instance.

You must create tables in the destination ApsaraDB RDS for MySQL instance based on your business requirements.

Note The following statements create the transaction data table (test_orders), the business data table (product), and the destination table (test_orders_new).
CREATE TABLE test_orders (
    order_id bigint NOT NULL COMMENT 'Order ID',
    user_id bigint NOT NULL COMMENT 'User ID',
    product_id bigint NOT NULL COMMENT 'Product ID',
    total_price decimal(15,2) NOT NULL COMMENT 'Total Price',
    order_date timestamp NOT NULL COMMENT 'Order Date',
    PRIMARY KEY (order_id)
);
CREATE TABLE product (
    product_id bigint NOT NULL COMMENT 'Product ID',
    product_name varchar(20) COMMENT 'Product Name',
    product_price decimal(15,2) NOT NULL COMMENT 'Unit Price',
    -- A primary key is required because this table serves as the dimension (right) table in a temporal join.
    PRIMARY KEY (product_id)
);
CREATE TABLE test_orders_new (
    order_id bigint NOT NULL COMMENT 'Order ID',
    user_id bigint NOT NULL COMMENT 'User ID',
    product_id bigint NOT NULL COMMENT 'Product ID',
    total_price decimal(15,2) NOT NULL COMMENT 'Total Price',
    order_date timestamp NOT NULL COMMENT 'Order Date',
    product_id_2 bigint NOT NULL COMMENT 'Product ID',
    product_name varchar(20) COMMENT 'Product Name',
    product_price decimal(15,2) NOT NULL COMMENT 'Unit Price',
    PRIMARY KEY (order_id)
);
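To test the task, you can insert a few sample rows into the source tables. The values below are only illustrative.

-- Illustrative sample data for the source tables.
INSERT INTO product (product_id, product_name, product_price) VALUES
    (1001, 'Laptop', 1200.00),
    (1002, 'Phone', 800.00);

INSERT INTO test_orders (order_id, user_id, product_id, total_price, order_date) VALUES
    (1, 20001, 1001, 3600.00, NOW()),   -- meets the filter condition and is shipped to the destination table
    (2, 20002, 1002, 800.00, NOW());    -- does not meet the filter condition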

Step 1: Configure the source database

  1. Go to the ETL page.
    Note

    You can also perform the following steps to configure an ETL task in the DMS console:

    • Log on to the DMS console.
    • In the top navigation bar, click DTS.
    • In the left-side navigation pane, click Streaming ETL.
    • Click Create Data Flow in the upper-left corner. In the Create Data Flow dialog box, specify an ETL task name in the Data Flow Name field, set Processing Method to Stream Processing, and set Development Method to DAG Development.
    • Click OK.
  2. In the upper-left corner of the page, select the region where you create the ETL task.
    Note You can create an ETL task only in the China (Hangzhou), China (Beijing), or China (Zhangjiakou) region. Select a region based on the actual scenario.
  3. In the upper-left corner of the page, click Create Task.
  4. In the Create ETL Task step, set parameters for the ETL task.
  5. Set parameters for the stream table and the dimension table.
    1. Set parameters for the stream table.
      1. In the Input/Dimension Table section on the left side of the page, select MySQL and drag it to the canvas on the right side of the page.
      2. Click MySQL on the canvas. In the Input/Dimension Table: MySQL (Stream Table) section, set parameters for the stream table.
      3. On the Node Settings tab, set the required parameters.
        Parameter Description
        Data Source Name DTS automatically generates a data source name. We recommend that you specify an informative name for easy identification. You do not need to use a unique name.
        Region Select the region where the source database resides.
        Database Connection Template Select the name of the template that stores the connection settings of the source database. You can also click Create Template to create a connection template. For more information, see Create a connection template.
        Node Type Select Stream Table or Dimension Table.

        In this example, select Stream Table.
        Conversion Format ETL converts a stream into a dynamic table and performs continuous queries on the dynamic table to generate a new dynamic table. In this process, the dynamic table is continuously changed by the INSERT, UPDATE, and DELETE operations. When the dynamic table is finally written to the destination database, the new dynamic table is converted back into a stream. When the new dynamic table is converted into a stream, you must specify the conversion format to encode the changes of the dynamic table.
        • Upsert Stream: The data in a dynamic table can be modified by using the INSERT, UPDATE, and DELETE operations. When the dynamic table is converted into a stream, the INSERT and UPDATE operations are encoded as upsert messages and the DELETE operations are encoded as delete messages.
          Note A dynamic table that is converted into an upsert stream requires a (possibly composite) unique key.
        • Append-Only Stream: The data in a dynamic table can be modified only by the INSERT operation. When the dynamic table is converted into a stream, only the inserted rows are emitted.
        Select Databases and Tables Select the databases and tables that you want to transform.
      4. After the node is configured, you are redirected to the Output Fields tab. On this tab, select the column names based on your needs and set the related parameters.
      5. Click the Time Attributes tab and set the parameters.
        Parameter Description
        Select Event Time Watermark Select a time field in the stream table. In a stream table, a time field is defined to represent the time when the data is generated. The time field is usually a business timestamp, for example, the order_date field in this topic.
        Latency of Event Time Watermark Enter the maximum data latency that you can accept.

        A latency may exist between the time when data is generated and the time when data is processed in ETL. ETL cannot wait for the delayed data indefinitely. You can use this parameter to specify the maximum latency for ETL to process out-of-order data. For example, if the data generated at 10:00 has arrived, but the data generated at 9:59 has not arrived, ETL will only wait until "10:00 + latency". If the data generated at 9:59 arrives after "10:00 + latency", ETL discards the data.

        Processing Time The time when data is processed in ETL. Enter a column name. ETL saves the local time at which the data is processed in this column. The processing time is used in operator computations. For example, a temporal join uses the processing time to join with the latest version of a standard table. (A sketch of how the event time watermark and the processing time map to Flink SQL time attributes is provided at the end of this step.)
      Note If Configured appears in MySQL Stream Table, the stream table is configured.
    2. Set parameters for the dimension table.
      1. In the Input/Dimension Table section on the left side of the page, select MySQL and drag it to the canvas on the right side of the page.
      2. Click MySQL on the canvas. In the Input/Dimension Table: MySQL (Dimension Table) section, set parameters for the dimension table.
      3. On the Node Settings tab, set the required parameters.
        Parameter Description
        Data Source Name DTS automatically generates a data source name. We recommend that you specify an informative name for easy identification. You do not need to use a unique name.
        Region Select the region where the source database resides.
        Database Connection Template Select the name of the template that stores the connection settings of the source database. You can also click Create Template to create a connection template. For more information, see Create a connection template.
        Node Type Select Stream Table or Dimension Table.

        In this example, select Dimension Table.
        Select Databases and Tables Select the databases and tables that you want to transform.
      4. After the node is configured, you are redirected to the Output Fields tab. On this tab, select the column names based on your needs and set the related parameters.
      Note If Configured appears in MySQL Dimension Table, the dimension table is configured.
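The stream table settings above correspond, roughly, to time attributes in a Flink SQL table definition. The following sketch is only an illustration; the table name test_orders_stream, the column name proc_time, and the 5-second watermark latency are assumptions, and this is not the SQL that DTS generates.

-- Illustrative sketch of how the stream table settings map to Flink SQL time attributes.
-- test_orders_stream, proc_time, and the 5-second latency are assumptions.
CREATE TABLE test_orders_stream (
    order_id    BIGINT,
    user_id     BIGINT,
    product_id  BIGINT,
    total_price DECIMAL(15,2),
    order_date  TIMESTAMP(3),
    proc_time AS PROCTIME(),   -- Processing Time: the local time at which ETL processes the row
    WATERMARK FOR order_date AS order_date - INTERVAL '5' SECOND   -- Event Time Watermark with a 5-second latency
);  -- The connector options (WITH clause) are managed by DTS and are omitted in this sketch.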

Step 2: Configure the JOIN component

  1. In the Transform section on the left side of the page, select JOIN and drag it to the canvas on the right side of the page.
  2. Move the pointer over the stream table, and click the dot to draw a line between the stream table and the JOIN component. Then, move the pointer over the dimension table, and click the dot to draw a line between the dimension table and the JOIN component.
    Connect the source tables and the JOIN component
  3. Click the JOIN component on the canvas. In the Transform: JOIN section, configure the JOIN component.
    1. On the Node Settings tab, set the required parameters.
      JOIN_Node Settings
      Section Parameter Description
      Transformation Name Enter Transformation Name DTS automatically generates a name for the transformation component. We recommend that you specify an informative name for easy identification. You do not need to use a unique name.
      JOIN Settings Left Table in JOIN Clause Select the left table in the JOIN clause. The table is used as the primary table. In this example, select the stream table.
      Temporal Join Time Attribute (Regular joins apply if not selected) Select the time attribute of the stream table that is used to associate with the temporal table when a temporal join is performed. If you do not specify this parameter, a regular join is used. In this example, select Based on Processing Time.
      Note
      • A temporal table, also known as a dynamic table, is a view based on a table that records the change history of data over time. Temporal tables include versioned tables and standard tables. A versioned table can provide access to historical versions of its data. A standard table provides only the latest version of its data.
      • A temporal join requires that time attributes be defined for the stream tables and that the right table have a primary key. If the right table is a dimension table, the primary key of the table must be contained in the specified JOIN conditions.
      • Based on Event Time Watermark: uses the event time of the stream table to associate with the corresponding version of a versioned table.
      • Based on Processing Time: uses the processing time of the stream table to associate with the latest version of a standard table. (A Flink SQL sketch of such a temporal join is provided at the end of this step.)
      Select JOIN Operation Select a JOIN operation. In this example, select Inner Join.
      • Inner Join: returns only the rows that have matches in both tables.
      • Left Join: returns all rows from the left table and the matched rows from the right table.
      • Right Join: returns the matched rows from the left table and all rows from the right table.
      JOIN Conditions + Add Condition Click + Add Condition and select JOIN conditions.
      Note The fields on the left side of the equal sign (=) belong to the left table. The fields on the right side of the equal sign (=) belong to the right table.
  4. After you configure the JOIN conditions, click the Output Fields tab. On this tab, select the column names based on your needs and set the related parameters.
Note If Configured appears in JOIN, the JOIN component is configured.
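For reference, the JOIN component configured in this step corresponds, roughly, to a processing-time temporal join in Flink SQL. The following statement is only an illustrative sketch; the names test_orders_stream, product, and proc_time follow the assumptions in the earlier sketch, and this is not the SQL that DTS generates.

-- Illustrative sketch of a processing-time temporal join between the stream table and the dimension table.
SELECT o.order_id,
       o.user_id,
       o.product_id,
       o.total_price,
       o.order_date,
       p.product_id   AS product_id_2,
       p.product_name,
       p.product_price
FROM   test_orders_stream AS o
JOIN   product FOR SYSTEM_TIME AS OF o.proc_time AS p
       ON o.product_id = p.product_id;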

Step 3: Configure the Table Record Filter component

  1. In the Transform section on the left side of the page, select Table Record Filter and drag it to the canvas on the right side of the page.
  2. Move the pointer over the JOIN component, and click the dot to draw a line between the JOIN component and the Table Record Filter component.
  3. Click the Table Record Filter component on the canvas. In the Transform: Table Record Filter section, configure the Table Record Filter component.
    Connect the JOIN and Table Record Filter components
    1. In the Transformation Name section, enter the name of the Table Record Filter component.
      Note DTS automatically generates a name for the transformation component. We recommend that you specify an informative name for easy identification. You do not need to use a unique name.
    2. In the WHERE Condition section, specify a WHERE condition by using one of the following methods:
      • Enter the WHERE condition. For example, enter total_price>3000.00 to filter for the rows in the joined table whose total_price value is greater than 3000.00.
      • Click an option in the Field Input or Operator section to specify the WHERE condition.
Note If Configured appears in Table Record Filter, the Table Record Filter component is configured.
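For reference, the filter configured in this step corresponds, roughly, to a WHERE clause applied to the joined wide table. The following statement is only an illustrative sketch; wide_table is a placeholder for the output of the JOIN component.

-- Illustrative sketch of the Table Record Filter applied to the joined wide table.
SELECT *
FROM   wide_table
WHERE  total_price > 3000.00;
-- Multiple conditions can be combined, for example:
-- WHERE total_price > 3000.00 AND product_price > 500.00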

Step 4: Configure the destination database

  1. In the Output section on the left side of the page, select MySQL and drag it to the canvas on the right side of the page.
  2. Move the pointer over the Table Record Filter component, and click the dot to draw a line between the Table Record Filter component and the destination MySQL database.
  3. In the Output: MySQL section, set the required parameters.
    1. On the Node Settings tab, set the required parameters.
      Configure the destination database
      Parameter Description
      Data Source Name DTS automatically generates a data source name. We recommend that you specify an informative name for easy identification. You do not need to use a unique name.
      Region Select the region where the destination database resides.
      Database Connection Template Select the name of the template that stores the connection settings of the destination database. You can also click Create Template to create a connection template. For more information, see Create a connection template.
      Table Mapping Select the table to be stored in the destination database.

      In the Select Destination Table section, click the destination table.

  4. On the Field Mapping tab, select the column names based on your needs and set the related parameters.
Note If Configured appears in Destination MySQL database, the destination database is configured.

Step 5: Precheck and start the task

  1. After you complete the preceding configurations, click Generate Flink SQL Validator. ETL generates and validates Flink SQL statements.
  2. After Flink SQL validation is completed, click View ETL Validation Details. In the dialog box that appears, view the Flink SQL generation results and SQL statements. Confirm the results and click Close.
    Note If the validation fails, you can troubleshoot the failure based on the causes displayed in the generated results.
  3. Click Next: Save Task Settings and Precheck in the lower part of the page. DTS can start the ETL task only after the task passes the precheck. If the task fails to pass the precheck, click View Details next to each failed item. Troubleshoot the issues based on the causes and run a precheck again.
  4. After the precheck is completed, click Next: Purchase Instance in the lower part of the page.
  5. On the Purchase Instance page, specify the Instance Class and Compute Units (CUs) parameters. Then, read and select Data Transmission Service (Pay-as-you-go) Service Terms and Service Terms for Public Preview.
    Note During public preview, each user can create two ETL instances for free.
  6. Click Buy and Start to start the ETL task.

Running results

In this example, after the ETL task is started (for example, on August 1), if the newly inserted or updated data in the real-time transaction table test_orders meets the filter condition (total_price>3000.00, that is, the total order price is greater than 3000.00), the data is synchronized to the destination table test_orders_new.
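To verify the result, you can query the destination table in the destination ApsaraDB RDS for MySQL instance. The following statement is only an illustrative sketch.

-- Illustrative check of the data shipped to the destination table.
SELECT order_id, user_id, product_name, total_price, order_date
FROM   test_orders_new
WHERE  total_price > 3000.00
ORDER BY order_date DESC;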

Figure 1. Real-time transaction data table: test_orders
Figure 2. Destination table: test_orders_new

What to do next

If you want to visualize the data in the destination table, we recommend that you use the data visualization feature provided by Data Management (DMS). This feature is suitable for analysis scenarios such as trend analysis and growth comparison. For more information, see Data visualization.