An administrator of a shopping platform wants to view the daily order quantities and transaction amounts by commodity category on the platform. This topic describes how to use the task orchestration feature of Data Management (DMS) to configure a cross-database Spark SQL node. The administrator can use the node to periodically synchronize the data of the order and commodity tables from an online database to a data warehouse for data analytics. Then, the analysis results are synchronized from the data warehouse to the online database for data queries by the administrator.

Prerequisites

  • An online MySQL database that is used to store the order and commodity tables is prepared, and you have change permissions on the database. For more information about how to apply for permissions, see Overview.
  • An AnalyticDB for MySQL data warehouse for data processing is prepared, and you have change permissions on the data warehouse. For more information about how to apply for permissions, see Overview.

Background information

The online database of a shopping platform generates a large amount of data. If you analyze the data directly in the online database, the response time of the online database increases. In severe cases, the online database may fail to respond to business requests. Therefore, before you analyze the data, you need to synchronize it from the online database to a data warehouse for processing. After the data is processed in the data warehouse, you need to synchronize the analysis results back to the online database. This way, your online application can provide data analytics and statistics services.

Procedure

  1. Create tables.
  2. Create a cross-database Spark SQL node.
  3. Configure the cross-database Spark SQL node.
  4. Publish the task flow.

Create tables

In the MySQL database, create an order table, a commodity table, and a statistics table. In the AnalyticDB for MySQL data warehouse, create an order table and a commodity table to store the data that is synchronized from the corresponding tables in the MySQL database, and a statistics table to store the analysis results.

  1. Log on to the DMS console V5.0.
  2. In the top navigation bar, choose SQL Console > SQL Console.
  3. In the Please select the database first dialog box, select the MySQL database from the drop-down list and click Confirm.
  4. In the MySQL database, create an order table named t_order, a commodity table named t_product, and a statistics table named t_order_report_daily for receiving analysis results.
    1. Create an order table named t_order. Copy and paste the following SQL statement to the SQL editor and click Execute.
      CREATE TABLE `t_order` (
        `id` bigint(20) NOT NULL AUTO_INCREMENT COMMENT 'Primary key',
        `product_id` bigint(20) NOT NULL COMMENT 'Commodity ID',
        `gmt_create` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT 'Creation time',
        `gmt_modified` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT 'Modification time',
        `customer_id` bigint(20) NOT NULL COMMENT 'Customer ID',
        `price` decimal(14,2) NOT NULL COMMENT 'Price',
        `status` varchar(64) NOT NULL COMMENT 'Order status',
        `province` varchar(256) DEFAULT NULL COMMENT 'Province',
        PRIMARY KEY (`id`),
        KEY `idx_product_id` (`product_id`),
        KEY `idx_customer_id` (`customer_id`),
        KEY `idx_status` (`status`)
      ) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8 ROW_FORMAT=COMPACT COMMENT='Order table';
    2. Create a commodity table named t_product. Copy and paste the following SQL statement to the SQL editor and click Execute.
      CREATE TABLE `t_product` (
        `id` bigint(20) NOT NULL AUTO_INCREMENT COMMENT 'Primary key',
        `name` varchar(128) NOT NULL COMMENT 'Commodity name',
        `type` varchar(64) NOT NULL COMMENT 'Commodity type',
        `gmt_create` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT 'Creation time',
        `gmt_modified` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT 'Modification time',
        PRIMARY KEY (`id`)
      ) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8 COMMENT='Commodity table';
    3. Create a statistics table named t_order_report_daily. Copy and paste the following SQL statement to the SQL editor and click Execute.
      CREATE TABLE `t_order_report_daily` (
        `dt` varchar(64) NOT NULL COMMENT 'Date on which statistics are collected',
        `product_type` varchar(64) NOT NULL COMMENT 'Commodity type',
        `cnt` bigint(64) NOT NULL COMMENT 'Order quantity',
        `amt` decimal(38,2) NOT NULL COMMENT 'Transaction amount',
        PRIMARY KEY (`dt`, `product_type`)
      ) ENGINE=InnoDB DEFAULT CHARSET=utf8 COMMENT='Statistics table';
  5. In the top navigation bar, choose SQL Console > SQL Console.
  6. In the Please select the database first dialog box, select the AnalyticDB for MySQL data warehouse from the drop-down list and click Confirm.
  7. In the AnalyticDB for MySQL data warehouse, create an order table named t_order, a commodity table named t_product, and a statistics table named t_order_report_daily for storing analysis results.
    Note For more information about the SQL statements that are used to create the t_order, t_product, and t_order_report_daily tables, see Step 4 of this section.
  8. Write test data to the t_order and t_product tables in the MySQL database. You can use the test data generation feature of DMS to generate test data. For more information, see Generate test data. Alternatively, you can insert a few sample rows manually, as shown in the sketch after this list.
    • Generate 20 million records of test data in the t_order table.
      Note If you use an instance that is managed in Flexible Management mode, you can use the test data generation feature to generate up to one million records of test data in a ticket.
    • Generate 10,000 records of test data in the t_product table.
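
If you only want to verify that the task flow works before you generate large volumes of test data, you can insert a few sample rows manually. The following statements are a minimal sketch that matches the table definitions above: the commodity names, types, prices, and provinces are placeholder values, and the product_id values assume that the two commodities received the auto-increment IDs 1 and 2. The gmt_create and gmt_modified columns are omitted because they default to CURRENT_TIMESTAMP.

    /*Insert two placeholder commodities*/
    INSERT INTO t_product (name, type) VALUES
      ('Sample sneakers', 'Footwear'),
      ('Sample kettle', 'Appliances');

    /*Insert three placeholder orders that reference the commodities above*/
    INSERT INTO t_order (product_id, customer_id, price, status, province) VALUES
      (1, 1001, 59.90, 'PAID', 'Zhejiang'),
      (2, 1002, 129.00, 'PAID', 'Jiangsu'),
      (1, 1003, 59.90, 'CANCELLED', 'Guangdong');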

Create a cross-database Spark SQL node

  1. Log on to the DMS console V5.0.
  2. In the top navigation bar, click DTS. In the left-side navigation pane, choose Data Development > Task Orchestration.
  3. Create a task flow.
    1. Click Create Task Flow.
    2. In the Create Task Flow dialog box, set the Task Flow Name and Description parameters and click OK.
  4. Find the task flow that you created in the preceding step and click the name of the task flow to go to the details page of the task flow. In the Task Type list on the left side of the canvas, drag the Cross-Database Spark SQL node to the canvas.

Configure the cross-database Spark SQL node

  1. On the details page of the task flow, double-click the cross-database Spark SQL node that you created on the canvas.
  2. Configure a node variable. In this step, ${today} is used as an example. For more information about variables, see Variables.
    1. Click the Variable Setting tab on the right side.
    2. Click the Node Variable tab.
    3. Set the Variable Name and Time Format parameters. In this example, set the Variable Name parameter to today and the Time Format parameter to yyyy-MM-dd.
  3. Add the MySQL database that stores commodity orders.
    1. In the Database Reference section, click Add Database Reference.
    2. On the page that appears, set the Database Type, Database, and Alias in Spark SQL parameters. In this example, set the Alias in Spark SQL parameter to demo_id. For more information about the parameters, see Configure a cross-database Spark SQL node.
    3. Click Save.
  4. Add the AnalyticDB for MySQL data warehouse that is used for data processing.
    1. Click the plus (+) icon to the right of the demo_id database.
    2. Set the Database Type, Database, and Alias in Spark SQL parameters. In this example, set the Alias in Spark SQL parameter to company.
    3. Click Save.
  5. Write Spark SQL statements in the SQL editor and click Save.
    /*Synchronize data*/
    /*Synchronize full data from the t_product table*/
    insert overwrite company.t_product
    select id,
           name,
           type,
           gmt_create,
           gmt_modified
      from demo_id.t_product;
    
    /*Synchronize incremental data from the t_order table*/
    insert into company.t_order
    select id,
           product_id,
           gmt_create,
           gmt_modified,
           customer_id,
           price,
           status,
           province
      from demo_id.t_order
     where gmt_create >= '${bizdate}'
       and gmt_create < '${today}';
    
    /*Process the data and group it by commodity category*/
    insert into company.t_order_report_daily
    select '${bizdate}',
           p.type as product_type,
           count(*) as order_cnt,
           sum(price) as order_amt
      from company.t_product p
      join company.t_order o on o.product_id = p.id
     where o.gmt_create >= '${bizdate}'
       and o.gmt_create < '${today}'
     group by product_type;
    
    /*Synchronize the processed data from the AnalyticDB for MySQL data warehouse to the MySQL database*/
    insert into demo_id.t_order_report_daily
    select dt,
           product_type,
           order_cnt,
           order_amt
      from company.t_order_report_daily
     where dt = '${bizdate}';
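
For reference, the following sketch shows how the incremental synchronization statement resolves at run time. It assumes that the task flow runs on 2023-02-02 and that the node variables are configured so that ${bizdate} resolves to the day before the run date and ${today} resolves to the run date; adjust the dates if your variables use different offsets or formats.

    /*With ${bizdate} = '2023-02-01' and ${today} = '2023-02-02',
      the incremental step selects exactly one day of new orders*/
    insert into company.t_order
    select id, product_id, gmt_create, gmt_modified,
           customer_id, price, status, province
      from demo_id.t_order
     where gmt_create >= '2023-02-01'
       and gmt_create < '2023-02-02';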

Publish the task flow

  1. On the details page of the task flow, click Try Run in the upper-left corner of the canvas.
    Click the Execution Logs tab to view the execution results.
    • If status SUCCEEDED appears in the last line of the logs, the dry run is successful.
    • If status FAILED appears in the last line of the logs, the dry run fails.
      Note If the dry run fails, view the node on which the failure occurs and the reason for the failure in the logs. Then, modify the configuration of the node and try again.
    If the dry run is successful, you can also query the statistics data that was synchronized to the t_order_report_daily table in the MySQL database, as shown in the example query at the end of this topic.
  2. Enable scheduling for the task flow.
    1. Click the blank area on the canvas.
    2. Click the Task Flow Information tab.
    3. In the Scheduling Settings section, turn on Enable Scheduling. For more information, see Configure a task flow.
  3. Publish the task flow. After the task flow is published, it automatically runs based on the specified scheduling cycle.
    1. Click Publish in the upper-left corner of the canvas.
    2. In the Publish dialog box, set the Remarks parameter and click OK.
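
After the task flow runs on schedule, you can check the daily analysis results in the MySQL database. The following query is a minimal sketch; replace the date literal with the business date that you want to inspect.

    /*View the order quantity and transaction amount per commodity category for one day*/
    SELECT dt, product_type, cnt, amt
      FROM t_order_report_daily
     WHERE dt = '2023-02-01'
     ORDER BY amt DESC;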