Data Management (DMS) allows you to use the task orchestration feature to configure cross-database Spark SQL nodes. You can write Spark SQL statements in cross-database Spark SQL nodes to synchronize and process data across databases in complex business scenarios.

Prerequisites

One or more of the following databases are used:
  • MySQL: ApsaraDB RDS for MySQL, PolarDB for MySQL, MyBase for MySQL, PolarDB-X, AnalyticDB for MySQL, and third-party MySQL databases
  • SQL Server: ApsaraDB RDS for SQL Server, MyBase for SQL Server, and third-party SQL Server databases
  • PostgreSQL: ApsaraDB RDS for PostgreSQL, PolarDB for PostgreSQL, MyBase for PostgreSQL, AnalyticDB for PostgreSQL, and third-party PostgreSQL databases
  • Oracle
  • Db2
  • MaxCompute
  • Hologres
  • OSS

Scenarios

Cross-database Spark SQL nodes are used to synchronize and process data across databases.

  • Cross-database data synchronization:
    • Synchronize data from an online database to a data warehouse for data processing: In most cases, a large amount of data is generated in an online database. If you want to process and analyze the data in the online database, you can synchronize the data to a data warehouse that is designed for data processing and analytics. Then, you can analyze the data in the data warehouse.
    • Synchronize data from a data warehouse to an online database for data queries: After data is processed and analyzed in a data warehouse, you can synchronize the data back to the online database. This way, you can collect statistics on and analyze data in your online application.

    Example: The data of a consumer service platform is stored in a MySQL database. You want to collect statistics on and analyze the transaction data generated by the platform, such as the number of transactions and the transaction amounts. In this case, you can synchronize the data from the MySQL database to an AnalyticDB for MySQL data warehouse for statistics and analytics, and then synchronize the analyzed data back to the platform for online queries.

    • To synchronize incremental consumption data from the MySQL database to the AnalyticDB for MySQL data warehouse, execute the following Spark SQL statement:
      INSERT INTO adb_dw.orders
      SELECT * FROM mysql_db.orders
      WHERE dt > ${bizdate} AND dt <= ${tomorrow}
    • To synchronize the processed data from the AnalyticDB for MySQL data warehouse to the MySQL database, execute the following Spark SQL statement:
      INSERT INTO mysql_db.orders_month
      SELECT * FROM adb_dw.orders_month
      WHERE dt = ${bizdate}
  • Cross-database data processing: You can join and process data from multiple databases and write the results to an online database. Then, you can query the processed data in your online application.

    Example: The business data of an e-commerce enterprise is stored in an online MySQL database, and the employee data is stored in an Oracle database of the human resource system. If the enterprise wants to collect sales statistics by department, joint queries on department, employee, and business data are required. The following Spark SQL statements join the sales table in the mysql_db online database with the users table in the oracle_db database, group the joined data by department to calculate the number of transactions and the transaction amount for each department, and then write the results to the mysql_db online database:

    INSERT INTO mysql_db.dept_sales
    SELECT t2.dept_name, t1.trade_date, COUNT(*) cnt, SUM(t1.amt) amt
    FROM mysql_db.sales t1 JOIN oracle_db.users t2 ON t1.emp_id = t2.id
    WHERE t1.trade_date = ${bizdate}
    GROUP BY t2.dept_name, t1.trade_date
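
    For reference, the target table of this INSERT statement might be defined as follows. This is a hypothetical sketch: the table name is taken from the statement above, but the column types are assumptions that you must adjust to your actual data.

    -- Hypothetical DDL for the mysql_db.dept_sales target table
    CREATE TABLE mysql_db.dept_sales (
      dept_name STRING,
      trade_date STRING,
      cnt BIGINT,
      amt DECIMAL(16, 2)
    );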

Benefits

  • Cross-database data processing: Cross-database Spark SQL nodes allow you to execute SQL statements to process data from different databases. DMS provides a comprehensive data ecosystem and can support additional data sources through extension plug-ins.
  • Efficient data processing: DMS can process more than 100,000 data records at a time at high speed.
  • Support for Spark SQL syntax: Cross-database Spark SQL nodes are deployed based on Spark 3.1.2 and support all syntax and native functions of Spark 3.1.2. The native functions include aggregate functions, window functions, array functions, map functions, date and timestamp functions, and JSON functions. For more information, see Spark SQL documentation. For an example of these functions in use, see the sketch after this list.
  • Compatibility with standard SQL: DMS allows you to execute standard SQL statements to synchronize and process data across databases.
  • Serverless technologies: Cross-database Spark SQL nodes provide a serverless compute service that uses the Spark engine to process data. You do not need to purchase computing resources in advance or maintain resources, and you can use these nodes with zero O&M or upgrade costs.
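
The following statement is a minimal sketch of a window function and a date function in use. The adb_dw.orders source table and the adb_dw.order_rank target table are hypothetical names used for illustration; only the function syntax comes from Spark 3.1.2:

  -- Rank each day's orders by amount and derive the order month
  -- (table and column names are assumptions for illustration)
  INSERT INTO adb_dw.order_rank
  SELECT order_id,
         amt,
         RANK() OVER (PARTITION BY dt ORDER BY amt DESC) AS day_rank,
         DATE_FORMAT(dt, 'yyyy-MM') AS order_month
  FROM adb_dw.orders
  WHERE dt = ${bizdate};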

Features

  • Cross-database Spark SQL nodes can synchronize and process millions of data records.
    Note
    • Cross-database Spark SQL nodes run based on the Spark engine. We recommend that you do not process more than 2 million data records in a single Spark SQL node at a time. Otherwise, the efficiency of the node may decrease.
    • During peak hours, the real-time computing performance of cross-database Spark SQL nodes may be affected due to limited computing resources.
  • Cross-database Spark SQL nodes support the syntax and all native functions of Spark 3.1.2. For more information, see Spark SQL documentation.
  • Cross-database Spark SQL nodes support the following SQL statements: CREATE TABLE, CREATE TABLE ... SELECT, DROP TABLE, INSERT, INSERT ... SELECT, ALTER TABLE, TRUNCATE, SET, ANALYZE, and MSCK REPAIR. For an example that exercises several of these statements, see the sketch after this list.
    Note
    • The TRUNCATE statement applies only to Object Storage Service (OSS) tables.
    • Cross-database Spark SQL nodes do not support the following SQL statements: standalone SELECT, DELETE, UPDATE, CREATE DATABASE, and DROP DATABASE. SELECT is supported only as part of a statement such as INSERT ... SELECT.
  • Cross-database Spark SQL nodes support data files that are stored in the CSV, JSON, Parquet, or ORC format in OSS buckets.
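
The following statements are a minimal sketch that exercises several of the supported operations. It assumes an OSS bucket referenced with the default alias oss, a hypothetical access_log table stored as CSV files, and a hypothetical adb_dw warehouse database; adjust the names and columns to your own data:

  -- Create a partitioned table over CSV files in the referenced OSS path
  CREATE TABLE oss.access_log (
    ip STRING,
    url STRING,
    dt STRING
  ) USING csv PARTITIONED BY (dt);

  -- Register partition directories that already exist in OSS
  MSCK REPAIR TABLE oss.access_log;

  -- Copy one day of data into a warehouse table (INSERT ... SELECT)
  INSERT INTO adb_dw.access_log
  SELECT * FROM oss.access_log WHERE dt = ${bizdate};

  -- TRUNCATE applies only to OSS tables
  TRUNCATE TABLE oss.access_log;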

Procedure

  1. Go to the DMS console V5.0.
  2. In the top navigation bar, click DTS. In the left-side navigation pane, choose Data Development > Task Orchestration.
  3. Click the name of the task flow that you want to manage to go to the details page of the task flow.
    Note For more information about how to create a task flow, see Overview.
  4. In the Task Type list on the left side of the canvas, drag the Cross-Database Spark SQL node to the canvas.
  5. Double-click the Cross-Database Spark SQL node.
  6. Optional: On the configuration tab of the cross-database Spark SQL node, click the Variable Setting tab in the right-side navigation pane to configure the variables that you want to reference in SQL statements. You can click the Info icon in the upper-right corner of the Variable Setting panel to view the tips about variable configurations. For an example of referencing a variable in an SQL statement, see the sketch after the following list.
    • Click the Node Variable tab and configure node variables. For more information, see Configure time variables.
    • Click the Task Flow Variable tab and configure task flow variables. For more information, see Configure time variables.
    • Click the Input Variables tab and view upstream variables, task status variables, and system variables.
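    For example, if you configure a node variable named bizdate, you can reference it as ${bizdate} in your SQL statements, as in the following sketch. The table names are hypothetical:
      -- ${bizdate} is replaced with the variable value before the statement runs
      INSERT INTO adb_dw.orders
      SELECT * FROM mysql_db.orders WHERE dt = ${bizdate};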
  7. Optional: In the OSS Reference section, click Add OSS Reference. Set the parameters about the OSS bucket that you want to reference in Spark SQL statements and click Save. The following list describes the parameters in the OSS Reference section. For a sketch of how a referenced bucket can be used in SQL statements, see the example after the list.
    • Database: The OSS bucket to be referenced in Spark SQL statements. Select the OSS bucket from the Database drop-down list. You can also enter a keyword to search for the OSS bucket.
      Note For more information about how to register an OSS bucket, see Register an OSS bucket.
    • OSS Directory: The path in which data is stored in the OSS bucket.
      Note
      • If the path that you specify does not exist, the system automatically creates the directory.
      • You can use a variable to specify a path. Example: /path/${foldername}.
    • Alias in Spark SQL: The alias of the OSS bucket to be referenced in Spark SQL statements. Default value: oss.
      Note The alias can be up to 32 characters in length and can contain digits, uppercase letters, lowercase letters, and underscores (_).
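    The following statements are a minimal sketch of writing data to a referenced bucket. The default alias oss, the raw_orders table stored as Parquet files, and the mysql_db.orders source table are all assumptions for illustration:
      -- Create a table over Parquet files in the configured OSS directory
      CREATE TABLE oss.raw_orders (
        order_id BIGINT,
        amt DECIMAL(16, 2),
        dt STRING
      ) USING parquet;

      -- Archive one day of online data to OSS
      INSERT INTO oss.raw_orders
      SELECT order_id, amt, dt FROM mysql_db.orders WHERE dt = ${bizdate};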
  8. Optional: In the Database Reference section, click Add Database Reference. Set the parameters about the database that you want to reference in Spark SQL statements and click Save. The following list describes the parameters in the Database Reference section.
    • Database Type: The type of the database.
      Note To view the type of a database, go to the Home tab of DMS. In the left-side navigation pane, move the pointer over the database instance to which the database belongs.
    • Database: The database to be referenced in Spark SQL statements. Select the database from the Database drop-down list. You can also enter a keyword to search for the database.
      Note Make sure that you select a database that is not managed in Security Collaboration mode. If the information about the account that you use to log on to the selected database changes, such as the password or permissions, the task may fail. If the task fails, you must log on to the database again and save the node configurations.
    • Alias in Spark SQL: The alias of the database to be referenced in Spark SQL statements. The default value is the name of the database.
      Note The alias can be up to 32 characters in length and can contain digits, uppercase letters, lowercase letters, and underscores (_).
    To add another database, click the Add icon to the right of the database that you have added.
  9. Enter Spark SQL statements in the SQL editor and perform a test run.
    1. Enter Spark SQL statements in the SQL editor and click Save.
      For example, assume that you want to read data from the ex_customer table in the qntext database and write the data to the test_table table in the qn_rds database. The alias of the qntext database is qn, and the alias of the qn_rds database is rds. In this case, enter the following SQL statement:
      INSERT INTO rds.test_table SELECT * FROM qn.ex_customer;
      Note
      • You can reference variables in the ${Variable name} format in SQL statements.
      • To preview the SQL statements that you enter, click Preview.
      • To check the validity of the SQL statements, click Check.
    2. Click Try Run.
      • If status SUCCEEDED appears in the last line of the logs, the test run is successful.
      • If status FAILED appears in the last line of the logs, the test run fails. In this case, view the node on which the failure occurs and the reason for the failure in the logs. Then, modify the configuration of the node and try again.
