Data Management (DMS) allows you to use the task orchestration feature to configure cross-database Spark SQL nodes. You can write Spark SQL statements in cross-database Spark SQL nodes to synchronize and process data across databases in complex business scenarios.
Prerequisites
- One of the following types of databases is used:
- MySQL: ApsaraDB RDS for MySQL, PolarDB for MySQL, MyBase for MySQL, PolarDB for Xscale, AnalyticDB for MySQL, and third-party MySQL databases
- SQL Server: ApsaraDB RDS for SQL Server, MyBase for SQL Server, and third-party SQL Server databases
- PostgreSQL: ApsaraDB RDS for PostgreSQL, PolarDB for PostgreSQL, MyBase for PostgreSQL, AnalyticDB for PostgreSQL, and third-party PostgreSQL databases
- Oracle
- Db2
- MaxCompute
- Hologres
- OSS
- If you need to reference an OSS object in Spark SQL statements, we recommend that you obtain the path of the OSS object or register an OSS bucket in advance. For more information, see Register an OSS bucket.
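For example, an OSS object that you reference by path can typically be exposed to Spark SQL as a table over the files in that path. The following statement is a minimal sketch: the bucket name my-bucket, the directory, and the column definitions are placeholders, and the exact path format depends on how you registered the bucket.
-- Example only: bucket name, directory, and columns are placeholders
CREATE TABLE oss_orders (order_id BIGINT, amt DOUBLE, dt STRING) USING CSV OPTIONS (header 'true') LOCATION 'oss://my-bucket/orders/'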
Usage notes
- Cross-database Spark SQL nodes run on the Spark engine. We recommend that you do not process more than 2 million data records in a single Spark SQL node at a time. Otherwise, node performance may degrade.
- During peak hours, the real-time computing performance of cross-database Spark SQL nodes may be affected due to limited computing resources.
- If a cross-database Spark SQL task fails due to insufficient computing resources (CPU and memory), you need to go to the O&M page in Task Orchestration and manually rerun the failed task.
Common scenarios
Cross-database Spark SQL nodes are used to synchronize and process data across databases.
- Cross-database data synchronization:
- Synchronize data from an online database to a data warehouse for data processing. In most cases, a large amount of data is generated in an online database. If you want to process and analyze the data in the online database, you can synchronize the data to a data warehouse that is designed for data processing and analytics. Then, you can analyze the data in the data warehouse.
- Synchronize data from a data warehouse to an online database for data queries. After data is processed and analyzed in a data warehouse, you can synchronize the data back to the online database. This way, you can collect and analyze data in your online application.
Example: The data of a consumer service platform is stored in a MySQL database. You want to collect and analyze the transaction data generated by the platform, such as the number of transactions and the transaction amounts. In this case, you can synchronize the data from the MySQL database to an AnalyticDB for PostgreSQL instance for statistics and analytics, and then synchronize the analyzed data back to the platform for online queries.
- To synchronize the incremental consumption data from the MySQL database to the AnalyticDB for PostgreSQL instance, execute the following Spark SQL statement:
INSERT INTO adb_dw.orders SELECT * FROM mysql_db.orders WHERE dt>${bizdate} AND dt<=${tomorrow}
- To synchronize the processed data from the AnalyticDB for PostgreSQL instance to the MySQL database, execute the following Spark SQL statement:
INSERT INTO mysql_db.orders_month SELECT * FROM adb_dw.orders_month WHERE dt=${bizdate}
- Cross-database data processing:
You can write data in multiple databases to an online database. Then, you can query the data in your online application.
Example: The business data of an e-commerce enterprise is stored in an online MySQL database, and the employee data is stored in an Oracle database of the human resources system. If the enterprise wants to collect sales statistics by department, JOIN queries on the department, employee, and business data are required. The following Spark SQL statement joins the sales table in the mysql_db online database with the users table in the oracle_db database, groups the joined data by department to calculate the number of transactions and the transaction amount, and then writes the processed data to the mysql_db online database:
INSERT INTO mysql_db.dept_sales SELECT dept_name, trade_date, COUNT(*) cnt, SUM(amt) amt FROM mysql_db.sales t1 JOIN oracle_db.users t2 ON t1.emp_id=t2.id WHERE t1.trade_date=${bizdate} GROUP BY t2.dept_name, t1.trade_date
Features
- Cross-database data processing: Cross-database Spark SQL nodes allow you to execute SQL statements to process data from different databases. DMS provides a comprehensive data ecosystem and supports various data sources by using extensions.
- Efficient big data processing: DMS can process large volumes of data at high speed, such as more than 100,000 data records at a time.
- Support for Spark SQL syntax: Cross-database Spark SQL nodes are deployed based on Spark V3.1.2 and support all syntax and native functions of Spark V3.1.2. The native functions include aggregate functions, window functions, array functions, map functions, date and timestamp functions, and JSON functions. For more information, see Spark SQL Guide.
- Compatibility with standard SQL: DMS allows you to execute standard SQL statements to synchronize and process data across databases.
- Serverless technologies: Cross-database Spark SQL nodes provide a serverless compute service that uses the Spark engine to process data. You do not need to purchase computing resources in advance or maintain resources. You can use these nodes at zero O&M or upgrade costs.
- Cross-database Spark SQL nodes support the following SQL statements: CREATE TABLE, CREATE SELECT, DROP TABLE, INSERT, INSERT SELECT, ALTER TABLE, TRUNCATE, SET, ANALYZE, and MSCK REPAIR. Note that the TRUNCATE statement applies only to Object Storage Service (OSS) tables.
- Cross-database Spark SQL nodes do not support the following SQL statements: SELECT, DELETE, UPDATE, CREATE DATABASE, and DROP DATABASE.
- Cross-database Spark SQL nodes support data files that are stored in the CSV, JSON, Parquet, or ORC format in OSS buckets.
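To illustrate the supported statements and file formats, the following sketch creates a Parquet table in a referenced OSS bucket and populates it with an INSERT SELECT statement. The aliases oss_dw and mysql_db and all table and column names are assumptions for demonstration purposes and must be replaced with the references that you configure in your node.
-- Example only: oss_dw and mysql_db are hypothetical aliases configured in the node
CREATE TABLE oss_dw.orders_archive (order_id BIGINT, amt DOUBLE, dt STRING) USING parquet
INSERT INTO oss_dw.orders_archive SELECT order_id, amt, dt FROM mysql_db.orders WHERE dt=${bizdate}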
Procedure
- Log on to the DMS console V5.0.
- In the top navigation bar, click DTS. In the left-side navigation pane, choose Data Development > Task Orchestration.
- Click the name of the task flow that you want to manage to go to the details page of the task flow. Note: For more information about how to create a task flow, see Overview.
- In the Task Type list on the left side of the canvas, drag the Cross-Database Spark SQL node to the canvas.
- Double-click the Cross-Database Spark SQL node.
- Optional: On the configuration tab of the cross-database Spark SQL node, click the Variable Setting tab in the right-side navigation pane to configure the variables that you want to reference in SQL statements. You can click the icon in the upper-right corner of the Variable Setting panel to view tips about variable configurations.
- Click the Node Variable tab and configure node variables. For more information, see Configure time variables.
- Click the Task Flow Variable tab and configure task flow variables. For more information, see Configure time variables.
- Click the Input Variables tab and view upstream variables, task status variables, and system variables.
- Optional: In the OSS Reference section, click Add OSS Reference. Set the parameters for the OSS bucket that you want to reference in Spark SQL statements and click Save. The following parameters are available:
  - Database (required): The OSS bucket to be referenced in Spark SQL statements. Select the OSS bucket from the Database drop-down list. You can also enter a keyword to search for the OSS bucket. Important: Make sure that you select a database that is not managed in Security Collaboration mode. If the information (such as the password and permissions) about the account that you use to log on to the selected database changes, the task may fail. If the task fails, you must log on to the database again and save the node configurations.
  - OSS Directory (required): The path in which data is stored in the OSS bucket. If the path that you specify does not exist, it is automatically created. You can use a variable to specify the path. Example: /path/${foldername}.
  - Alias in Spark SQL (required): The alias of the OSS bucket to be referenced in Spark SQL statements. Default value: oss. The alias can be up to 32 characters in length and can contain digits, uppercase letters, lowercase letters, and underscores (_).
- Optional: In the Database Reference section, click Add Database Reference. Set the parameters for the database that you want to reference in Spark SQL statements and click Save. To add more databases, click the icon to the right of an added database. The following parameters are available:
  - Database Type (required): The type of the database to be referenced in Spark SQL statements. To view the type of a database, go to the SQL window and, in the left-side navigation pane, move the pointer over the instance to which the database belongs.
  - Database (required): The database to be referenced in Spark SQL statements. Select the database from the Database drop-down list. You can also enter a keyword to search for the database. Important: Make sure that you select a database that is not managed in Security Collaboration mode. If the information (such as the password and permissions) about the account that you use to log on to the selected database changes, the task may fail. If the task fails, you must log on to the database again and save the node configurations.
  - Alias in Spark SQL (required): The alias of the database to be referenced in Spark SQL statements. The default value is the name of the database. The alias can be up to 32 characters in length and can contain digits, uppercase letters, lowercase letters, and underscores (_).
- Enter Spark SQL statements in the SQL editor and perform a test run.
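For example, if you registered an OSS bucket with the alias oss and added a MySQL database with the alias mysql_db in the preceding steps, the statement that you enter in the SQL editor might look like the following sketch. The table and column names and the ${bizdate} variable are illustrative and must match your own references and variable settings.
-- Example only: oss and mysql_db are the aliases configured in the preceding steps
CREATE TABLE oss.daily_orders USING parquet AS SELECT order_id, amt, dt FROM mysql_db.orders WHERE dt=${bizdate}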