Data Management (DMS) allows you to use the task orchestration feature to configure cross-database Spark SQL nodes. You can write Spark SQL statements in cross-database Spark SQL nodes to synchronize and process data across databases in complex business scenarios.
Prerequisites
The databases that you want to use are of one of the following types:
- MySQL: ApsaraDB RDS for MySQL, PolarDB for MySQL, MyBase for MySQL, PolarDB-X, AnalyticDB for MySQL, and MySQL databases that are not on Alibaba Cloud
- SQL Server: ApsaraDB RDS for SQL Server, MyBase for SQL Server, and SQL Server databases that are not on Alibaba Cloud
- PostgreSQL: ApsaraDB RDS for PostgreSQL, PolarDB for PostgreSQL, MyBase for PostgreSQL, AnalyticDB for PostgreSQL, and PostgreSQL databases that are not on Alibaba Cloud
- Oracle
- Db2
- MaxCompute
- Hologres
- OSS
Limits
Spark SQL nodes run on the Spark engine. We recommend that you process no more than 2 million data records in a single Spark SQL node at a time. Otherwise, the execution efficiency of the node may decrease.
Scenarios
Cross-database Spark SQL nodes are used to synchronize and process data across databases.
- Cross-database data synchronization:
- Synchronize data from an online database to a data warehouse for data processing: In most cases, a large amount of data is generated in an online database. If you want to process and analyze the data in the online database, you can synchronize the data to a data warehouse that is designed for data processing and analytics. Then, you can analyze the data in the data warehouse.
- Synchronize the data from the data warehouse to the online database for data queries: After the data is processed and analyzed in the data warehouse, you can synchronize the data back to the online database. This way, you can collect statistics on and analyze data in your online application.
Example: The data of a consumer service platform is stored in a MySQL database. You want to collect statistics on and analyze transaction data, such as the number of transactions and transaction amounts, that is generated by the platform. In this case, you can synchronize the data from the MySQL database to an AnalyticDB for MySQL data warehouse for data statistics and analytics, and then synchronize the analyzed data back to the platform for online queries.
- To synchronize incremental consumption data from the MySQL database to the AnalyticDB for MySQL data warehouse, execute the following Spark SQL statement:
INSERT INTO adb_dw.orders SELECT * FROM mysql_db.orders WHERE dt>${bizdate} AND dt<=${tomorrow}
- To synchronize the processed data from the AnalyticDB for MySQL data warehouse to the MySQL database, execute the following Spark SQL statement:
INSERT INTO mysql_db.orders_month SELECT * FROM adb_dw.orders_month WHERE dt=${bizdate}
- Cross-database data processing: You can process data from multiple databases and write the results to an online database. Then, you can query the processed data in your online application.
Example: The business data of an e-commerce enterprise is stored in an online MySQL database, and the employee data is stored in an Oracle database of the human resource system. If the enterprise wants to collect sales statistics by department, join queries across the department, employee, and business data are required. The following Spark SQL statement joins the sales table in the mysql_db online database with the users table in the oracle_db database, groups the joined data by department to calculate the number of transactions and the transaction amount, and then writes the processed data to the mysql_db online database:
INSERT INTO mysql_db.dept_sales SELECT t2.dept_name, t1.trade_date, COUNT(*) cnt, SUM(t1.amt) amt FROM mysql_db.sales t1 JOIN oracle_db.users t2 ON t1.emp_id=t2.id WHERE t1.trade_date=${bizdate} GROUP BY t2.dept_name, t1.trade_date
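Before a statement such as the preceding one can run, the mysql_db.dept_sales result table must exist. The following DDL is a minimal sketch of such a table; the column names and types are assumptions inferred from the SELECT list, not part of the original example:

```sql
-- Hypothetical result table for the department sales statistics.
-- Column names and types are inferred from the SELECT list above.
CREATE TABLE mysql_db.dept_sales (
  dept_name  VARCHAR(64)    COMMENT 'Department name from oracle_db.users',
  trade_date DATE           COMMENT 'Transaction date',
  cnt        BIGINT         COMMENT 'Number of transactions',
  amt        DECIMAL(18, 2) COMMENT 'Total transaction amount'
);
```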
Benefits
- Cross-database data processing: Cross-database Spark SQL nodes allow you to execute SQL statements to process data from different databases. DMS provides a comprehensive data ecosystem and supports additional data sources through extension plug-ins.
- Efficient data processing: DMS can process more than 100,000 data records at a time with high performance.
- Support for Spark SQL syntax: Cross-database Spark SQL nodes are deployed based on Spark 3.1.2 and support all syntax and native functions of Spark 3.1.2. The native functions include aggregate functions, window functions, array functions, map functions, date and timestamp functions, and JSON functions. For more information, see Spark SQL Guide.
- Compatibility with standard SQL: DMS allows you to execute standard SQL statements to synchronize and process data across databases.
- Serverless technologies: Cross-database Spark SQL nodes provide a serverless compute service that uses the Spark engine to process data. You do not need to purchase computing resources in advance or maintain them, and no O&M or upgrade costs are incurred.
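As an illustration of the native Spark 3.1.2 functions mentioned above, the following sketch combines a date function (date_format), an aggregate function (SUM), and a window function (RANK). The table and column names (adb_dw.orders with dt and amt columns, and the adb_dw.daily_rank target table) are assumptions for illustration only; the result is written with INSERT because standalone SELECT statements are not supported in these nodes:

```sql
-- Daily sales totals plus a rank of days within each month.
-- date_format, SUM, and RANK() OVER (...) are native Spark 3.1.2 functions.
INSERT INTO adb_dw.daily_rank
SELECT
  date_format(dt, 'yyyy-MM') AS month,
  dt,
  SUM(amt) AS day_amt,
  RANK() OVER (
    PARTITION BY date_format(dt, 'yyyy-MM')
    ORDER BY SUM(amt) DESC
  ) AS day_rank
FROM adb_dw.orders
GROUP BY dt;
```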
Features
- Cross-database Spark SQL nodes can synchronize and process millions of data records.
- Cross-database Spark SQL nodes support all syntax and functions of Spark 3.1.2. For more information, see Spark SQL Guide.
- Cross-database Spark SQL nodes support the following SQL statements: CREATE TABLE, CREATE TABLE ... AS SELECT, DROP TABLE, INSERT, INSERT ... SELECT, ALTER TABLE, TRUNCATE, SET, ANALYZE, and MSCK REPAIR.
- Cross-database Spark SQL nodes support data files that are stored in the CSV, JSON, Parquet, or ORC format in Object Storage Service (OSS) buckets.
- Cross-database Spark SQL nodes do not support standalone SELECT statements. A SELECT clause can be used only as part of another statement, such as INSERT ... SELECT or CREATE TABLE ... AS SELECT.
- Cross-database Spark SQL nodes do not support the DELETE or UPDATE statement.
- Cross-database Spark SQL nodes do not support the CREATE DATABASE or DROP DATABASE statement.
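Putting several of the supported statements together, the following sketch registers CSV files in an OSS bucket as a table and then loads them into a warehouse table with INSERT ... SELECT. The bucket path, database and table names, and CSV options are assumptions for illustration; check the OSS data source configuration in your DMS console for the exact location syntax:

```sql
-- Hypothetical table over CSV files in an OSS bucket (CREATE TABLE is supported;
-- CSV, JSON, Parquet, and ORC formats are supported for OSS data files).
CREATE TABLE oss_db.raw_orders (
  order_id BIGINT,
  amt      DECIMAL(18, 2),
  dt       DATE
)
USING csv
OPTIONS (header 'true')
LOCATION 'oss://my-bucket/orders/';  -- assumed bucket path

-- Load the OSS data into the warehouse (INSERT ... SELECT is supported).
INSERT INTO adb_dw.orders SELECT order_id, amt, dt FROM oss_db.raw_orders;
```

Because DELETE and UPDATE are not supported, corrections to already-loaded data are typically made by recomputing and rewriting the affected partition or table with INSERT ... SELECT rather than by modifying rows in place.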