AnalyticDB for MySQL Data Lakehouse Edition (V3.0) allows you to submit XIHE bulk synchronous parallel (BSP) SQL jobs by using the SQL editor or Java Database Connectivity (JDBC). This topic describes the applicable scenarios, submission methods, and configuration parameters of XIHE BSP SQL development, and provides answers to some frequently asked questions.
Prerequisites
A job resource group is created in the AnalyticDB for MySQL Data Lakehouse Edition (V3.0) cluster. For more information, see Create a resource group.
A database account is created for the AnalyticDB for MySQL Data Lakehouse Edition (V3.0) cluster.
If you use an Alibaba Cloud account, you must create a privileged account. For more information, see the "Create a privileged account" section of the Create a database account topic.
If you use a Resource Access Management (RAM) user, you must create both a privileged account and a standard account and associate the standard account with the RAM user. For more information, see Create a database account and Associate or disassociate a database account with or from a RAM user.
Scenarios
XIHE BSP SQL jobs are executed by using the XIHE BSP engine. XIHE BSP SQL development is suitable for extract-transform-load (ETL) jobs, large queries, and burst low-priority queries. For information about the XIHE BSP engine, see the "Compute engines" section of the Functions and features topic.
ETL jobs
The following figure shows a typical ETL process.
In most cases, operations such as data cleansing and data transforming that are performed during an ETL process from a data source to the Application Data Service (ADS) layer involve large amounts of data and requires an extended period of time to complete. For ETL jobs, response speeds are not a major concern, but the system must provide capabilities like automatic retry to ensure reliability. The XIHE BSP engine is an ideal choice due to its high throughput, reliability, and cost-efficiency.
After an ETL job is complete, data applications can query data from the ADS layer, which may require a second-level or millisecond-level response time. In this case, the faster XIHE massively parallel processing (MPP) engine is more suitable.
Large queries
In XIHE MPP mode, large queries may encounter out-of-memory (OOM) errors or query exceptions. Scaling up resources can resolve these issues but is not cost-effective. In this case, you can use the XIHE BSP engine to perform large queries in a job resource group. It provides higher reliability by caching intermediate results of queries in disks, and enables better cost-efficiency with on-demand resource requests and billing.
Burst low-priority queries
Low-priority queries do not need to be returned at a fast speed. However, when low-priority queries are abruptly submitted in large numbers, the system may not have enough resources for the queries. As a result, the execution of other queries may be affected. To resolve these issues, you can use the XIHE BSP engine to perform these queries in a job resource group.
Limits
You cannot write Hudi tables in XIHE BSP mode.
You cannot read and write Delta tables in XIHE BSP mode.
Develop a XIHE BSP job
You can use one of the following methods to develop a XIHE BSP job:
Use the SQL editor to submit a XIHE BSP job
Use JDBC or a MySQL client to synchronously submit a XIHE BSP job
Use JDBC or a MySQL client to asynchronously submit a XIHE BSP job
Configure a XIHE BSP job
You can configure parameters for XIHE BSP jobs, such as the amount of resources, timeout period, and priority.
Configuration methods
The BSP job configuration can take effect for a single job, all jobs within a single job resource group, or all jobs within a cluster.
Effective for a single job
To make the BSP job configuration effective for a single job, add the /*+ resource_group=<resource_group_name>,<config_name>*/
hint to an SQL statement.
In the hint, resource_group_name
specifies the name of the resource group. For more information about config_name
, see the Parameter column in the "Configuration parameters" section of this topic.
Example: Configure up to 20 AnalyticDB compute units (ACUs) of available resources for a job that is run in the job resource group bsptest.
/*+ resource_group=bsptest,elastic_job_max_acu=20*/SELECT count(*) from test_db.ods_hudi;
Effective within a job resource group
To make the BSP job configuration effective for all jobs within a job resource group, execute the SET adb_config <resource_group_name>.<config_name>
statement.
In the hint, resource_group_name
specifies the name of the resource group. For more information about config_name
, see the Parameter column in the "Configuration parameters" section of this topic.
Example: Configure up to 20 ACUs of available resources for each job that is run in the job resource group bsptest.
SET adb_config bsptest.elastic_job_max_acu=20;
Check whether the configuration takes effect
To check whether the configuration takes effect for all jobs within the resource group, execute the SHOW ADB_CONFIG KEY=<resource_group_name>.<config_name>
statement.
Effective within a cluster
To make the BSP job configuration effective for all jobs within a cluster, execute the SET adb_config <config_name>
statement. For more information about config_name
, see the Parameter column in the "Configuration parameters" section of this topic.
Example: Configure up to 20 ACUs of available resources for each job that is run within a cluster.
SET adb_config elastic_job_max_acu=20;
Check whether the configuration takes effect
To check whether the configuration takes effect for all jobs within the cluster, execute the SHOW ADB_CONFIG KEY=<config_name>
statement.
Configuration parameters
The following table describes the parameters that you can configure for XIHE BSP jobs.
Category | Parameter | Description | Default value |
Resources |
| The maximum number of ACUs that can be used per XIHE BSP job for AppMaster nodes and compute nodes. The value of this parameter cannot exceed the number of ACUs that are specified for the maximum computing resources of the resource group. Note AppMaster nodes are responsible for parsing, scheduling, and executing queries. | 9 |
Timeout period |
| The timeout period of the BSP job. Unit: milliseconds. If the BSP job is executed for a period of time that is longer than the value of this parameter, the job is automatically canceled. | 7200000 |
Priority |
| The priority of the BSP job. Valid values: HIGH, NORMAL, LOW, and LOWEST. For information about priority queues, see Priority queues of job resource groups. | NORMAL |
FAQ
How do I check the status of a BSP job?
If you submit a BSP job by using the SQL editor, you can choose
and click the Execution Records tab to view the status of the job.If you submit a BSP job without using the SQL editor, you can execute the following statement to query the status of the job from the
information_schema.kepler_meta_elastic_job_list
table:SELECT status FROM information_schema.kepler_meta_elastic_job_list WHERE process_id='<job_id>';
NoteThe
information_schema.kepler_meta_elastic_job_list
table stores up to 1,000 BSP jobs that are submitted within the last 30 days. You can perform further statistics and analysis such as GROUP BY aggregation on the table. The following sample statement shows how to query the number of BSP jobs in each state:SELECT status,count(*) FROM information_schema.kepler_meta_elastic_job_list GROUP BY status;
How do I determine whether to submit a BSP job synchronously or asynchronously?
The only difference between synchronous submission and asynchronous submission is whether the client needs to wait for the query execution to complete.
Asynchronous submission has the following limits:
A result set can contain up to 10,000 rows of data.
The system can retain up to 1,000 result sets, including the download URLs of CSV files, for up to 30 days.
We recommend that you submit BSP jobs asynchronously for queries that consume large amounts of computing power, require an extended period of time to complete, but return small result sets. Examples: INSERT INTO SELECT
, INSERT OVERWRITE SELECT
, and CREATE TABLE AS SELECT
.