This topic describes common issues and solutions for writing and querying data in AnalyticDB for MySQL.
If a product edition is not specified in an FAQ, the question applies only to the AnalyticDB for MySQL Data Warehouse Edition.
FAQ overview
Do Enterprise Edition, Basic Edition, and Data Lakehouse Edition clusters support querying data from Hudi tables using JDBC?
Can I read data from Hudi tables in OSS using an Enterprise Edition, Basic Edition, or Data Lakehouse Edition cluster?
Does an Enterprise Edition, Basic Edition, or Data Lakehouse Edition cluster support automatic switchover between XIHE Massively Parallel Processing (MPP) jobs and XIHE Bulk Synchronous Parallel (BSP) jobs?
How do I choose between XIHE MPP and XIHE BSP to run jobs in an Enterprise Edition, Basic Edition, or Data Lakehouse Edition cluster?
How do I view the status of a XIHE BSP job in an Enterprise Edition, Basic Edition, or Data Lakehouse Edition cluster?
How can I implement resource isolation to reduce interference between SQL tasks?
How do I handle an excessive number of IN conditions?
How do I resolve the "Query exceeded maximum time limit of 1800000.00ms" error when I query data?
How do I resolve the "Out of Memory Pool size pre cal. available 0 require 3" error when I query data?
How do I resolve the "STORAGE_INDEX_ERROR, msg:Index does not exist" error when I query data?
How do I resolve the "multi-statement be found" error when I use the multi-statement feature to run multiple SQL statements consecutively?
How do I troubleshoot truncated time values in query results?
How do I resolve errors with the AES_ENCRYPT built-in function?
Why do my query results fluctuate?
Why does an ORDER BY query on a single table take a long time?
Why is the sum of table scan rows in the execution plan not equal to the total number of scanned rows in the query?
Why does using INSERT OVERWRITE on a table without a primary key result in duplicate data?
Why do I get a "Column 'XXX' not in GROUP BY clause" error when I run SELECT * FROM TABLE GROUP BY KEY?
Why is the data in the original table not overwritten after I run an INSERT OVERWRITE SELECT statement?
Is there a limit on the number of values in an IN operator when query results are returned in JSON format?
Can I use GZIP-compressed CSV files in OSS as an external data source?
Is INSERT ON DUPLICATE KEY supported?
Is Join supported in UPDATE statements?
Can I set variables in SQL?
Can I use the Logstash plugin to bulk insert data with an INSERT ON DUPLICATE KEY UPDATE statement?
What are the prerequisites for loading a built-in dataset?
How can I tell if a built-in dataset is loaded successfully?
What should I do if loading a built-in dataset fails or takes a long time?
How can a standard database account use a built-in dataset?
How do I test a built-in dataset after it is loaded?
Do Enterprise Edition, Basic Edition, and Data Lakehouse Edition clusters support querying data from Hudi tables using JDBC?
Yes. After you create a Hudi table in an Enterprise Edition, Basic Edition, or Data Lakehouse Edition cluster, you can directly query data from the Hudi table using JDBC.
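For example, the following is a minimal query sketch; the hudi_db database and hudi_orders table names are hypothetical:
-- Query a Hudi table through the cluster's JDBC endpoint like any internal table.
SELECT * FROM hudi_db.hudi_orders LIMIT 10;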
Can I read data from Hudi tables in OSS using an Enterprise Edition, Basic Edition, or Data Lakehouse Edition cluster?
Yes. For more information about how to read data from Hudi tables in OSS using an external table, see Import data using an external table in a Data Lakehouse Edition cluster.
Does an Enterprise Edition, Basic Edition, or Data Lakehouse Edition cluster support automatic switchover between XIHE Massively Parallel Processing (MPP) jobs and XIHE Bulk Synchronous Parallel (BSP) jobs?
No. When you submit a job, you must manually specify whether to submit it to an Interactive resource group or a Job resource group. This determines whether the job runs as a XIHE MPP job or a XIHE BSP job.
How do I choose between XIHE MPP and XIHE BSP to run jobs in an Enterprise Edition, Basic Edition, or Data Lakehouse Edition cluster?
By default, XIHE BSP submits jobs asynchronously. The difference between synchronous and asynchronous submission is whether the client needs to wait for the query to finish running.
Asynchronous submission has the following limits:
A maximum of 10,000 rows can be returned in a result set.
Result sets, including their download links for CSV files, are stored for a maximum of 30 days. A maximum of 1,000 result sets can be stored.
Use asynchronous submission for queries that have a long running time and require large amounts of computation but produce a small result set, such as INSERT INTO SELECT, INSERT OVERWRITE SELECT, and CREATE TABLE AS SELECT.
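For example, the following is a hedged sketch of a typical job that suits XIHE BSP: heavy computation with a small result set. The table names are illustrative.
-- Aggregate a large source table and write the compact result into a target table.
INSERT OVERWRITE INTO target_table
SELECT customer_id, count(*) AS order_cnt
FROM source_table
GROUP BY customer_id;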
How do I view the status of a XIHE BSP job in an Enterprise Edition, Basic Edition, or Data Lakehouse Edition cluster?
If the XIHE BSP job was submitted from the Job Editor on the SQL Development page, you can view its status on the Execution Records tab.
If the XIHE BSP job was not submitted from the Job Editor, you can view its status by querying a built-in memory table. To do this, run the following statement:
SELECT status FROM information_schema.kepler_meta_elastic_job_list WHERE process_id='<job_id>';
To query the status statistics of XIHE BSP jobs, run the following statement:
SELECT status,count(*) FROM information_schema.kepler_meta_elastic_job_list GROUP BY status;
How can I implement resource isolation to reduce interference between SQL tasks?
Data Warehouse Edition clusters in elastic mode and Enterprise Edition, Basic Edition, and Data Lakehouse Edition clusters support resource groups. For more information about resource group types, see Resource groups for Data Warehouse Edition and Resource groups for Data Lakehouse Edition. You can create different types of resource groups and submit SQL jobs to the corresponding resource groups to achieve resource isolation.
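As a hedged sketch (Data Warehouse Edition elastic mode syntax; verify the exact parameters against the resource group topics referenced above), you can create a dedicated batch resource group so that long-running jobs do not compete with interactive queries:
-- Create a batch resource group with two nodes (the syntax shown is an assumption; see the linked topics).
CREATE RESOURCE GROUP batch_rg QUERY_TYPE = 'batch', NODE_NUM = 2;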
How do I handle an excessive number of IN conditions?
AnalyticDB for MySQL limits the number of IN conditions. The default limit is 2,000. You can adjust this limit as needed.
The limit cannot be raised above 5,000 IN conditions, because an excessive number of IN conditions degrades query performance.
For example, to set the limit on the number of IN conditions to 3,000, run the following statement:
SET ADB_CONFIG max_in_items_count=3000;
How do I resolve the "Query exceeded maximum time limit of 1800000.00ms" error when I query data?
AnalyticDB for MySQL sets a timeout period for SQL queries. This error occurs because the SQL query exceeded the default timeout of 1,800,000 ms (30 minutes). You can configure the query timeout for a single query or for all queries in a cluster. The following examples show how to configure the query timeout:
To make the timeout effective for a single query, add a hint before the statement:
/*+ QUERY_TIMEOUT=xxx */ SELECT count(*) FROM t;
To make the timeout effective for all queries across the entire cluster, run:
SET ADB_CONFIG QUERY_TIMEOUT=xxx;
For more information, see Common configuration parameters.
How do I resolve the "Out of Memory Pool size pre cal. available 0 require 3" error when I query data?
Cause: A large SQL statement may be running in your AnalyticDB for MySQL cluster. Such statements consume a large amount of memory, which causes an out of memory error when you run the current SQL query.
Solution: Wait a few minutes and run the SQL statement again.
How do I resolve the "STORAGE_INDEX_ERROR, msg:Index does not exist" error when I query data?
Cause: This error is caused by a known bug in some earlier kernel versions of AnalyticDB for MySQL. Under specific conditions, this bug can cause inconsistencies in a table's index metadata. As a result, the index cannot be found during a query, which triggers a STORAGE_INDEX_ERROR exception.
Solution: Upgrade the kernel version of your cluster. This bug is fixed in the new kernel version.
Upgrade duration: A kernel version upgrade typically takes about 30 minutes.
Service interruption: During the upgrade, your cluster experiences a transient disconnection that lasts for a few seconds.
Recommendation: We strongly recommend that you perform the upgrade during a planned maintenance window or off-peak hours. Make sure your application has an automatic reconnection mechanism to handle the transient disconnection smoothly.
How do I resolve the "multi-statement be found" error when I use the multi-statement feature to run multiple SQL statements consecutively?
Only clusters with a kernel version of 3.1.9.3 or later support the multi-statement feature. First, check whether your cluster's kernel version meets this requirement. If the kernel version is earlier than 3.1.9.3, contact technical support to upgrade the version. If the kernel version is 3.1.9.3 or later but the error persists, the multi-statement feature may not be enabled on the client.
For example, when you use a MySQL JDBC client to connect to a cluster, you must not only run the SET ADB_CONFIG ALLOW_MULTI_QUERIES=true; command to manually enable the multi-statement feature, but also set the allowMultiQueries JDBC connection property to true.
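For example (the endpoint and database in the JDBC URL are placeholders):
-- Enable the multi-statement feature on the cluster:
SET ADB_CONFIG ALLOW_MULTI_QUERIES=true;
-- On the client, also set the JDBC connection property, for example:
-- jdbc:mysql://<cluster_endpoint>:3306/<database>?allowMultiQueries=true
-- After both settings take effect, multiple statements can be sent in one request:
SELECT 1; SELECT 2;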
How do I troubleshoot truncated time values in query results?
First, verify the results using a MySQL client. If the MySQL client displays the time values correctly, check whether the other client software you are using has processed the results in a special way.
How do I resolve errors with the AES_ENCRYPT built-in function?
The following statement causes an error.
SELECT CONVERT(AES_DECRYPT(AES_ENCRYPT('ABC123','key_string'),'key_string'),char(10));
The cause is that in the AES_ENCRYPT(varbinary x, varchar y) statement, the data type of x must be varbinary. The following code provides an example of a valid SQL statement:
SELECT CONVERT(AES_DECRYPT(AES_ENCRYPT(CAST('ABC123' AS VARBINARY), 'key_string'), 'key_string'),char(10));
Why do my query results fluctuate?
If you confirm that the data has not been updated, the query results may fluctuate for the following reasons:
LIMIT without sorting: AnalyticDB for MySQL is a distributed database. Queries are run on multiple nodes in a multi-threaded manner. If some threads return enough rows to meet the LIMIT condition, the query stops. Therefore, the result of a query with a LIMIT clause but no sorting is random because the system cannot guarantee that the results are returned from a fixed set of threads.
Aggregate query with grouping: If a field in the SELECT clause does not have an aggregate function and is not included in the GROUP BY clause, a random value is returned for that field.
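As a minimal illustration of the LIMIT case (the orders table and user_id column are hypothetical), sorting on a unique column makes the result deterministic:
-- Non-deterministic: any 10 rows may be returned on each run.
SELECT user_id FROM orders LIMIT 10;
-- Deterministic: sort on a unique column before applying LIMIT.
SELECT user_id FROM orders ORDER BY user_id LIMIT 10;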
If the issue persists, contact technical support.
Why does an ORDER BY query on a single table take a long time?
Cause: The data is not sorted at the storage layer and is scattered across storage. As a result, a large amount of irrelevant data is read, which increases the query time.
Solution: Add a clustered index to the field used for the ORDER BY clause. After you add a clustered index, the data is pre-sorted in the storage layer. When you run an ORDER BY query, only a small amount of data needs to be read, which improves query performance. For more information about how to add a clustered index, see Add a clustered index.
A table can have only one clustered index. If another field already has a clustered index, you must delete it before you can add a clustered index to the ORDER BY field.
Adding a clustered index to a large table increases the time required for BUILD tasks, which in turn affects the CPU utilization of storage nodes.
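The following is a hedged sketch of declaring a clustered key when you create a table, so that data is pre-sorted on the ORDER BY column. The table and column names are illustrative.
CREATE TABLE orders (
  order_id BIGINT,
  gmt_created DATETIME,
  PRIMARY KEY (order_id),
  CLUSTERED KEY idx_gmt_created (gmt_created)
) DISTRIBUTED BY HASH(order_id);
-- An ORDER BY on the clustered column now reads mostly pre-sorted data:
SELECT * FROM orders ORDER BY gmt_created LIMIT 100;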
Why is the sum of table scan rows in the execution plan not equal to the total number of scanned rows in the query?
This issue is usually caused by a replicated table. In AnalyticDB for MySQL, each shard stores a full copy of a replicated table's data. When you query a replicated table, the scanned rows are counted once for each shard, so the sum can exceed the table's actual row count.
Why does using INSERT OVERWRITE on a table without a primary key result in duplicate data?
AnalyticDB for MySQL does not automatically remove duplicates from tables that do not have a primary key.
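A minimal illustration, assuming hypothetical tables in which the source table src contains duplicate id values:
-- With a primary key, duplicate ids from the source collapse into one row on write:
CREATE TABLE t_pk (id BIGINT, val VARCHAR(64), PRIMARY KEY (id)) DISTRIBUTED BY HASH(id);
-- Without a primary key, every source row is kept, duplicates included:
CREATE TABLE t_no_pk (id BIGINT, val VARCHAR(64)) DISTRIBUTED BY HASH(id);
INSERT OVERWRITE INTO t_pk SELECT id, val FROM src;
INSERT OVERWRITE INTO t_no_pk SELECT id, val FROM src;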
Why do I get a "Column 'XXX' not in GROUP BY clause" error when I run SELECT * FROM TABLE GROUP BY KEY?
The SELECT * FROM table GROUP BY key syntax is not supported in grouping queries. You must explicitly list the fields that you want to query and include the non-aggregated fields in the GROUP BY clause. The following code provides an example of a valid SQL statement:
SELECT nation.name FROM nation GROUP BY nation.name;
Why is the data in the original table not overwritten after I run an INSERT OVERWRITE SELECT statement?
Cause: The INSERT OVERWRITE statement overwrites data in the same partition. The new partition overwrites the old one. If the SELECT result is empty, it means there is no new partition. Therefore, the existing data in the original table is not overwritten.
Solution: Modify the INSERT OVERWRITE SELECT statement to ensure that the SELECT result is not empty. This allows the data in the original table to be overwritten.
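For example (the tables and partition value are hypothetical), confirm that the SELECT returns rows before you rely on the overwrite:
-- Check the source first; an empty result leaves the target untouched.
SELECT count(*) FROM src WHERE dt = '20240101';
-- Run the overwrite only when the count above is greater than 0:
INSERT OVERWRITE INTO target SELECT * FROM src WHERE dt = '20240101';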
Is there a limit on the number of values in an IN operator when query results are returned in JSON format?
For AnalyticDB for MySQL clusters with a kernel version of 3.1.4 or earlier, the number of values specified in an IN operator cannot exceed 16. For clusters with a kernel version later than 3.1.4, there is no limit. For information about how to view the kernel version of a cluster, see How do I view the kernel version of a cluster?.
Can I use GZIP-compressed CSV files in OSS as an external data source?
Yes. AnalyticDB for MySQL supports GZIP-compressed CSV files in OSS as an external data source. To use them, add compress_type=gzip to the external table definition. For the OSS external table syntax, see Non-partitioned OSS external tables.
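The following is a hedged sketch of such an external table. The endpoint, OSS path, and credentials are placeholders; verify the full property list against the linked topic.
CREATE TABLE oss_csv_gzip (
  id INT,
  name VARCHAR(64)
) ENGINE = 'OSS'
TABLE_PROPERTIES = '{
  "endpoint":"oss-cn-hangzhou-internal.aliyuncs.com",
  "url":"oss://<bucket>/<dir>/data.csv.gz",
  "accessid":"<AccessKey ID>",
  "accesskey":"<AccessKey Secret>",
  "delimiter":",",
  "compress_type":"gzip"
}';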
Is INSERT ON DUPLICATE KEY supported?
Yes. However, AnalyticDB for MySQL currently supports only value updates in the ON DUPLICATE KEY UPDATE clause, not arithmetic expressions such as col = col + 1.
Is Join supported in UPDATE statements?
This feature is supported only in AnalyticDB for MySQL clusters with a kernel version of 3.1.6.4 or later. For more information, see UPDATE.
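A hedged sketch (table and column names are illustrative):
-- Update rows in target with matching values from source:
UPDATE target t JOIN source s ON t.id = s.id
SET t.status = s.status;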
Can I set variables in SQL?
AnalyticDB for MySQL does not support setting variables in SQL.
Can I use the Logstash plugin to bulk insert data with an INSERT ON DUPLICATE KEY UPDATE statement?
Yes. When you use an INSERT ON DUPLICATE KEY UPDATE statement to bulk insert data, you do not need to add ON DUPLICATE KEY UPDATE after each VALUES() clause. You only need to add it after the last VALUES() clause.
For example, to bulk insert three records into the student_course table, run the following statement:
INSERT INTO student_course(`id`, `user_id`, `nc_id`, `nc_user_id`, `nc_commodity_id`, `course_no`, `course_name`, `business_id`)
VALUES(277943, 11056941, '1001EE1000000043G2T5', '1001EE1000000043G2TO', '1001A5100000003YABO2', 'kckm303', 'Industrial Accounting Practice V9.0--77', 'accounting'),
(277944, 11056943, '1001EE1000000043G2T5', '1001EE1000000043G2TO', '1001A5100000003YABO2', 'kckm303', 'Industrial Accounting Practice V9.0--88', 'accounting'),
(277945, 11056944, '1001EE1000000043G2T5', '1001EE1000000043G2TO', '1001A5100000003YABO2', 'kckm303', 'Industrial Accounting Practice V9.0--99', 'accounting')
ON DUPLICATE KEY UPDATE
course_name = 'Industrial Accounting Practice V9.0--77',
business_id = 'accounting';
What are the prerequisites for loading a built-in dataset?
The cluster must have at least 24 AnalyticDB Compute Units (ACUs) of reserved storage resources, and the user_default resource group must have at least 16 ACUs of reserved computing resources.
How can I tell if a built-in dataset is loaded successfully?
On the Job Development > SQL Development page, you can view the loading progress. The dataset is loaded successfully when the Load Built-in Dataset button is dimmed and the ADB_SampleData_TPCH database and its tables are visible on the Databases and Tables tab.
What should I do if loading a built-in dataset fails or takes a long time?
First, run the DROP TABLE <table_name>; SQL statement to delete each table in the ADB_SampleData_TPCH database. After all the tables are deleted, run the DROP DATABASE ADB_SampleData_TPCH; SQL statement to delete the database for the built-in dataset. After the ADB_SampleData_TPCH database is deleted, load the dataset again.
How can a standard database account use a built-in dataset?
The built-in dataset feature follows the permission management rules of AnalyticDB for MySQL. Even if a built-in dataset is loaded in the cluster, a standard database account cannot use it without permissions for the ADB_SampleData_TPCH database. A privileged account must grant the required permissions to the standard account. The grant statement is:
GRANT select ON ADB_SampleData_TPCH.* TO <user_name>;
How do I test a built-in dataset after it is loaded?
After the dataset is loaded, AnalyticDB for MySQL provides corresponding query scripts by default. You can run the sample query statements on the Scripts tab of the SQL Development page. For more information about the query statements, see TPC-H test dataset.
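For example, the following is a quick smoke test against a loaded TPC-H table, assuming the standard TPC-H table names such as lineitem; the sample scripts on the Scripts tab remain the authoritative versions.
-- Count line items per return flag in the sample TPC-H data:
SELECT l_returnflag, count(*) AS cnt
FROM ADB_SampleData_TPCH.lineitem
GROUP BY l_returnflag;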
To ensure data integrity, we recommend that you run only query operations on the ADB_SampleData_TPCH database. If the dataset loading status becomes abnormal due to DDL or DML changes, try deleting the ADB_SampleData_TPCH database and loading the dataset again.