Optimize the way in which full data is extracted - PolarDB

Overview

Data extraction is the process to read full data from PolarDB-X at a time by using code or a data export tool. Data can be extracted in the following scenarios:

All the data is exported to the downstream system by using a data export tool. PolarDB-X supports multiple data export tools.
When data is processed in PolarDB-X or multiple query results are returned to users at a time, external tools cannot be used. In this case, you must extract full data in PolarDB-X by using code.

This topic describes how to use code in PolarDB-X to extract data in an efficient manner. You can extract full data at a time or perform paged queries.

Extract full data

In most cases, the SQL statements that extract full data do not contain the shard keys of tables. If you perform a full table scan to extract full data, the amount of read data is positively correlated with the duration required to extract data. You can use hints to send the SQL statements to physical shards and pull data from the shards. This saves network or connection resources. The following Java code provides an example on how to use hints to extract full data. For more information, see How to use hints.

public static void extractData(Connection connection, String logicalTableName, Consumer<ResultSet> consumer)
    throws SQLException {

    final String topology = "show topology from {0}";
    final String query = "/*+TDDL:NODE({0})*/select * from {1}";

    try (final Statement statement = connection.createStatement()) {
        final Map<String, List<String>> partitionTableMap = new LinkedHashMap<>();
        // Get partition id and physical table name of given logical table
        try (final ResultSet rs = statement.executeQuery(MessageFormat.format(topology, logicalTableName))) {
            while (rs.next()) {
                partitionTableMap.computeIfAbsent(rs.getString(2), (k) -> new ArrayList<>()).add(rs.getString(3));
            }
        }
        // Serially extract data from each partition
        for (Map.Entry<String, List<String>> entry : partitionTableMap.entrySet()) {
            for (String tableName : entry.getValue()) {
                try (final ResultSet rs = statement
                    .executeQuery(MessageFormat.format(query, entry.getKey(), tableName))) {
                    // Consume data
                    consumer.accept(rs);
                }
            }
        }
    }
}

Perform paged queries

If you want to display the data in a list to users, the data must be displayed by page to make page loading faster. This prevents excessive redundant information from being displayed. This query method is called paged query. Relational databases cannot return data in tables by segment. To perform paged queries in an efficient manner, you must write query statements based on the characteristics of your databases.

The following code block provides an example of paged queries in MySQL. In this example, the paged query is performed by using limit offset and pageSize.

select * from t_order where user_id = xxx order by gmt_create, id limit offset, pageSize

The values of the gmt_create column may be duplicate. The id column included in the order by clause is used to ensure that the result is sorted in the correct order.

Note The query method works as expected when the table size is small. If the t_order table stores hundreds of thousands of records, the increasing number of pages slows down the data query. The speed may decrease to tens of milliseconds per query. When the data volume increases to millions of records, data is queried from the table in seconds. If the data volume continues to increase, the time for a query becomes unacceptable.

Issue analysis

When local secondary indexes (LSIs) are created on the user_id and gmt_create columns, the total number of data records scanned at a time is the sum of the values specified by the offset and pageSize parameters because only the user_id column is specified in the WHERE clause. If the value specified for the offset parameter increases, the number of records scanned is close to the total number of records in the table. This increases the amount of time required for the query. In distributed databases, you cannot increase the throughput of table sorting by adding data nodes.

Solution 1

You can specify that the query results on a page start from the point at which the previous page ends. This way, you can prevent a scan of the full table without the need to specify the offset parameter. The following code block provides an example of paged queries based on the ID column:

select * from t_order where id > lastMaxId order by id limit pageSize

In this example, no conditions are specified for the first page queried. When the second page is queried, the maximum value of the id column queried for the first page is specified in the query condition. This way, the database identifies the value of the lastMaxId parameter in the index column and returns the query results. The pageSize parameter specifies the number of records that you want to return.

Note If the ID column is the primary key or a unique key, you can use Solution 1 to perform paged queries. However, the solution has a shortcoming. If the ID column contains duplicate values, the records that contain the duplicate values may not be returned.

Solution 2

In MySQL, you can use Row Constructor Expression to compare the values of multiple columns (This also applies to PolarDB-X).

(c2,c3) > (1,1) 
is equivalent to 
c2 > 1 OR ((c2 = 1) AND (c3 > 1))

You can use the following syntax to perform paged queries:

select * from t_order 
where user_id = xxx and (gmt_create, id) > (lastMaxGmtCreate, lastMaxId)
order by user_id, gmt_create, id limit pageSize

When the first page is queried, no condition is specified. When the second page is queried, the maximum values of the gmt_create and id columns queried for the first page are specified in the query condition. Similar rules take effect on other pages. Row construction expressions can be used to handle duplicate values of the gmt_create column.

Note In the example, a composite index is created on the user_id and gmt_create columns, and the user_id column is added to the order by clause. This way, the optimizer can sort data by using the index column, thus improving the query performance. If a Row Constructor Expression contains null, the result of this expression is null. In this case, the OR expression must be used in the SQL statement that contains the Row Constructor Expression. In PolarDB-X, Row Constructor Expressions are used for column pruning only when Row Constructor Expressions contain only shard keys. In other scenarios, OR expressions are also required.

The following code block provides the best practice for performing paged queries on a PolarDB-X instance.

-- lastMaxGmtCreate is not null 
select * from t_order 
where user_id = xxx 
and (
      (gmt_create > lastMaxGmtCreate) 
      or ((gmt_create = lastMaxGmtCreate) and (id > lastMaxId))
    )
order by user_id, gmt_create, id limit pageSize

-- lastMaxGmtCreate is null
select * from t_order 
where user_id = xxx 
and (
      (gmt_create is not null)
      or (gmt_create is null and id > lastMaxId)
    )
order by user_id, gmt_create, id limit pageSize