This topic describes how to use code in applications to extract data in an efficient manner.

Overview

You can use code or a data export tool to read data out of PolarDB-X in bulk. This process is called data extraction. Data is extracted in the following scenarios:
  • A data export tool is used to export full data to a downstream system. PolarDB-X supports multiple data export tools.
  • Data is processed in your application, or query results are returned to users for browsing in batches. In these cases, external tools cannot be used, and you must write code that reads the data from PolarDB-X.

The rest of this topic describes how to use code to extract data from PolarDB-X efficiently, either by extracting full data in one pass or by running paged queries.

Extract full data

In most cases, the SQL statements that extract full data do not contain the shard keys of tables, so they trigger full table scans, and the time required to extract data grows with the amount of data that is read. Instead, you can use hints to send SQL statements directly to individual physical shards and pull the data shard by shard. This saves network and connection resources. The following example is written in Java. For more information, see Hints.

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.text.MessageFormat;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

public static void extractData(Connection connection, String logicalTableName, Consumer<ResultSet> consumer)
    throws SQLException {

    final String topology = "show topology from {0}";
    final String query = "/*+TDDL:NODE({0})*/select * from {1}";

    try (final Statement statement = connection.createStatement()) {
        final Map<String, List<String>> partitionTableMap = new LinkedHashMap<>();
        // Get partition id and physical table name of given logical table
        try (final ResultSet rs = statement.executeQuery(MessageFormat.format(topology, logicalTableName))) {
            while (rs.next()) {
                partitionTableMap.computeIfAbsent(rs.getString(2), (k) -> new ArrayList<>()).add(rs.getString(3));
            }
        }
        // Serially extract data from each partition
        for (Map.Entry<String, List<String>> entry : partitionTableMap.entrySet()) {
            for (String tableName : entry.getValue()) {
                try (final ResultSet rs = statement
                    .executeQuery(MessageFormat.format(query, entry.getKey(), tableName))) {
                    // Consume data
                    consumer.accept(rs);
                }
            }
        }
    }
}
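The per-shard query strings that the loop builds can be previewed without a database connection. The following minimal sketch mirrors the loop above, assuming the SHOW TOPOLOGY result has already been collected into a group-to-tables map; HintedQueryBuilder, the group names, and the physical table names are made-up placeholders:

```java
import java.text.MessageFormat;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class HintedQueryBuilder {

    // Build one hinted query per physical table, mirroring the extraction loop:
    // the NODE hint routes each statement to a single physical shard.
    public static List<String> buildQueries(Map<String, List<String>> partitionTableMap) {
        final String query = "/*+TDDL:NODE({0})*/select * from {1}";
        List<String> queries = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : partitionTableMap.entrySet()) {
            for (String tableName : entry.getValue()) {
                queries.add(MessageFormat.format(query, entry.getKey(), tableName));
            }
        }
        return queries;
    }

    public static void main(String[] args) {
        // Hypothetical topology: two groups, each holding two physical tables.
        Map<String, List<String>> topology = new LinkedHashMap<>();
        topology.put("GROUP_0", List.of("t_order_00", "t_order_01"));
        topology.put("GROUP_1", List.of("t_order_02", "t_order_03"));
        buildQueries(topology).forEach(System.out::println);
    }
}
```

A LinkedHashMap preserves insertion order, so the shards are extracted in a deterministic sequence, which makes the extraction easy to resume or to parallelize shard by shard.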

Paged queries

When data is displayed to users in a list, it must be split into pages so that pages load quickly and users are not flooded with redundant information. This query pattern is called a paged query. Relational databases cannot natively return the data in a table segment by segment, so to page through data efficiently you must write query statements that fit the characteristics of your database.

The following code block shows a common way to implement a paged query in MySQL. In this example, a LIMIT offset, pageSize clause is used:

select * from t_order where user_id = xxx order by gmt_create, id limit offset, pageSize

Because the gmt_create column may contain duplicate values, the id column is added to the ORDER BY clause to guarantee a stable sort order.

Note This query method works as expected only when the table is small. When the t_order table holds hundreds of thousands of records, queries become slower as the page number grows, and a single query may take tens of milliseconds. When the table grows to millions of records, a single query can take seconds, and the query time keeps growing with the data size until it becomes unacceptable.
Issue analysis

Assume that a local secondary index (LSI) is created on the user_id and gmt_create columns. Because only the user_id column is specified in the WHERE clause, each query scans offset + pageSize records. As the offset grows, the number of scanned records approaches the total number of records in the table, which increases the query time accordingly. In a distributed database, you also cannot add data nodes to increase the throughput of a sort over the entire table.
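The scan cost described above can be illustrated with an in-memory simulation: an OFFSET-based page must step past every skipped row before it returns anything. This sketch shows only the access pattern, not PolarDB-X internals; OffsetScanCost and its counter are hypothetical names:

```java
import java.util.ArrayList;
import java.util.List;

public class OffsetScanCost {

    static int scanned; // rows touched by the most recent call

    // Model of "LIMIT offset, pageSize" over an ordered index:
    // every skipped row is still read before the page can be returned.
    public static List<Integer> offsetPage(List<Integer> sortedIds, int offset, int pageSize) {
        scanned = 0;
        List<Integer> page = new ArrayList<>();
        for (int id : sortedIds) {
            scanned++;
            if (scanned > offset) {
                page.add(id);
                if (page.size() == pageSize) {
                    break;
                }
            }
        }
        return page;
    }

    public static void main(String[] args) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 1; i <= 100_000; i++) {
            ids.add(i);
        }
        offsetPage(ids, 0, 10);
        System.out.println(scanned);      // first page: 10 rows touched
        offsetPage(ids, 90_000, 10);
        System.out.println(scanned);      // deep page: 90010 rows touched
    }
}
```

The deep page touches offset + pageSize rows even though it returns only pageSize of them, which is exactly why query time grows with the page number.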

Solution 1

Each page starts from the point at which the preceding page ends. This way, the OFFSET clause is no longer needed, and the skipped records are never scanned. The following code block provides an example of a paged query based on the id column:

select * from t_order where id > lastMaxId order by id limit pageSize

No condition is specified for the first page. When the second page is queried, the maximum id value returned by the first page is passed as lastMaxId. The database locates lastMaxId in the index and returns the next pageSize records. The same rule applies to all subsequent pages. This method remains efficient regardless of the page number.
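The paging loop can be sketched in application code. The following simulation runs the keyset predicate (id > lastMaxId) over an in-memory sorted list; KeysetPaging and the sample ids are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;

public class KeysetPaging {

    // Model of "where id > lastMaxId order by id limit pageSize":
    // the page starts right after the previous page's maximum id.
    public static List<Integer> nextPage(List<Integer> sortedIds, int lastMaxId, int pageSize) {
        List<Integer> page = new ArrayList<>();
        for (int id : sortedIds) {
            if (id > lastMaxId) {
                page.add(id);
                if (page.size() == pageSize) {
                    break;
                }
            }
        }
        return page;
    }

    public static void main(String[] args) {
        List<Integer> ids = List.of(3, 7, 9, 12, 15, 21);
        int lastMaxId = Integer.MIN_VALUE; // no effective condition for the first page
        List<Integer> page;
        while (!(page = nextPage(ids, lastMaxId, 2)).isEmpty()) {
            System.out.println(page);
            lastMaxId = page.get(page.size() - 1); // carry the boundary forward
        }
    }
}
```

In a real database the predicate is an index seek, so each page costs roughly the same no matter how deep it is.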

Note Solution 1 runs efficient paged queries only if the id column is the primary key or a unique key. If the id column contains duplicate values, records that share a value with the page boundary may be skipped and never returned.
Solution 2

In MySQL, you can use row constructor expressions to compare the values of multiple columns, and PolarDB-X supports them as well. For more information, see Row Constructor Expression.

(c2, c3) > (1, 1)
-- is equivalent to
c2 > 1 OR (c2 = 1 AND c3 > 1)
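The equivalence can be verified mechanically. The following sketch evaluates the tuple comparison and the expanded OR form in plain Java (not a SQL engine) and checks that they agree:

```java
public class RowConstructorEquivalence {

    // Lexicographic tuple comparison: (a1, a2) > (b1, b2)
    public static boolean tupleGreater(int a1, int a2, int b1, int b2) {
        if (a1 != b1) {
            return a1 > b1;
        }
        return a2 > b2;
    }

    // Expanded OR form: a1 > b1 OR (a1 = b1 AND a2 > b2)
    public static boolean expanded(int a1, int a2, int b1, int b2) {
        return a1 > b1 || (a1 == b1 && a2 > b2);
    }

    public static void main(String[] args) {
        // Check that the two forms agree over a small grid of values.
        for (int a1 = 0; a1 <= 2; a1++) {
            for (int a2 = 0; a2 <= 2; a2++) {
                if (tupleGreater(a1, a2, 1, 1) != expanded(a1, a2, 1, 1)) {
                    throw new AssertionError(a1 + "," + a2);
                }
            }
        }
        System.out.println("forms agree");
    }
}
```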

You can use the following syntax to perform paged queries:

select * from t_order 
where user_id = xxx and (gmt_create, id) > (lastMaxGmtCreate, lastMaxId)
order by user_id, gmt_create, id limit pageSize

No condition is specified for the first page. When the second page is queried, the maximum gmt_create and id values returned by the first page are passed as lastMaxGmtCreate and lastMaxId. The same rule applies to all subsequent pages. The row constructor expression correctly handles duplicate values in the gmt_create column.

Note In this example, a composite index is created on the user_id and gmt_create columns, and the user_id column is added to the ORDER BY clause so that the optimizer can sort by the index. This improves query performance. However, if any value in a row constructor expression is NULL, the whole expression evaluates to NULL, so in that case the statement must be rewritten with OR expressions. In addition, PolarDB-X uses row constructor expressions for pruning only when the expressions contain only shard keys; in all other scenarios, rewrite them as OR expressions as well.

The following code block shows the best practice for paged queries on a PolarDB-X instance:

-- lastMaxGmtCreate is not null 
select * from t_order 
where user_id = xxx 
and (
      (gmt_create > lastMaxGmtCreate) 
      or ((gmt_create = lastMaxGmtCreate) and (id > lastMaxId))
    )
order by user_id, gmt_create, id limit pageSize

-- lastMaxGmtCreate is null
select * from t_order 
where user_id = xxx 
and (
      (gmt_create is not null)
      or (gmt_create is null and id > lastMaxId)
    )
order by user_id, gmt_create, id limit pageSize
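When paging in a loop, an application has to switch between the two statements above depending on whether the previous page ended on a NULL gmt_create value. The following in-memory sketch models that logic, assuming gmt_create values sort with NULLs first (MySQL's ascending NULL order); Order, NullAwareKeysetPaging, and the sample data are made up for illustration:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class NullAwareKeysetPaging {

    record Order(Long gmtCreate, long id) {} // gmtCreate as epoch millis; may be null

    // Pick the predicate by whether the previous page ended on a NULL
    // gmt_create, mirroring the two SQL statements, then take pageSize rows.
    public static List<Order> nextPage(List<Order> rows, Long lastMaxGmtCreate, long lastMaxId, int pageSize) {
        List<Order> page = new ArrayList<>();
        for (Order o : rows) {
            boolean match;
            if (lastMaxGmtCreate != null) {
                // (gmt_create > last) or (gmt_create = last and id > lastMaxId)
                match = o.gmtCreate() != null
                    && (o.gmtCreate() > lastMaxGmtCreate
                        || (o.gmtCreate().equals(lastMaxGmtCreate) && o.id() > lastMaxId));
            } else {
                // (gmt_create is not null) or (gmt_create is null and id > lastMaxId)
                match = o.gmtCreate() != null || o.id() > lastMaxId;
            }
            if (match) {
                page.add(o);
                if (page.size() == pageSize) {
                    break;
                }
            }
        }
        return page;
    }

    public static void main(String[] args) {
        List<Order> rows = new ArrayList<>(List.of(
            new Order(null, 1), new Order(null, 4),
            new Order(10L, 2), new Order(10L, 5), new Order(20L, 3)));
        // Match "order by gmt_create, id" with NULLs sorting first.
        rows.sort(Comparator.comparing(Order::gmtCreate, Comparator.nullsFirst(Comparator.<Long>naturalOrder()))
            .thenComparingLong(Order::id));
        Long lastGmt = null;
        long lastId = Long.MIN_VALUE; // no effective condition for the first page
        List<Order> page;
        while (!(page = nextPage(rows, lastGmt, lastId, 2)).isEmpty()) {
            System.out.println(page);
            Order last = page.get(page.size() - 1);
            lastGmt = last.gmtCreate();
            lastId = last.id();
        }
    }
}
```

The `is not null / is null` branch works because the sort places all NULL gmt_create rows before the non-NULL ones, so once the cursor leaves the NULL block it never returns to it.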