Important Because of the limitations of the open source HBase API, we recommend that only users who are familiar with the HBase API use secondary indexes in the way described in this topic. In most cases, we recommend that you use secondary indexes through the Lindorm SQL feature.

Overview

ApsaraDB for HBase supports rowkey (primary key) indexing: rows are sorted based on the binary order of their rowkeys. Based on this rowkey index, single-row queries, prefix scans, and range scans can be performed efficiently. However, if you want to query a table based on columns other than the rowkey, you must use filters to narrow down the rowkey range. Otherwise, a full table scan is performed, which wastes resources and increases the response time.
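For example, the following sketch contrasts the two access patterns by using the standard HBase Java client. The column family f, column c1, and the sample rowkeys and values are hypothetical placeholders used only for illustration.

import org.apache.hadoop.hbase.CompareOperator;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

// Range scan based on the rowkey: only the rows in [startRow, stopRow) are read.
Scan rangeScan = new Scan()
        .withStartRow(Bytes.toBytes("user0001"))
        .withStopRow(Bytes.toBytes("user0100"));

// Query based on a non-rowkey column: without an index, the filter is evaluated
// against every row, which results in a full table scan.
Scan filterScan = new Scan();
filterScan.setFilter(new SingleColumnValueFilter(
        Bytes.toBytes("f"), Bytes.toBytes("c1"),
        CompareOperator.EQUAL, Bytes.toBytes("someValue")));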

Several solutions are available for querying HBase data in multiple dimensions. For example, you can create a secondary index table based on the columns that you want to query. In this case, you must maintain the secondary index table yourself. You can also export the data to an external system, such as Solr or Elasticsearch, and then index the data there. Search engines such as Solr and Elasticsearch provide powerful ad hoc query capabilities.

However, if you only perform common queries based on a few columns, we recommend that you do not use search engines such as Solr or Elasticsearch, because this wastes resources. ApsaraDB for HBase Performance-enhanced Edition offers a lower-cost solution that uses native secondary indexes to query HBase data. Secondary indexes are built into ApsaraDB for HBase to support high throughput and high performance. This solution has been tested for many years in Alibaba Group business scenarios such as the Double 11 Shopping Festival and is suitable for indexing large amounts of data.

The following sections introduce two important terms related to indexes, and then describe how to use Data Definition Language (DDL) and Data Manipulation Language (DML) statements to manage secondary indexes. Examples are provided to describe advanced operations, such as how to sort signed numbers in a way that preserves their order after binary encoding and how to optimize query performance. The last section lists the limits on using secondary indexes.

Terms

Execute the following statements to create a primary table and index tables:

create table 'dt' (rowkey varchar, c1 varchar, c2 varchar, c3 varchar, c4 varchar, c5 varchar constraint pk primary key(rowkey));
create index 'idx1' on 'dt' (c1);
create index 'idx2' on 'dt' (c2, c3, c4);
create index 'idx3' on 'dt' (c3) include (c1, c2, c4);
create index 'idx4' on 'dt' (c5) include (ALL);
Note The preceding SQL statements are for reference only. HBase Shell does not support SQL statements. To create a table, use the HBase Shell commands or the HBase API.

Index columns

Index columns refer to primary key columns in an index table. For example, column c1 in index idx1 and columns c2, c3, and c4 in index idx2 are index columns. Index columns and their order determine query patterns that are supported by the index table. A single-column index table contains only one index column. A multi-column index table contains multiple index columns.

Included columns

If a column that is required by a query cannot be found in the index table, the system must also query the primary table. In a distributed scenario, this lookup on the primary table causes additional remote procedure calls (RPCs), which significantly increases the response time. If you want to reduce the query time at the cost of extra storage, create an index table that contains included columns from the primary table. This way, the system does not need to query the primary table if a query hits the index table and the index table includes all the required columns. An index that contains included columns is also referred to as a covering index. In this example, indexes idx3 and idx4 contain included columns.

As your business develops, you may add new columns to the primary table. In this case, you can use the syntax that is supported by ApsaraDB for HBase Performance-enhanced Edition to treat all the columns in the primary table as included columns. In this example, idx4 is created in this way: the index table automatically covers all the columns of the primary table, including columns that are added later, and treats them as included columns.

Manage secondary indexes

You can manage indexes by running DDL commands in HBase Shell or by calling the Java API. This section describes how to manage indexes in HBase Shell. The following commands are available:

  • Create a secondary index idx1 for the table dt and include all columns in the table. COVERED_ALL_COLUMNS is a keyword. This keyword cannot be used as a column name.
    create_index 'idx1', 'dt', {INDEXED_COLUMNS => ['f1:c2', 'f1:c3']}, {COVERED_COLUMNS => ['COVERED_ALL_COLUMNS']}
  • View the secondary index for the table dt.
    describe_index 'dt'
  • Disable the secondary index idx1 for the table dt. After idx1 is disabled, it is no longer synchronously updated when data in the table dt is written or modified.
    offline_index 'idx1', 'dt'
  • Delete the secondary index idx1 for the table dt.
    remove_index 'idx1', 'dt'

Create a secondary index

Specify index columns and included columns for the index table.

  • Create a secondary index idx1 for the table dt. idx1 contains two index columns: column c2 in column family f1 and column c3 in column family f2. No other columns are specified as included columns.
    create_index 'idx1', 'dt', {INDEXED_COLUMNS => ['f1:c2', 'f2:c3']}
  • Create a secondary index idx2 for the table dt. idx2 contains one index column: column c1 in column family f1. Column c2 in column family f2 is an included column.
    create_index 'idx2', 'dt', {INDEXED_COLUMNS => ['f1:c1']}, {COVERED_COLUMNS => ['f2:c2']}
  • Create a secondary index idx3 for the table dt. idx3 covers all columns in dt as included columns. COVERED_ALL_COLUMNS is a keyword. This keyword cannot be used as a column name.
    create_index 'idx3', 'dt', {INDEXED_COLUMNS => ['f1:c3']}, {COVERED_COLUMNS => ['COVERED_ALL_COLUMNS']}
Configure storage settings for the index table:

create_index 'idx1', 'dt', {INDEXED_COLUMNS => ['f1:c1', 'f2:c2']}, {DATA_BLOCK_ENCODING => 'DIFF', BLOOMFILTER => 'ROW', COMPRESSION => 'LZO'}

View index schemas

Run the list command to view a primary table and the associated index tables.
Run the describe_index command to view the schemas of all the index tables that are associated with the primary table.

Delete an index

Before you run the remove_index command to delete an index, you must run the offline_index command to disable the index. This process is similar to deleting a common HBase table, for which you must first disable the table. Example:

# Disable the index idx1 for the table dt. After the index idx1 for the table dt is disabled, idx1 will not be updated synchronously when the table dt is modified. 
offline_index 'idx1', 'dt'
# Delete the index table idx1.
remove_index 'idx1', 'dt'

Create indexes for a table that contains historical data

If you run the create_index command to create an index for a table that already contains data, the system automatically synchronizes the historical data from the primary table to the index table. This is time-consuming if the primary table contains a large amount of data. Note: The data synchronization that is triggered by the create_index command is performed on the server. Even if you terminate the command in HBase Shell, the data synchronization does not stop until all the data is synchronized.

The feature of asynchronously creating indexes will be available in the future. After this feature is released, historical data synchronization will no longer be triggered by the create_index command. Instead, you must trigger the data synchronization by running a separate command. You can then check whether the data is synchronized based on the index status. If the status of an index is active, the data is synchronized.

Execute DML statements to manage indexes

This section describes how to perform DML operations on indexed tables by using the Java API. For more information about how to use the HBase Java API to establish a connection, see Use the ApsaraDB for HBase Java API to access an ApsaraDB for HBase Performance-enhanced Edition instance.
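The examples in the following sections assume that a connection and a Table object have been obtained. The following minimal sketch uses the standard HBase client entry points. The endpoint value is a placeholder; the exact connection parameters for ApsaraDB for HBase Performance-enhanced Edition are described in the topic that is referenced above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Table;

Configuration conf = HBaseConfiguration.create();
// Replace the value with the connection string of your instance. See the referenced
// topic for the exact configuration keys that your client version requires.
conf.set("hbase.zookeeper.quorum", "your-instance-endpoint");

Connection connection = ConnectionFactory.createConnection(conf);
// dt is the primary table that is used in the following examples.
Table table = connection.getTable(TableName.valueOf("dt"));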

Insert data

You do not need to write data to index tables. When you insert data into the primary table, ApsaraDB for HBase Performance-enhanced Edition automatically synchronizes the write to all the associated index tables. This applies both when you insert new rows and when you update existing rows. The write operation does not return a response to the client until the data is synchronized between the primary table and the index tables. The synchronization has the following two possible outcomes, and a write example is provided after the list:

  • Strong consistency: After you insert data into a primary table, you can immediately query the updated data in the index tables.
  • Update or timeout: If the request times out, ApsaraDB for HBase cannot determine whether the data in the primary table and the index tables is consistent at that point in time. However, ApsaraDB for HBase ensures eventual consistency because each request either updates the primary table and the index tables together or times out.
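For example, the following sketch writes a row to the primary table dt by using the standard Put API. The column family f and the sample values are hypothetical, and table is the Table object from the preceding connection sketch. No additional call is required to update the index tables.

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

// Write a row to the primary table. The associated index tables are updated on the
// server side as part of this write.
Put put = new Put(Bytes.toBytes("row0001"));
put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("c1"), Bytes.toBytes("a"));
put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("c2"), Bytes.toBytes("b"));
table.put(put);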

Query data

Similar to relational databases, ApsaraDB for HBase Performance-enhanced Edition allows you to directly query primary tables instead of index tables. ApsaraDB for HBase Performance-enhanced Edition automatically selects the optimal index table based on the schemas of the index tables and the query patterns. Use Filter to define query conditions based on columns other than rowkeys. Example:

import org.apache.hadoop.hbase.CompareOperator;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

byte[] f = Bytes.toBytes("f");
byte[] c1 = Bytes.toBytes("c1");
byte[] value = Bytes.toBytes("yourExpectedValue");

// This is equivalent to select * from dt where f.c1 = value.
Scan scan = new Scan();
SingleColumnValueFilter filter = new SingleColumnValueFilter(f, c1, CompareOperator.EQUAL, value);
scan.setFilter(filter);
Important
  • If you use conditions such as LESS and GREATER, take note of how the numbers are sorted. For more information, see the "Sort signed numbers" section.
  • If the query can hit the index, the system automatically sets setFilterIfMissing to true. If the query fails to hit the index, the system scans the entire primary table for a given query.

You can use FilterList to combine AND and OR conditions in a statement. This enables a complex query. Example:

import org.apache.hadoop.hbase.filter.FilterList;

// This is equivalent to where f.c1 >= value1 and f.c1 < value2.
// value1 and value2 are byte[] values that are converted by using Bytes.toBytes.
FilterList filters = new FilterList(FilterList.Operator.MUST_PASS_ALL);
filters.addFilter(new SingleColumnValueFilter(f, c1, CompareOperator.GREATER_OR_EQUAL, value1));
filters.addFilter(new SingleColumnValueFilter(f, c1, CompareOperator.LESS, value2));

ApsaraDB for HBase Performance-enhanced Edition automatically selects an index table based on filters and index schemas.
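The scan is then executed in the same way as a scan without secondary indexes; the index selection is transparent to the client. The following sketch assumes the scan and filters objects from the preceding examples, the f and c1 byte arrays defined earlier, and the table object from the connection sketch.

import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.util.Bytes;

// Attach the combined filters and execute the scan. If the filters hit an index,
// the server reads the index table transparently.
scan.setFilter(filters);
try (ResultScanner scanner = table.getScanner(scan)) {
    for (Result result : scanner) {
        String rowkey = Bytes.toString(result.getRow());
        String c1Value = Bytes.toString(result.getValue(f, c1));
        System.out.println(rowkey + " -> " + c1Value);
    }
}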

Advanced features

Sort signed numbers

The native HBase API supports only one data type: byte[]. Byte arrays are sorted in binary (unsigned byte) order. Therefore, you must convert all data to byte[], and the sort order of the values must remain the same after the conversion. The HBase client provides the Bytes class to convert between byte[] and other data types. However, the Bytes class preserves the sort order only for zero and positive numbers. The int data type is used in the following examples.
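The following sketch illustrates the problem: after a negative int value and a positive int value are converted by Bytes.toBytes, a binary comparison of the byte arrays no longer reflects the numeric order.

import org.apache.hadoop.hbase.util.Bytes;

byte[] negative = Bytes.toBytes(-1); // 0xFF 0xFF 0xFF 0xFF
byte[] positive = Bytes.toBytes(1);  // 0x00 0x00 0x00 0x01

// Bytes.compareTo compares the arrays as unsigned bytes, so -1 incorrectly
// sorts after 1. Prints true.
System.out.println(Bytes.compareTo(negative, positive) > 0);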

Bytes.toBytes(int) encodes int values in their binary form. If the values include negative numbers, the sort order of the int values is not preserved after they are converted to byte[]. You can flip the sign bit of the numbers to resolve this issue. The same issue exists when you convert data of the byte, short, long, and float types. ApsaraDB for HBase Performance-enhanced Edition provides the org.apache.hadoop.hbase.util.OrderedBytes class to resolve this problem. This class is available in any HBase client that depends on alihbase-client or alihbase-connector. The following example shows how to use this class:

// Convert data of the int type to order-preserved byte[].
int x = 5;
byte[] bytes = OrderedBytes.toBytes(x);

// Convert byte[] to data of the int type.
int y = OrderedBytes.toInt(bytes);

For more information about how to use this class, see the comments (Javadoc) of the class.

Optimize queries

This section explains how ApsaraDB for HBase Performance-enhanced Edition selects an index for a given query. ApsaraDB for HBase Performance-enhanced Edition selects an index based on prefix matching, which is a rule-based optimization (RBO) policy. When an ApsaraDB for HBase Performance-enhanced Edition cluster receives a query, the cluster selects the index table whose index columns best match the query conditions by prefix. If you want to specify equality conditions and prefix match conditions in one statement, use the AND operator to combine the conditions. The following examples explain this rule.

Use the following statements to create a primary table and index tables:

create table 'dt' (rowkey varchar, c1 varchar, c2 varchar, c3 varchar, c4 varchar, c5 varchar constraint pk primary key(rowkey));
create index 'idx1' on 'dt' (c1);
create index 'idx2' on 'dt' (c2, c3, c4);
create index 'idx3' on 'dt' (c3) include (c1, c2, c4);
create index 'idx4' on 'dt' (c5) include (ALL);
Note The preceding SQL statements are for reference only. HBase Shell does not support SQL statements. To create a table, use the HBase Shell commands or the HBase API.

Use the following query statements:

select rowkey from dt where c1 = 'a';
select rowkey from dt where c2 = 'b' and c4 = 'd';
select * from dt where c2 = 'b' and c3 >= 'c' and c3 < 'f';
select * from dt where c5 = 'c';
Note The preceding SQL statements are for reference only. HBase Shell does not support SQL statements. To query a table, use the HBase Shell commands or the HBase API.

The queries are analyzed as follows:

select rowkey from dt where c1 = 'a'

The query hits the index idx1.

select rowkey from dt where c2 = 'b' and c4 = 'd'

The query hits the index idx2. The query uses the c2 = 'b' condition as a prefix match on idx2, and then filters the matched rows against the c4 = 'd' condition. Although column c4 is an index column of idx2, column c3 is not specified in the WHERE clause. As a result, the c4 = 'd' condition cannot extend the prefix match on idx2 and is evaluated as a filter.
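For reference, this query can be expressed with the HBase Java API as shown in the following sketch. The column family f is an assumption, because the SQL-style statements above do not specify column families.

import org.apache.hadoop.hbase.CompareOperator;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

// Equivalent to: select rowkey from dt where c2 = 'b' and c4 = 'd'
byte[] f = Bytes.toBytes("f");
FilterList conditions = new FilterList(FilterList.Operator.MUST_PASS_ALL);
conditions.addFilter(new SingleColumnValueFilter(f, Bytes.toBytes("c2"),
        CompareOperator.EQUAL, Bytes.toBytes("b")));
conditions.addFilter(new SingleColumnValueFilter(f, Bytes.toBytes("c4"),
        CompareOperator.EQUAL, Bytes.toBytes("d")));
Scan scan = new Scan();
scan.setFilter(conditions);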

select * from dt where c2 = 'b' and c3 >= 'c' and c3 < 'f'

The query hits the index idx2 because the c2 and c3 conditions match the prefix of idx2. However, this query uses the select * statement and the index table does not include all the columns of the primary table. Therefore, the system must also query the primary table after the index table is queried. When the system queries the primary table, it may make multiple remote procedure calls (RPCs) because the rowkeys obtained from the index table may be dispersed across the primary table. If a large amount of data is scanned, the response time increases.

select * from dt where c5 = 'c'

This query hits the index idx4. The index idx4 includes all the columns of the primary table as included columns. Therefore, even though the query uses the select * statement, the system does not need to query the primary table.

Therefore, you must create index tables based on query patterns. When you design index tables, take into account your future business changes. For more information about how to use indexes, see Database index design and optimization.

Limits

  • Different primary tables can have indexes that share the same name. For example, the primary tables dt and foo can each have an index named idx1. However, the indexes that are associated with the same primary table must have unique names.
  • You can create indexes only for primary tables that store data of one version.
  • If a TTL value is specified for a primary table, the same TTL value takes effect on the associated index tables. You cannot separately specify a TTL value for each index table.
  • An index can contain a maximum of three index columns.
  • The total size of index columns and rowkeys in the primary table cannot exceed 30 KB. We recommend that you do not use a column that is larger than 100 bytes as an index column.
  • A primary table can be associated with a maximum of five index tables.
  • Each query can hit a maximum of one index. Index-merge queries are not supported.
  • When you create an index table, the data in the primary table is synchronized to the index table. The create_index command is time-consuming if you create an index table for a large table. The feature of asynchronously creating indexes will be available soon.
  • When a Filter is used in a table scan, only BinaryComparator is supported as the comparator. If you use other comparators, the query is run against the primary table and no error is returned.
  • Secondary indexes do not support the batch increment feature.
  • When a query hits a secondary index, the returned rows are sorted based on the index columns of the index table. This order may be different from the rowkey order of the primary table.
  • If a query does not specify a primary key (rowkey) condition, the query may be routed only to the server where the first region of the table resides.

If you have any questions about using indexes, submit a ticket.