HBase supports rowkey (primary key) indexing, allowing you to sort rows based on the binary order of rowkeys. Based on rowkey indexing, row scans, prefix scans, and range scans can be performed efficiently. However, if you want to query a table based on a column other than rowkeys, you must use filters to narrow down the range of rowkeys. Otherwise, the entire table is scanned, resulting in a waste of resources and increased response time.

Many solutions are provided for users to query HBase data in multiple dimensions. For example, you can create a secondary index table. In this case, you need to manually maintain the secondary indexes. You can also export the data to be queried to an external system, such as Solr or Elasticsearch, and then index the data. Search engines such as Solr and Elasticsearch provide powerful Ad Hoc Query capabilities. ApsaraDB for HBase is integrated with the full-text index service. This service saves you the effort of deploying and maintaining an external search engine.

However, it is recommended that you do not use these power search engines in scenarios where regular query patterns are used to query tables containing only a few columns. ApsaraDB for HBase Performance-enhanced Edition offers a solution that uses native secondary indexes to query HBase data with lower costs. Secondary indexes are built in ApsaraDB for HBase to support high throughput and performance. This solution has been tested on businesses, such as Double 11 Shopping Festivals, in Alibaba for many years. It is suitable for indexing large amounts of data. The following figure compares the read/write performance of data indexed using ApsaraDB for HBase with that using Phoenix.

perf

The following sections introduce two important terms in indexes, and then describe how to use Data Definition Language (DDL) and Data Manipulation Language (DML) statements to manage secondary indexes. Advanced examples are provided to show how to sort rows based on the binary order of rowkeys and how to optimize the performance of queries. The last section lists the constraints and limits about using secondary indexes.

Terms

Create a primary table and four index tables as follows:

  1. create table dt (rowkey varchar, c1 varchar, c2 varchar, c3 varchar, c4 varchar, c5 varchar constraint pk primary key(rowkey));create index idx1 on dt (c1);create index idx2 on dt (c2, c3, c4);create index idx3 on dt (c3) include (c1, c2, c4);create index idx4 on dt (c5) include (ALL);

Index key columns

Index key columns refer to primary key columns in an index table. For example, column c1 in table idx1 and columns c2, c3, and c4 in table idx2 are index key columns. Index key columns and their order determine what kind of query patterns that an index table supports. A single-column index is an index table that contains only one index key column. A multi-column index (also known as a composite index) is an index table that contains more than one index key column. For more information about how to make queries hit a specified index table, see Query optimization.

Index included columns

If the system cannot find the required index key column in the index table, it scans the primary table. In a distributed search scenario, primary table scanning causes unnecessary RPC calls, which significantly increases the response time. To reduce the query time at the cost of storage space, you can configure an index table to cover the required columns in the primary table to avoid primary table scanning when a query hits the index table. Index included columns refer to columns of the primary table covered by an index table. For example, indexes idx3 and idx4 contain index included columns.

As your business develops, new columns may be added to the primary table. Therefore, ApsaraDB for HBase provides the syntax for you to treat all columns in a primary table as index included columns, as shown in idx4. The index table automatically covers all the columns of the primary table and treats them as index included columns.

Preparation

  • Cluster version: The version of your ApsaraDB for HBase cluster must be 2.1.10 or later. You can click the minor version of your cluster in the console to upgrade the cluster version.
  • Client version: The client version must be later than alihbase-client 1.1.9/2.0.4, or later than alihbase-connector1.0.9/2.09. For more information, see Install the SDK for Java.
  • HBase Shell version: The HBase Shell version must be later than alihbase-2.0.9-bin.tar.gz. You can visit HBase Shell to download the latest version of HBase Shell for ApsaraDB for HBase Performance-enhanced Edition.

Use DDL statements to manage indexes

You can use DDL statements to manage indexes in HBase Shell or Java API. This section describes how to run DDL statements in HBase Shell to manage indexes. For more information about how to use the Java API, see annotations about the AliHBaseUEAdmin operation.

The following lists some common DDL statements used to manage indexes:

  1. # Create an index# Create the index idx1 for the primary table dt: create index idx1 on dt (‘f1:c2’, ‘f1:c3’);# Include all the columns of the primary table. You must not use COVERED_ALL_COLUMNS as a column name because it is used as a keyword in this command.hbase(main):002:0> create_index ‘idx1’, ‘dt’, {INDEXED_COLUMNS => [‘f1:c2’, ‘f1:c3’]}, {COVERED_COLUMNS => [‘COVERED_ALL_COLUMNS’]}# View the index schema.hbase(main):002:0>describe_index ‘dt’# Disable the index table idx1 for the primary table dt. Consequently, when you update the primary table dt, idx1 is not updated.hbase(main):002:0>offline_index ‘idx1’, ‘dt’# Delete the index table idx1.hbase(main):002:0>remove_index ‘idx1’, ‘dt’

The commands are described in details as follows.

Create an index

  1. hbase(main):002:0>create_index idx1’, dt’, {INDEXED_COLUMNS => [‘f1:c2’, f2:c3’]}

Create the index table idx1 for the primary table dt. The index table contains two index key columns: column c2 in column family f1 and column c3 in column family f2. No other columns are specified as index included columns.

  1. hbase(main):002:0>create_index idx2’, dt’, {INDEXED_COLUMNS => [‘f1:c1’]}, {COVERED_COLUMNS => [‘f2:c2’]}

Create the index table idx1 for the primary table dt. The index table contains one index key column and one index included column: column c2 in column family f1 and column c3 in column family f2.

  1. hbase(main):002:0>create_index idx3’, dt’, {INDEXED_COLUMNS => [‘f1:c3’]}, {COVERED_COLUMNS => [‘COVERED_ALL_COLUMNS’]}

The index table idx3 covers all the columns of the primary table dt. Therefore, the system does not need to scan the primary table for a given query that hits idx3. The COVERED_ALL_COLUMNS keyword is specified to cover all the columns of the primary table. Therefore, you must not use it as a column name.

After you create the schema (index key columns and index included columns) of an index table, you must also configure storage settings for the index table. Example:

  1. hbase(main):002:0>create_index idx1’, dt’, {INDEXED_COLUMNS => [‘f1:c1’, f2:c2’]}, {DATA_BLOCK_ENCODING => DIFF’, BLOOMFILTER => ROW’,COMPRESSION => LZO’}

View the index schema

Run the list command to view a primary table and the associated index tables. Example:

Run the describe_index command to view the schemas of all index tables associated with a primary table.

Delete an index

You must run the offline index command to disable an index before you can run the remove_index command to delete the index. Example:

  1. # Disable the index table idx1 for the primary table dt. Consequently, when you update the primary table dt, the index idx1 is not updated.hbase(main):002:0>offline_index ‘idx1’, ‘dt’# Delete the index table idx1.hbase(main):002:0>remove_index ‘idx1’, ‘dt’

You cannot run the disable command to disable an index table. If the status of an index table is offline, no query hits the table.

Create indexes for a table that contains historical data

If you run the create_index command to create indexes for a table that contains historical data, the command automatically synchronizes the historical data from the primary table to the index tables. This can be time-consuming when the primary table contains a large amount of data. Note: The data synchronization task triggered by the create_index command runs at the server end. Even if you terminate the hbase.shell process, the task continues running until data synchronization is complete.

Creating indexes asynchronously will soon be supported. With this feature, historical data synchronization is no longer triggered by the create_index command. You need to manually trigger the task by running a command. You can check the progress of the synchronization task by verifying that the status of the index is active or not.

Manage indexes (DML)

This section describes how to run DML statements with the Java API to manage indexes. For more information about how to use the HBase Java API and connect to ApsaraDB for HBase, see Use the Java API to connect to ApsaraDB for HBase.

Insert data

You do not need to manually insert data into your index table. When you insert data to the primary table, ApsaraDB for HBase automatically synchronizes the data to all the associated index tables. ApsaraDB for HBase Performance-enhanced Edition defines the synchronization logic for you to synchronize updates to all the associated index tables when you write the primary table. The insert operation does not return a response to the client until the primary and index tables are finished updating. The logic can be interpreted as follows:

  • High consistency: After you complete writing the primary table, the updates can be immediately retrieved.
  • Update or time out: When ApsaraDB for HBase updates the primary and index tables, it does not know whether the primary and index tables contain the same data. For both the primary and index tables to reach data consistency at the final state, ApsaraDB for HBase chooses to update all of them or times out the write request.

Query data

Similar to relational databases, ApsaraDB for HBase Performance-enhanced Edition allows you to query primary tables instead of querying index tables. ApsaraDB for HBase Performance-enhanced Edition automatically selects the optimal index table based on the schemas of the index tables and the query patterns. Use Filter to define query conditions based on columns other than rowkeys. Example:

  1. byte[] f = Bytes.toBytes(“f”);byte[] c1 = Bytes.toBytes(“c1”);byte[] value = Bytes.toBytes(“yourExpectedValue”);// This is equivalent to select * from dt where f.c1 == value.Scan scan = new Scan();SingleColumnValueFilter filter = new SingleColumnValueFilter(f, c1, EQUAL, value);filter.setFilterIfMissing(true);scan.setFilter(filter);

Notes:

  • When you use conditions such as LESS and GREATER, you must pay close attention to how the numbers are sorted. For more information, see Sort signed numbers.
  • You must set the setFilterIfMissing parameter to true to use indexes. Otherwise, the system scans the entire primary table for a given query.

By using FilterList, you can combine AND and OR conditions in the statement to support more complex queries. Example:

  1. // This is equivalent to where f.c1 >= value1 and f.c1 < value2.FilterList filters = new FilterList(FilterList.Operator.MUST_PASS_ALL);filters.addFilter(new SingleColumnValueFilter(f, c1, GREATER_OR_EQUAL, value1));filters.addFilter(new SingleColumnValueFilter(f, c1, LESS, value2));

ApsaraDB for HBase Performance-enhanced Edition automatically selects an index table based on filters and index schemas. For more information, see the Query optimization section.

Advanced features

Sort signed numbers

The HBase-native API only supports one data type: byte[] (byte arrays). Numbers in byte arrays are sorted in binary order. Therefore, you need to convert all the data to byte[]. This involves preserving the order of how the numbers are sorted after you convert them to byte[] . The HBase client provides the Bytes class for you to switch between byte[] and other data types. However, it is only applicable to 0 and positive numbers. It does not support negative numbers. The following figure takes the INT type as an example.

  1. <p class="p" id="p-bso-heb-jy7"></p>
  2. <p class="p" id="p-yon-k1l-um6">In this figure, <code class="ph codeph" id="codeph-xqs-jqa-wld">Bytes.toBytes(int)</code> means <code class="ph codeph" id="codeph-h11-4jv-kx8">binary encoding</code>. When the numbers contain negative numbers, the order of how the <code class="ph codeph" id="codeph-fuq-iwd-yuj">INT</code> numbers are sorted cannot be preserved after they are converted to <code class="ph codeph" id="codeph-ar6-4oi-a2w">byte[]</code>. We can reverse the sign bit of these numbers to resolve this problem. You can use
  3. this solution to handle the same problem in <code class="ph codeph" id="codeph-xjo-bwx-6c1">byte</code>, <code class="ph codeph" id="codeph-dl0-qf3-mmr">short</code>, <code class="ph codeph" id="codeph-1k4-amj-15u">long</code>, and <code class="ph codeph" id="codeph-q09-tlo-vlw">float</code> data. ApsaraDB for HBase Performance-enhanced Edition provides a new class <code class="ph codeph" id="codeph-mbz-e75-cqi">org.apache.hadoop.hbase.util.OrderedBytes</code> for you to resolve this problem. You can find this class on any HBase clients that
  4. are reliant on alihbase-client or alihbase-connector. The following example shows
  5. how to use this class:
  6. </p>
  7. <pre class="pre codeblock language-java" id="codeblock-71w-dsc-42b"><code>// Convert INT to order-preserved byte[].

int x = 5;byte[] bytes = OrderedBytes.toBytes(x);

// Convert byte[] to INT.int y = OrderedBytes.toInt(bytes);

For more information about how to use this class, see the annotations of the class.

  1. </section>
  2. <section class="section" id="section-njz-q02-aku">
  3. <h2 class="title sectiontitle" id="title-0p1-7ik-vst">Query optimization</h2>
  4. <p class="p" id="p-bku-sdt-x3h">This section explains how ApsaraDB for HBase Performance-enhanced Edition selects
  5. indexes for a given query. This is equivalent to <code class="ph codeph" id="codeph-bck-egd-21e">prefix match</code>. It is a Rule Based Optimization (RBO) policy. When the cluster receives a query,
  6. it selects the index table with the highest matching rate based on the query conditions
  7. combined with AND and the prefixes of the matching index tables. The following examples
  8. are provided to explain this rule.
  9. </p>
  10. <p class="p" id="p-j42-5xq-oce">For example, you have created the following primary table and index tables:</p>
  11. <pre class="pre codeblock language-sql" id="codeblock-lct-6pm-23m"><code>create table 'dt' (rowkey varchar, c1 varchar, c2 varchar, c3 varchar, c4 varchar, c5 varchar constraint pk primary key(rowkey));

create index ‘idx1’ on ‘dt’ (c1);create index ‘idx2’ on ‘dt’ (c2, c3, c4);create index ‘idx3’ on ‘dt’ (c3) include (c1, c2, c4);create index ‘idx4’ on ‘dt’ (c5) include (ALL);

The following queries are sent to query the primary table:

  1. select rowkey from dt where c1 = a’;select rowkey from dt where c2 = b and c4 = d’;select from dt where c2 = b and c3 >= c and c3 < f’;select from dt where c5 = c’;

Now we analyze these queries in sequence:

  1. <p class="p" id="p-sjc-75w-ps2">1) <code class="ph codeph" id="codeph-zk7-6mr-73i">select rowkey from dt where c1 = 'a'</code></p>
  2. <p class="p" id="p-qph-1ml-lug">This query hits the index table <code class="ph codeph" id="codeph-3ad-swd-1a1">idx1</code>.
  3. </p>
  4. <p class="p" id="p-wc0-ast-g3i">2) <code class="ph codeph" id="codeph-wzs-8dv-fbj">select rowkey from dt where c2 = 'b' and c4 = 'd'</code></p>
  5. <p class="p" id="p-be6-5lh-dn7">This query hits the index table <code class="ph codeph" id="codeph-5dy-7fl-le0">idx2</code>. It looks for rows that meet the <code class="ph codeph" id="codeph-enl-15o-u21">c2 ='b'</code> condition, and then matches these rows against <code class="ph codeph" id="codeph-2t3-oi7-eqb">c4 = 'd'</code>. Column C4 is an index key column. However, column C3 is not defined in the WHERE
  6. condition. Consequently, the query fails to match the prefix of idx2.
  7. </p>
  8. <p class="p" id="p-iap-1tc-a57">3) <code class="ph codeph" id="codeph-p8l-tk3-bko">select * from dt where c2 = 'b' and c3 &gt;= 'c' and c3 &lt; 'f'</code></p>
  9. <p class="p" id="p-f6x-orm-r55">The query hits the index table <code class="ph codeph" id="codeph-jkb-abs-vvc">idx2</code> and completely matches the table. This query uses the <code class="ph codeph" id="codeph-6xz-4x7-f0h">select *</code> statement, but the index table does not include all the columns from the primary
  10. table. Therefore, the system must also look up the primary table. When the system
  11. looks up the primary table, it may need to make several RPC calls because the rowkeys
  12. may be dispersed in the primary table. If the system needs to scan a large amount
  13. of data, the response time increases accordingly.
  14. </p>
  15. <p class="p" id="p-zt1-iq2-o3x">4) <code class="ph codeph" id="codeph-38l-t5t-zti">select * from dt where c5 = 'c'</code></p>
  16. <p class="p" id="p-22s-9ux-23q">This query hits the index table <code class="ph codeph" id="codeph-wjr-zkl-nxh">idx4</code> and completely matches the table. The index table idx3 contains all the columns from
  17. the primary table. The query uses the <code class="ph codeph" id="codeph-dhp-dsm-xxd">select *</code> statement. Therefore, the system does not need to scan the primary table.
  18. </p>
  19. <p class="p" id="p-k90-384-dsu">Based on these conclusions, you need to consider the query patterns and then design
  20. your index tables accordingly. For more information about how to use indexes in a
  21. better manner, see "Database index design and optimization".
  22. </p>
  23. </section>
  24. <section class="section" id="section-xo1-6i2-a6o">
  25. <h2 class="title sectiontitle" id="title-fzz-qeu-nvt">Constraints and limits</h2>
  26. <ul class="ul" id="ul-73s-ss9-h1c">
  27. <li class="li" id="li-fvk-laz-yy3">Different primary tables can be associated with index tables with the same name. For
  28. example, the primary table dt can be associated with the index table idx1 and the
  29. primary table foo can also be associated with an index table using the same name.
  30. However, the names of the index tables associated with the same primary table must
  31. be unique.
  32. </li>
  33. <li class="li" id="li-zd3-ob7-vlx">You can only create indexes for primary tables that have one version (version = 1).</li>
  34. <li class="li" id="li-gua-5co-0d6">If a primary table is assigned a TTL value, its index tables are assigned the same
  35. TTL as the primary table. You cannot set a TTL for the index tables.
  36. </li>
  37. <li class="li" id="li-iou-y7x-tjp">An index table cannot contain more than 3 index key columns.</li>
  38. <li class="li" id="li-392-e2x-q8r">The total sizes of index key columns and rowkeys in the primary table must not be
  39. greater than 30 KB. We recommend that you do not use a column larger than 100 bytes
  40. as an index key column.
  41. </li>
  42. <li class="li" id="li-f2s-629-24z">A primary table cannot be associated with more than 5 index tables.</li>
  43. <li class="li" id="li-235-zzu-lwo">A query can only hit one index table. Index Merge queries are not supported.</li>
  44. <li class="li" id="li-cp4-wdn-wct">When you create an index table, the data in the primary table is synchronized to the
  45. index table. It will be time-consuming if you create an index table for a large data
  46. table. Creating indexes asynchronously will soon be supported.
  47. </li>
  48. </ul>
  49. <p class="p" id="p-8ey-hbe-t48">The following features are currently not released:</p>
  50. <ul class="ul" id="ul-q9l-f22-xwq">
  51. <li class="li" id="li-foz-e5i-z5o">Create indexes asynchronously: This feature only creates indexes. It does not synchronize
  52. historical data from the primary table to the index tables. You must run another command
  53. to create indexes on the historical data.
  54. </li>
  55. <li class="li" id="li-vcq-t2r-7ac">Insert data with custom timestamps.</li>
  56. </ul>
  57. <p class="p" id="p-x56-xyg-j8y">If you have any questions about using indexes, submit a ticket or consult the <code class="ph codeph" id="codeph-yg7-xr2-w6f">ApsaraDB for HBase Q&amp;A</code> DingTalk group.
  58. </p>
  59. </section>
  60. </div>
  61. </article>
  62. </main>