Phoenix provides online transaction processing (OLTP) and operational analytics capabilities that are ideal for online workloads that require high service stability. Phoenix is flexible and provides a variety of powerful features. This topic describes how to use Phoenix and provides some best practices.

User manual

Phoenix SQL is built on the SQL-92 standard, but it also supports syntax beyond the SQL-92 standard. Its software architecture differs from that of traditional standalone databases. We recommend that you read the related documentation before you start. The following section lists the references for Phoenix:
  1. Phoenix official documentation
  2. Articles and documents in Yunqi Community provided by the ApsaraDB for HBase team

Usage notes

  1. User guide for secondary indexes

    Secondary indexes play an important role in Phoenix. We recommend that you familiarize yourself with secondary indexes before you start. For more information, see Secondary Indexing, Documents about secondary indexes in Yunqi Community, and Best practices for global index design.

    The following items provide answers to frequently asked questions about how to use secondary indexes:
    1. Do I need to use a covered index?

      Covered indexes store the returned columns in the index table. If a query hits a covered index, Phoenix returns the data directly from the index table. For a non-covered index, Phoenix must query the primary table to obtain the results. Covered indexes reduce query response time, but they consume more storage and may adversely affect write performance. If you use a non-covered index, the execution plan may not hit the index. In this case, you must add an index hint.
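
      The following is a minimal sketch of both approaches. The ORDERS table, its columns, and the index names are hypothetical:

        -- Covered index: AMOUNT is stored in the index table, so a query that
        -- filters on CUSTOMER_ID and returns AMOUNT is served by the index alone.
        CREATE INDEX IDX_COVERED ON ORDERS (CUSTOMER_ID) INCLUDE (AMOUNT);

        -- Non-covered index: a query that also returns AMOUNT must go back to
        -- the primary table, and the optimizer may skip the index entirely.
        CREATE INDEX IDX_PLAIN ON ORDERS (CUSTOMER_ID);

        -- An index hint forces the optimizer to use the non-covered index:
        SELECT /*+ INDEX(ORDERS IDX_PLAIN) */ AMOUNT
        FROM ORDERS WHERE CUSTOMER_ID = '1001';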

    2. Do I need to use local indexes or global indexes?
      Each global index is backed by a separate HBase table, whereas a local index stores its index data in the primary table. Global indexes are suited to read-heavy scenarios, but synchronizing a global index incurs high network overhead. Local indexes are stored in the same table as the raw data, so they take less time to synchronize. Although local indexes can be used in some scenarios, we recommend that you use global indexes for the following reasons:
      1. Local indexes are not fully optimized in Phoenix, and may cause problems and pose risks to your applications.
      2. Incompatibilities may arise when changes are made to local indexes, and the process of upgrading local indexes is complex.
      3. In big data scenarios, storing raw data and index data together accelerates region splitting. After a region splits, the locality of the index data is destroyed, which means that the index data and the data that it indexes are no longer guaranteed to be on the same server.
      Therefore, the LOCAL INDEX feature has been disabled in the HBase SQL service.
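
      For reference, the two index types are declared as follows. This is a minimal sketch; the table and column names are hypothetical:

        -- Global index (the default): backed by a separate HBase table.
        CREATE INDEX IDX_G ON MY_TABLE (COL1);

        -- Local index: index data is stored in the primary table. As noted
        -- above, this feature is disabled in the HBase SQL service.
        CREATE LOCAL INDEX IDX_L ON MY_TABLE (COL1);
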
    3. What is the maximum number of index tables that can be created?

      We recommend that you create no more than 10 index tables. Index tables are synchronized in real time when you write data to the primary table, which generates a large number of additional read and write operations. If you need more than 10 index tables, we recommend that you use the HBase full-text index feature instead.

    4. What are the considerations when I create an index?

      When you execute the CREATE INDEX statement with the ASYNC keyword, the index is created asynchronously. After the statement is executed, the metadata of the index table is stored in the SYSTEM.CATALOG table, and the relationship between the index table and the primary table is established. However, the index is not yet built. Therefore, the index table contains no data and cannot serve queries. You must rebuild the index, for example by executing the REBUILD statement.
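
      For example, assuming a hypothetical table MY_TABLE, an asynchronous index is created and then built as follows. The IndexTool invocation is a sketch; adjust the paths and options to your environment:

        -- Create the index asynchronously: only metadata is written at this point.
        CREATE INDEX IDX_ASYNC ON MY_TABLE (COL1) ASYNC;

        -- Build the index data, for example with the MapReduce-based IndexTool:
        --   hbase org.apache.phoenix.mapreduce.index.IndexTool \
        --     --data-table MY_TABLE --index-table IDX_ASYNC --output-path /tmp/idx
        -- Alternatively, rebuild the index with a statement:
        ALTER INDEX IDX_ASYNC ON MY_TABLE REBUILD;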

  2. Considerations for salting
    Important The salting feature may cause some problems. We recommend that you do not use this feature. Before you start, make sure that you understand how salting works and the scenarios in which salting is used. If you have questions about salting scenarios, contact us on DingTalk.

    Salting is used in scenarios in which data hotspotting and range queries exist at the same time. For more information about how salting works, see Salted Tables and the document about salted tables in Yunqi Community.

    Salting has strict requirements for application scenarios. If you use salting in inappropriate scenarios, the effect of salting cannot meet your expectation. We recommend that you use salting in the following scenarios:
    • Write hotspotting or write imbalance: If the first column of the primary key is time-based, new data is always written to the head or the tail of the table.
    • Range query: If you want to perform range queries on the first column of the primary key, you cannot hash that column.

    Simple hashing cannot meet the hotspotting and range query requirements at the same time. To meet both requirements, you need a compromise solution that splits the data while preserving its order. Salting, also known as salt buckets, is such a solution. Data is stored sequentially within a bucket but distributed among buckets. Data modeling is performed based on the number of buckets. Each row is written to a bucket based on a hash of its key, so write requests are balanced among the buckets. When you query data, Phoenix reads all buckets to ensure the order and integrity of the result set.
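
    The following is a minimal sketch of a salted table; the table name, columns, and bucket count are hypothetical:

      -- Rows are distributed among 16 buckets based on a hash of the row key.
      -- Writes are balanced across the buckets, and rows stay ordered within
      -- each bucket.
      CREATE TABLE EVENTS (
          EVENT_TIME TIMESTAMP NOT NULL,
          METRIC     VARCHAR,
          VAL        DOUBLE,
          CONSTRAINT PK PRIMARY KEY (EVENT_TIME, METRIC)
      ) SALT_BUCKETS = 16;

      -- A range query on EVENT_TIME still works: Phoenix scans all 16 buckets
      -- and merges the results to preserve order.
      SELECT * FROM EVENTS WHERE EVENT_TIME > TO_TIMESTAMP('2020-01-01 00:00:00');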

    Generally, business scenarios that strictly meet the preceding conditions are uncommon. In most scenarios, you can use other columns to implement hashing. We recommend that you do not use the salting feature because of its side effects.

    The salting feature has the following side effects:
    • Write bottleneck: Generally, the number of regions that serve write requests equals the number of buckets. The number of buckets cannot be adjusted after the table is created, so no additional regions can be provided for write requests. Therefore, if workloads continuously grow, the write throughput cannot increase linearly when the cluster is scaled out. This creates a write bottleneck that limits business growth.
    • Read splitting: Each SELECT statement is split into concurrent operations based on the number of buckets, and each concurrent operation consumes a thread during execution. Excessive concurrent SELECT statements may exhaust the thread pool or trigger Full Garbage Collection (FGC) on QueryServers. For example, a simple query that could be completed with a single Remote Procedure Call (RPC) is split into multiple queries, which increases the query response time.

    The preceding side effects restrict business development, especially for large-scale, fast-growing businesses. Because the number of buckets cannot be adjusted, the write bottleneck may hinder business expansion, and the increased response time caused by read splitting greatly reduces the efficiency of resource utilization.

    Common misunderstandings:
    • Pre-splitting: The most common misunderstanding is to use salt buckets to achieve pre-splitting when a table is created, because the SPLIT ON syntax that is provided by Phoenix does not work well. Instead, you can use the HBase Shell to create a table with explicit split points and then associate it with a Phoenix table, as shown in the sketch after this list. In big data scenarios, proper pre-splitting is a challenge that will be discussed in a future topic.
    • Pseudo-hotspotting: The write hotspotting or imbalance is often not real, because other columns can still be used for hashing. For example, in monitoring scenarios, the hash of the metric name can be used as the first column of the primary key to balance write requests.
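
    The following sketch shows the pre-splitting approach in the HBase Shell; the table name, column family, and split points are hypothetical:

      # HBase Shell: create the table with explicit split points.
      create 'MY_TABLE', 'CF', SPLITS => ['1000', '2000', '3000']

    You can then map the pre-split table in Phoenix, as described in the mapping question at the end of this topic.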

    Do not use the salting feature for pre-splitting. Design tables based on the read and write patterns of your workloads. If you have questions, consult the ApsaraDB for HBase Q&A DingTalk group.

    The number of buckets depends on the server configuration and the data size. The following rules provide a guideline for calculating the number of buckets, where N is the number of Core nodes or RegionServers. A worked example follows the note below.
    • 8 GB memory per node: 2*N
    • 16 GB memory per node: 3*N
    • 32 GB memory per node: 4*N
    • 64 GB memory per node: 5*N
    • 128 GB memory per node: 6*N
    Note By default, index tables inherit the number of salt buckets of the primary table. The maximum number of buckets is 256, and the data structure of one empty region consumes about 2 MB of memory. You must evaluate the total number of regions that can be deployed on a single RegionServer. If you create a large number of salted tables on nodes that use low specifications, the cluster memory may be exhausted.
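
    For example, under these rules, a cluster that consists of 4 RegionServers with 32 GB of memory each would use 4*N = 4*4 = 16 buckets. A minimal sketch with a hypothetical table:

      CREATE TABLE T1 (ID VARCHAR PRIMARY KEY, V VARCHAR) SALT_BUCKETS = 16;
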
  3. Exercise caution when you use full table scans, OR operations, JOIN queries, and subqueries

    Although Phoenix supports a variety of JOIN queries, Phoenix is used as an online database. Complex JOIN queries, such as queries in which subqueries return a large amount of data or queries that join large tables, consume a large number of system resources during the calculation process. This adversely affects the online business and can even cause OutOfMemory exceptions. For users who require high online stability and real-time performance, we recommend that you use Phoenix only for simple queries. Queries must hit the primary key of the primary table or of an index table. We also recommend that you execute the EXPLAIN statement before you execute an SQL statement to check whether the query hits an index or the primary key. For more information, see Explain Plan and the document about the explain plan in Yunqi Community.
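
    For example, the following is a sketch with hypothetical table and index names; the exact plan text varies by version:

      EXPLAIN SELECT AMOUNT FROM ORDERS WHERE CUSTOMER_ID = '1001';
      -- A plan that contains RANGE SCAN OVER IDX_COVERED indicates that the
      -- query hits an index. A plan that contains FULL SCAN OVER ORDERS
      -- indicates a full table scan, which you should avoid.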

  4. Phoenix does not support complex queries

    Secondary indexes in Phoenix are used for prefix matching. You can create multiple secondary indexes to support more query patterns. The consistency of secondary indexes is maintained by a coprocessor, and index data is visible in real time. However, this adversely affects write performance, especially when you have created multiple indexes. Phoenix does not support complex queries, such as queries with arbitrary combinations of conditions that are connected by the AND or OR operator, fuzzy matching, and word segmentation.
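
    For example, prefix matching means that an index on (A, B) can serve queries that filter on A, or on both A and B, but not queries that filter only on B. A minimal sketch with hypothetical names:

      CREATE INDEX IDX_AB ON T2 (A, B);

      SELECT * FROM T2 WHERE A = '1';              -- hits IDX_AB (prefix match)
      SELECT * FROM T2 WHERE A = '1' AND B = '2';  -- hits IDX_AB
      SELECT * FROM T2 WHERE B = '2';              -- cannot use IDX_AB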

  5. Phoenix does not support complex analytics

    Phoenix is used for operational analytics. Phoenix is not suitable for complex analytics, such as complex JOIN queries. We recommend that you use Spark, a big data compute engine, for complex analytics. For more information, see Choose between HBase SQL (Phoenix) and Spark.

  6. Does Phoenix support mapping to an existing HBase table?

    Yes. For more information, see How do I map a Phoenix table to an existing HBase table? You can use Phoenix to create a view or a table that maps to an existing HBase table. However, if you use a table to map the HBase table and execute the DROP TABLE statement in Phoenix, the underlying HBase table is also deleted. Column families and column names are case-sensitive and must be mapped one-to-one. Otherwise, the mapping fails. Most of the column encoding methods of Phoenix are different from those of the HBase utility class for processing bytes. We recommend that you use mappings only if the mapped columns are of the VARCHAR data type. If columns of other data types exist, do not use mappings.
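
    A minimal sketch, assuming an existing HBase table named t1 that has a column family f and a VARCHAR column col1 (all names are hypothetical); choose one of the two mappings:

      -- A view is read-only and leaves the HBase table untouched when dropped.
      CREATE VIEW "t1" ("pk" VARCHAR PRIMARY KEY, "f"."col1" VARCHAR);

      -- A table mapping allows writes, but DROP TABLE in Phoenix also deletes
      -- the underlying HBase table. COLUMN_ENCODED_BYTES = 0 disables Phoenix
      -- column encoding so that column names match the existing qualifiers.
      CREATE TABLE "t1" ("pk" VARCHAR PRIMARY KEY, "f"."col1" VARCHAR)
          COLUMN_ENCODED_BYTES = 0;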