An Interpretation of the Source Code of OceanBase (7): Implementation Principle of Database Index

By Wenqu and Haiqian

The sixth article of this series (OceanBase Source Code Interpretation (6): Detailed Explanation of Storage Engines) explained the OceanBase storage engine in detail and answered questions about the OceanBase database.

The seventh article of this series briefly introduces the index build process of OceanBase from the perspective of the code and walks through the relevant index build code.

1. What Is an Index?

First of all, in a general database, what are the semantics of an index table?

Independent of the primary table (also called the data table), the index table is a redundant, ordered copy of part of the data, created to speed up certain queries. The index table can accelerate queries because it is sorted by index key. If a condition in a query statement matches a prefix of the index, the corresponding rows can be found quickly through binary search. The rows of the index table include the primary key information, so you can quickly find the complete rows of the primary table through the primary keys. This process is also called indexing back to the table.
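The lookup described above can be sketched in a few lines. This is a toy model, not OceanBase code: the table contents, the column names c1/c2/c3, and the helper lookup_by_c2 are all hypothetical and chosen only for illustration.

```python
import bisect

# Primary table: primary key -> full row (hypothetical toy data).
primary_table = {
    1: {"c1": 1, "c2": "pear", "c3": 10},
    2: {"c1": 2, "c2": "apple", "c3": 20},
    3: {"c1": 3, "c2": "pear", "c3": 30},
    4: {"c1": 4, "c2": "fig", "c3": 40},
}

# Index table on c2: each row is (index key, primary key), kept sorted.
index_table = sorted((row["c2"], pk) for pk, row in primary_table.items())


def lookup_by_c2(value):
    """Binary-search the index, then fetch full rows from the primary table."""
    # (value,) sorts before every (value, pk) tuple, so bisect_left lands on
    # the first index row whose key equals value (if any).
    pos = bisect.bisect_left(index_table, (value,))
    rows = []
    while pos < len(index_table) and index_table[pos][0] == value:
        pk = index_table[pos][1]
        rows.append(primary_table[pk])  # "indexing back to the table"
        pos += 1
    return rows
```

Calling lookup_by_c2("pear") touches only the matching index rows and then reads rows 1 and 3 from the primary table by primary key.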

After knowing the semantics of the index table, how do we create an index?

The index table (like the primary table) has its schema, memory structure, and persistent data (usually stored on the local disk). It may also have its location information in distributed scenarios. Creating an index means creating the schema of the index table and then creating its memory structure and persistent data at a certain location.

During index creation, we do not want to affect normal reads and writes on the primary table; the business stays online during the index build. Different vendors have different implementation mechanisms for online index builds. This article introduces how OceanBase implements them. Readers interested in other solutions can refer to the relevant documentation.

2. An Overview of the Index Build Process

2.1 The Perspective of the User

First, let's look at the process of index build from the user's perspective. For example, a user sends an index build statement "create index i1 on t1(c2)" on a session, and the user waits on the current session until the index construction succeeds or fails.

2.2 The Perspective of Central Observer

What does the process look like from the observer's perspective? First, the text of this statement is sent to a randomly chosen observer (obs). The obs that receives the statement is called the central obs. As with other statements, the parser and resolver first identify the index build statement as a DDL statement and then parse it into a structure like OceanBaseCreateIndexArg. OceanBase sends DDL statements to RootService (RS) for processing. Therefore, the central obs sends an RPC request for the index build to RS. This RPC request carries the OceanBaseCreateIndexArg.

After receiving the request, RS processes it through the ObRootService::create_index function. After RS completes the necessary synchronous processing, it sends the RPC response to the central obs, but the index build is not yet complete. RS advances the index build through asynchronous tasks. After receiving the response from RS, the central obs keeps querying the schema status of the index table to obtain the result of the index build. If the build succeeds, it reports success to the client. If the build fails, it sends the failure error code to the client.
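The waiting behavior of the central obs can be pictured as a polling loop. The sketch below is only an illustration: the status names are the ones used in this article, while wait_for_index, read_status, and max_polls are hypothetical and do not correspond to real OceanBase functions.

```python
# Status names follow the article; the polling helper itself is hypothetical.
UNAVAILABLE = "INDEX_STATUS_UNAVAILABLE"
AVAILABLE = "INDEX_STATUS_AVAILABLE"
ERROR = "INDEX_STATUS_INDEX_ERROR"


def wait_for_index(read_status, max_polls=100):
    """Poll the index schema status until the build succeeds or fails.

    read_status is a callable returning the current schema status of the
    index table. Returns True on success and False on failure.
    """
    for _ in range(max_polls):
        status = read_status()
        if status == AVAILABLE:
            return True
        if status == ERROR:
            return False
        # A real server would sleep or refresh its schema before retrying.
    raise TimeoutError("index build still in progress")
```

The point of the design is that the RPC to RS returns early, and the session learns the final outcome only from the schema status that the asynchronous build eventually publishes.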

3. The Synchronous Processing of RS

As mentioned above, RS completes some synchronous processing before replying to the central obs. This section looks at the specifics of that part.

Call path:

ObRootService::create_index -> ObIndexBuilder::create_index -> ObIndexBuilder::do_create_index -> ObIndexBuilder::do_create_local_index
                                                                                               -> ObIndexBuilder::do_create_global_index

The call path above performs some defensive checks. For example, OceanBase does not support building indexes on system tables or on tables in the recycle bin, and it rejects the request if the number of indexes on the table exceeds the upper limit. After the checks, the local or global index build process is selected based on the index type.

3.1 The Concepts of Global and Local

What is the difference between a global index and a local index? The main difference is that local indexes are partition-level: partitions of the index table correspond one-to-one to partitions of the primary table. Global indexes are table-level: partitions of the global index table have no correspondence with the partitions of the primary table.

In a word, local is at the partition level, and global is at the table level. For example, table t1 has two hash partitions. If a local index i1 is created on t1, it must have two partitions: the first partition of i1 indexes the first partition of t1, and the second partition of i1 indexes the second partition of t1. If you create a global index i2 on t1, i2 can have one partition or multiple partitions, and its partitions do not correspond to those of the primary table.

As the partitions of a local index correspond one-to-one to the partitions of the primary table, OceanBase closely binds the partitions of the local index to the partitions of the primary table. This way, the location information of the primary table partitions and the index table partitions is consistent (they are on the same machine), thus avoiding cross-machine distributed transactions. For the same reason, there is an optimization for the global index when selecting the index build path above: if both the primary table and the index table of a global index are non-partitioned, the global index can follow the local index build process.
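The path selection rule can be written down as a small decision function. This is a sketch of the rule as stated in the text, not the actual logic in ObIndexBuilder; the function name and its boolean parameters are hypothetical.

```python
def choose_build_path(index_type, data_table_partitioned, index_table_partitioned):
    """Return "local" or "global": the build path an index would follow.

    index_type is "local" or "global". The two flags say whether the primary
    (data) table and the index table are partitioned.
    """
    if index_type == "local":
        return "local"
    # Optimization from the text: a global index whose primary table and
    # index table are both non-partitioned can take the local build path.
    if not data_table_partitioned and not index_table_partitioned:
        return "local"
    return "global"
```

With a single partition on each side, the index partition can be bound to the primary table partition exactly as a local index would be, which is why the cheaper path is safe in that case.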

3.2 Control Tasks That Generate Global Indexes

Call path of key functions:

ObIndexBuilder::do_create_global_index -> ObIndexBuilder::generate_schema
                                       -> ObDDLService::create_global_index -> ObDDLService::generate_global_index_locality_and_primary_zone
                                                                               -> ObDDLService::create_user_table -> ObDDLService::create_table_in_trans -> ObDDLOperator::create_table
                                                                                                                                                      -> ObDDLService::create_table_partitions
                                                                                                               -> ObDDLService::publish_schema
                                       -> ObIndexBuilder::submit_build_global_index_task -> ObGlobalIndexBuilder::submit_build_global_index_task

ObIndexBuilder::generate_schema is responsible for generating the basic information of the index table schema; most other information comes from the primary table. For the index table, the focus is on column information. A normal index table includes only index columns and primary key columns, with duplicate primary key columns omitted. A unique index must contain index columns, hidden key columns, and primary key columns. What is the hidden key? It exists to handle comparison when the value of an index column is null.

In SQL semantics, null is not equal to null, but in code-level comparison two nulls compare as equal. If an index column contains null, the hidden key column stores the actual primary key value. If the index column is not null, the hidden key column stores null. Comparing index keys together with the hidden key column then yields the required semantics (null is not equal to null). Note that generate_schema only generates the in-memory schema object of the index table. At this point the index table is not yet available, so its state is set to INDEX_STATUS_UNAVAILABLE in the schema.
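The hidden key rule is easy to see in a small model. The sketch below assumes a single index column and a single primary key column, and uses Python's None for SQL null; unique_index_key and violates_uniqueness are hypothetical helper names.

```python
def unique_index_key(index_value, primary_key):
    """Build the comparison key of a unique index row: (index column, hidden key)."""
    # Null index value: the hidden key column carries the primary key, so two
    # null rows never compare equal. Otherwise the hidden key column is null.
    hidden = primary_key if index_value is None else None
    return (index_value, hidden)


def violates_uniqueness(rows):
    """rows is an iterable of (index_value, primary_key) pairs.

    Returns True if two rows produce the same unique index key.
    """
    seen = set()
    for index_value, pk in rows:
        key = unique_index_key(index_value, pk)
        if key in seen:
            return True
        seen.add(key)
    return False
```

Two rows with a null index value get keys (None, 1) and (None, 2) and therefore coexist, while two rows with the value "a" both map to ("a", None) and collide.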

After the index table schema is generated, you need to write the schema to the internal table. This step is done by ObDDLOperator::create_table.

Then, a memory structure for the index table must also be created on the relevant machines. The location information of the index table is generated through ObDDLService::generate_global_index_locality_and_primary_zone, and RPCs are sent to the target machines through ObDDLService::create_table_partitions to notify them to create the memory structure of each partition of the index table, including the memtable, the table_store, and the mapping from partition_key to table_store. Then, other machines are notified to refresh the schema through ObDDLService::publish_schema.

After the schema and memory structure of the index table are created, the control task for global index data completion is submitted to the queue through ObGlobalIndexBuilder::submit_build_global_index_task. This control task later drives the data completion process of the global index.

When the control task is submitted, submit_build_global_index_task creates a task record in the internal table __all_index_build_stat, and later status changes of the control task are also written to this table.

The global index control task is executed by the ObGlobalIndexBuilder. This thread pool only has one thread, and the queue length is limited by memory (no memory upper limit is set). The entry point for task execution is ObGlobalIndexBuilder::run3 -> ObGlobalIndexBuilder::try_drive.

3.3 Control Tasks That Generate Local Indexes

Call path of key functions:

ObIndexBuilder::do_create_local_index -> ObIndexBuilder::generate_schema
                                      -> ObDDLService::create_user_table -> ObDDLService::create_table_in_trans -> ObDDLOperator::create_table
                                                                                                                -> ObDDLService::create_table_partitions
                                                                         -> ObDDLService::publish_schema
                                      -> ObIndexBuilder::submit_build_local_index_task -> ObRSBuildIndexScheduler::push_task

The procedures for generating schema and creating memory objects for local indexes are almost the same as for global indexes. The only difference is that local indexes do not need to generate index table location information. Other processes are not described here.

After the schema and memory structure of the index table are created, the control task ObRSBuildIndexTask of the local index is put into the queue through ObRSBuildIndexScheduler::push_task. At the same time, the internal table __all_index_build_stat is updated.

The ObDDLTaskExecutor is responsible for executing the control task of the local index. This executor only has one thread, and the queue length is limited by memory (the upper limit of memory is 1GB). The entry for task execution is ObDDLTaskExecutor::run1 -> ObRSBuildIndexTask::process.

4. The Process of the Global Index Build

The control task of the global index, ObGlobalIndexTask, is driven by a simple state machine that executes the corresponding function for each task state. The overall idea is to build the baseline data on one copy (replica) of the index table first, copy the baseline data to the other copies, perform the necessary consistency and uniqueness checks, and then let the index take effect.

Code path:

process_function                             task_status
                                   ----------------------------------------------------------------------------
ObGlobalIndexBuilder::try_drive -> try_build_single_replica                     GIBS_BUILD_SINGLE_REPLICA
                                -> try_copy_multi_replica                       GIBS_MULTI_REPLICA_COPY
                                -> try_unique_index_calc_checksum               GIBS_UNIQUE_INDEX_CALC_CHECKSUM
                                -> try_unique_index_check                       GIBS_UNIQUE_INDEX_CHECK
                                -> try_handle_index_build_take_effect           GIBS_INDEX_BUILD_TAKE_EFFECT
                                -> try_handle_index_build_failed                GIBS_INDEX_BUILD_FAILED
                                -> try_handle_index_build_finish                GIBS_INDEX_BUILD_FINISH
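The state advancement in the table above can be modeled as a tiny state machine. The state names come from the table; the transition rules are an assumption made for illustration (success moves to the next state, any failure moves to GIBS_INDEX_BUILD_FAILED, and the failed state then proceeds to GIBS_INDEX_BUILD_FINISH so that cleanup always runs). The drive function and its handlers argument are hypothetical.

```python
# Task states from the article, in the order a successful build visits them.
STATES = [
    "GIBS_BUILD_SINGLE_REPLICA",
    "GIBS_MULTI_REPLICA_COPY",
    "GIBS_UNIQUE_INDEX_CALC_CHECKSUM",
    "GIBS_UNIQUE_INDEX_CHECK",
    "GIBS_INDEX_BUILD_TAKE_EFFECT",
    "GIBS_INDEX_BUILD_FINISH",
]
FAILED = "GIBS_INDEX_BUILD_FAILED"
FINISH = "GIBS_INDEX_BUILD_FINISH"


def drive(handlers):
    """Advance a task through its states and return the states visited.

    handlers maps a state name to a callable returning True on success.
    States without a handler are treated as succeeding.
    """
    visited = []
    state = STATES[0]
    while True:
        visited.append(state)
        if state == FINISH:
            return visited
        if state == FAILED:
            state = FINISH  # assumed: cleanup also runs after a failure
        elif handlers.get(state, lambda: True)():
            state = STATES[STATES.index(state) + 1]
        else:
            state = FAILED
```

For example, drive({"GIBS_MULTI_REPLICA_COPY": lambda: False}) visits the first two states, then the failed state, then the finish state.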

4.1 Single Copy Build

Single copy build refers to completing the index table baseline data on one copy. According to the LSM-Tree structure of OceanBase, the baseline data here refers to the major SSTable of the index table.

Code path:

ObGlobalIndexBuilder::try_build_single_replica -> launch_new_build_single_replica -> get_global_index_build_snapshot -> do_get_associated_snapshot
                                                                                  -> hold_snapshot
                                                                                  -> update_task_global_index_build_snapshot
                                                                                  -> do_build_single_replica -> ObRootService::submit_index_sstable_build_task
                                               -> drive_this_build_single_replica -> ObIndexChecksumOperator::check_column_checksum

To build a single copy, you must select a snapshot point such that all DML operations (incremental data) on the primary table after the snapshot point can see the index table. In other words, DML operations after the snapshot point also modify the index table. However, the index table is not yet available to queries at this stage. This write-only behavior is the key to how OceanBase implements online index builds.

The baseline data (existing data) is then constructed based on this snapshot point. Because an LSM-Tree query merges multiple layers of data, and merging is idempotent, the integrity of the index table data is guaranteed. How is the snapshot point obtained? Assume the schema_version is v1 when the index table is created. You need to wait until all transactions that depend on schema_version <= v1 are completed to get that snapshot point. The do_get_associated_snapshot function sends RPCs to the leaders of the primary table partitions to ask whether these transactions are complete. The obs receiving the request processes it through the ObService::check_schema_version_elapsed interface. do_get_associated_snapshot waits for all RPCs to return through wait_all. Note: the RPCs here are sent in batches and are synchronous, so a very large number of partitions may block the thread that drives index tasks.

You need to hold the snapshot point to ensure the selected snapshot point is not released during the single copy construction process. If the snapshot is held for too long, the number of table_store objects may grow sharply. Then, the selected snapshot point is recorded in the internal table __all_index_build_stat. Finally, a build task, ObIndexSSTableBuildTask, is submitted for the baseline data of the index table.

After submitting the baseline data completion task, check the task status through drive_this_build_single_replica. Once the baseline data construction is completed, check the data consistency of the primary table and index table through checksums.
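The interplay of snapshot, baseline, and write-only incremental data can be illustrated with a toy model. Everything below is hypothetical and greatly simplified: a write is a (version, primary key, c2 value) triple, a delete is a write whose value is None, and the index is reduced to a mapping from primary key to the indexed value.

```python
def build_baseline(primary_history, snapshot):
    """Index content visible at snapshot: newest write per key with version <= snapshot."""
    latest = {}
    for version, pk, c2 in sorted(primary_history, key=lambda w: w[0]):
        if version <= snapshot:
            latest[pk] = c2  # c2 is None for a delete
    return {pk: c2 for pk, c2 in latest.items() if c2 is not None}


def read_index(baseline, incremental):
    """LSM-style read: newer incremental writes override the baseline."""
    merged = dict(baseline)
    for version, pk, c2 in sorted(incremental, key=lambda w: w[0]):
        if c2 is None:
            merged.pop(pk, None)
        else:
            merged[pk] = c2
    return merged


history = [(1, 1, "a"), (2, 2, "b"), (3, 1, "c"), (4, 2, None), (5, 3, "d")]
snapshot = 3
baseline = build_baseline(history, snapshot)

# Writes after the snapshot reach the write-only index as incremental data.
after_snapshot = [w for w in history if w[0] > snapshot]
# Idempotence: also replaying a write already covered by the baseline is harmless.
overlapping = [w for w in history if w[0] >= snapshot]
```

In this model, read_index(baseline, after_snapshot) and read_index(baseline, overlapping) both yield the same content as building the index from the full history, which is the property the snapshot point is chosen to guarantee.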

4.2 Baseline Data Completion

The task, ObIndexSSTableBuildTask, is executed by the IdxBuild thread pool. The task queue length is 4096, and the number of threads is 16.

Look at the ObIndexSSTableBuildTask execution process and the code path:

ObIndexSSTableBuildTask::process -> ObIndexSSTableBuilder::init
                                 -> ObIndexSSTableBuilder::build -> ObCommonSqlProxy::execute -> ObInnerSQLConnection::execute -> ObInnerSQLConnection::query -> ObInnerSQLConnection::do_query -> ObIndexSSTableBuilder::ObBuildExecutor::execute -> ObIndexSSTableBuilder::build
                                                                                                                               -> ObIndexSSTableBuilder::ObBuildExecutor::process_result -> ObResultSet::get_next_row
                                 -> ObGlobalIndexBuilder::on_build_single_replica_reply

The function, ObIndexSSTableBuilder::build, is executed synchronously. A maximum of 16 baseline completion tasks are executed simultaneously in the system. After the execution is completed, the status of the baseline completion task is changed by on_build_single_replica_reply.

The code path above seems complicated, but in essence it constructs a physical execution plan through ObIndexSSTableBuilder::build and executes it through ObResultSet::get_next_row. The following code path shows how the physical execution plan is generated. Constants starting with PHY denote physical operator types.

ObIndexSSTableBuilder::build -> generate_build_param -> split_ranges
                                                     -> store_build_param
                                                     
                             -> gen_data_scan               PHY_TABLE_SCAN_WITH_CHECKSUM
                                                            PHY_UK_ROW_TRANSFORM
                                                           
                             -> gen_data_exchange           PHY_DETERMINATE_TASK_TRANSMIT
                                                            PHY_TASK_ORDER_RECEIVE
                                                           
                             -> gen_build_macro             PHY_SORT
                                                            PHY_APPEND_LOCAL_SORT_DATA
                                                           
                             -> gen_macro_exchange          PHY_DETERMINATE_TASK_TRANSMIT
                                                            PHY_TASK_ORDER_RECEIVE
                                                           
                             -> gen_build_sstable           PHY_APPEND_SSTABLE
                             
                             -> gen_sstable_exchange        PHY_DETERMINATE_TASK_TRANSMIT
                                                            PHY_TASK_ORDER_RECEIVE

The final physical execution plan is shown below:

coordinator                                   |   ObTaskOrderReceive
  transmit                                    |   ObDeterminateTaskTransmit
    append_sstable                            |   ObTableAppendSSTable
      receive                                 |   ObTaskOrderReceive
        transmit_macro_block                  |   ObDeterminateTaskTransmit
          append_local_sort_data              |   ObTableAppendLocalSortData
            sort                              |   ObSort
              receive                         |   ObTaskOrderReceive
                transmit_by_range             |   ObDeterminateTaskTransmit
                  table_scan_with_checksum    |   ObTableScanWithChecksum
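Read bottom-up, the plan scans the primary table, routes rows by key range, sorts each range locally, and appends the sorted runs into the final SSTable. The sketch below mimics that data flow on in-memory lists. It is not the operator implementation: the function names, the split_points parameter, and the CRC-based checksum formula are assumptions made for illustration.

```python
import bisect
import zlib


def scan_with_checksum(rows):
    """Turn (pk, c2) primary rows into (c2, pk) index rows plus a column checksum."""
    checksum = 0
    index_rows = []
    for pk, c2 in rows:
        # Hypothetical order-independent checksum of the indexed column.
        checksum = (checksum + zlib.crc32(repr(c2).encode())) & 0xFFFFFFFF
        index_rows.append((c2, pk))
    return index_rows, checksum


def build_index_sstable(rows, split_points):
    """Scan, transmit by range, sort each range locally, then append the runs."""
    index_rows, checksum = scan_with_checksum(rows)
    # Transmit by range: route each row to the task that owns its key range.
    buckets = [[] for _ in range(len(split_points) + 1)]
    for row in index_rows:
        buckets[bisect.bisect_right(split_points, row[0])].append(row)
    # Each task sorts only its own range.
    sorted_runs = [sorted(bucket) for bucket in buckets]
    # Ranges are disjoint and ordered, so concatenation is globally sorted.
    sstable = [row for run in sorted_runs for row in run]
    return sstable, checksum
```

Splitting by range before sorting is what lets the sort run in parallel while still producing one globally ordered result without a final merge.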

4.3 Multi-Replica Copy

Code path:

ObGlobalIndexBuilder::try_copy_multi_replica -> launch_new_copy_multi_replica -> build_task_partition_sstable_stat -> generate_task_partition_sstable_array
                                             -> drive_this_copy_multi_replica -> check_partition_copy_replica_stat
                                                                              -> build_replica_sstable_copy_task -> ObCopySSTableTask::build
                                                                                                                 -> ObRebalanceTaskMgr::add_task

Multi-replica copy is the process of copying the baseline data built during single-replica construction to the other replicas. The actual data copy is performed by ObCopySSTableTask, which is executed by the ObRebalanceTaskMgr of RS. The entry point is ObCopySSTableTask::execute, which sends the copy_sstable_batch RPC. On the obs that receives the RPC, the execution entry is ObService::copy_sstable_batch. After the baseline data copy task is completed, the obs reports the result to RS. RS executes the callback ObGlobalIndexBuilder::on_copy_multi_replica_reply and updates the status of the multi-replica copy task.

4.4 Uniqueness Check

For a unique index, you need to check the uniqueness of the data in index columns. You do not need to perform this check for a non-unique index.

Code path:

ObGlobalIndexBuilder::try_unique_index_calc_checksum -> launch_new_unique_index_calc_checksum -> get_checksum_calculation_snapshot -> do_get_associated_snapshot
                                                                                              -> do_checksum_calculation -> build_task_partition_col_checksum_stat
                                                                                                                         -> send_checksum_calculation_request -> send_col_checksum_calc_rpc
                                                     -> drive_this_unique_index_calc_checksum

You need to select a snapshot point to check uniqueness. After this snapshot point, every DML operation (incremental data) on the primary table can see the baseline of the index table, so uniqueness is checked as part of the DML process itself. For the data before this snapshot point (existing data), the column checksums of the primary table and the index table are calculated at this snapshot point, and uniqueness is checked by comparing them. The snapshot point must be one at which new transactions on all copies can see the baseline data. Assume the maximum timestamp at which each copy can see the baseline data is sstable_ts. You need to wait until the context creation timestamps of all transactions have passed sstable_ts. The function get_checksum_calculation_snapshot completes the preceding operations and checks whether the transaction context creation timestamps have passed sstable_ts through the entry ObPartitionService::check_ctx_create_timestamp_elapsed.

After the snapshot point is available, an RPC is sent to ask the leaders of the primary table and index table to calculate the column checksums at the snapshot point. The processing entry on the obs that receives the RPC is ObService::calc_column_checksum_request. After the calculation is completed, the obs records the column checksum in the internal table __all_index_checksum and notifies RS through RPC. RS executes the callback ObGlobalIndexBuilder::on_col_checksum_calculation_reply to update the status of the checksum calculation task. drive_this_unique_index_calc_checksum continuously checks the status of the checksum calculation task. Once all checksum calculations are completed, the checksum comparison is executed by ObGlobalIndexBuilder::try_unique_index_check -> ObIndexChecksumOperator::check_column_checksum.
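One way to see why comparing column checksums can reveal a uniqueness violation is the toy model below. It rests on an assumption that the text does not spell out: rows with the same unique index key collapse into a single index row, so the index column checksum then differs from the primary table's. The checksum formula and all function names are hypothetical.

```python
import zlib


def column_checksum(values):
    """Order-independent checksum of one column (hypothetical formula)."""
    total = 0
    for value in values:
        total = (total + zlib.crc32(repr(value).encode())) & 0xFFFFFFFF
    return total


def unique_index_rows(primary_rows):
    """Unique index keyed by (c2, hidden key); rows with the same key collapse."""
    index = {}
    for pk, c2 in primary_rows:
        hidden = pk if c2 is None else None  # hidden key rule for null values
        index[(c2, hidden)] = pk
    return index


def check_unique(primary_rows):
    """Compare the c2 checksum of the primary table with that of the index."""
    primary_sum = column_checksum(c2 for _, c2 in primary_rows)
    index_sum = column_checksum(key[0] for key in unique_index_rows(primary_rows))
    return primary_sum == index_sum
```

With rows (1, "a") and (2, "a"), the index keeps a single ("a", None) entry, the two checksums differ, and the check fails. Rows with null values keep distinct keys and pass.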

4.5 Index Status Changes

If all the preceding steps are completed, the ObGlobalIndexBuilder::try_handle_index_build_take_effect function is used to make the index take effect. The schema status of the index table is modified to INDEX_STATUS_AVAILABLE. After the central obs identifies this status, it returns a build success to the client session.

If any of the preceding steps fails, the index table status is changed to INDEX_STATUS_INDEX_ERROR instead. After the central obs identifies the status, it returns an index build failure to the client session.

4.6 Intermediate Result Cleanup

After the index build process ends, the intermediate state cleanup must be performed whether the result is successful or failed, including clearing the intermediate result of SQL execution, releasing snapshots, and cleaning up internal tables.

Code path:

ObGlobalIndexBuilder::try_handle_index_build_finish -> clear_intermediate_result -> ObIndexSSTableBuilder::clear_interm_result
                                                                                 -> release_snapshot

5. Build a Local Index

The RS control process of local indexes is relatively simple because most of the work does not happen on the RS side.

Code path:

ObRSBuildIndexTask::process -> wait_trans_end -> ObIndexWaitTransStatus::get_wait_trans_status
                                              -> calc_snapshot_version
                                              -> acquire_snapshot
                            -> wait_build_index_end -> report_index_status
                            -> report_index_status
                            -> release_snapshot

5.1 Task Trigger

As the partitions of the local index are bound one-to-one to the partitions of the primary table, most of the local index build work happens on the obs where the partitions of the primary table are located. The obs triggers the task of building a local index by monitoring the DDL changes of each tenant. After the schema of the index table is published, the obs where the primary table is located refreshes the schema and initiates the local index build task.

Code path:

ObTenantDDLCheckSchemaTask::process -> process_schedule_build_index_task -> get_candidate_tables
                                                                         -> find_build_index_partitions
                                                                         -> generate_schedule_index_task -> ObBuildIndexScheduler::push_task(ObBuildIndexScheduleTask)

ObTenantDDLCheckSchemaTask finds the partition_key values for which an index needs to be built, generates an ObBuildIndexScheduleTask for each, and puts it into the ObDDLTaskExecutor of ObBuildIndexScheduler for execution. This executor has four threads, and the queue length is limited by memory. The maximum memory of the task queue is 1GB.

How does this monitoring task come about? When the core service partition_service of an obs starts, the sub-service ObBuildIndexScheduler is activated. ObBuildIndexScheduler has a scheduled task: ObCheckTenantSchemaTask, which continuously generates the ObTenantDDLCheckSchemaTask of each tenant and is also executed in the ObDDLTaskExecutor of ObBuildIndexScheduler. Please see ObCheckTenantSchemaTask::runTimerTask for more information.

5.2 Local Index Build

Code path:

ObBuildIndexScheduleTask::process -> check_partition_need_build_index
                                  -> wait_trans_end -> check_trans_end -> ObPartitionService::check_schema_version_elapsed
                                                    -> report_trans_status
                                  -> wait_snapshot_ready -> get_snapshot_version
                                                         -> check_rs_snapshot_elapsed -> ObTsMgr::wait_gts_elapse
                                                                                      -> ObPartitionService::check_ctx_create_timestamp_elapsed
                                  -> choose_build_index_replica -> get_candidate_source_replica
                                                                -> check_need_choose_replica
                                                                -> ObIndexTaskTableOperator::generate_new_build_index_record
                                  -> wait_choose_or_build_index_end -> get_candidate_source_replica
                                                                    -> check_need_schedule_dag
                                                                    -> schedule_dag -> ObPartitionStorage::get_build_index_param
                                                                                    -> ObPartitionStorage::get_build_index_context
                                                                                    -> ObBuildIndexDag::init
                                                                                    -> alloc_index_prepare_task -> ObIndexPrepareTask::init
                                                                                                                -> ObIDag::add_task
                                                                                    -> ObDagScheduler::add_dag
                                  -> copy_build_index_data -> send_copy_replica_rpc
                                                           -> ObPartitionService::check_single_replica_major_sstable_exist
                                  -> unique_index_checking -> ObUniqueCheckingDag::init
                                                           -> ObUniqueCheckingDag::alloc_local_index_task_callback
                                                           -> ObUniqueCheckingDag::alloc_unique_checking_prepare_task -> ObUniqueCheckingPrepareTask::init
                                                                                                                      -> ObIDag::add_task
                                                           -> ObDagScheduler::add_dag
                                  -> wait_report_status -> check_all_replica_report_build_index_end

The overall process of building a local index is similar to that of a global index. After the relevant transactions end and the snapshot point is available, one copy is selected to build a single replica. After the single replica is built and the baseline data is copied to the other replicas, uniqueness checks are performed before the index takes effect. The construction of baseline data is completed through ObBuildIndexDag, and the uniqueness check is completed by ObUniqueCheckingDag.
