By Wenqu and Haiqian
The sixth article of this series (OceanBase Source Code Interpretation (6): Detailed Explanation of Storage Engines) explained the OceanBase storage engine in detail and answered questions about the OceanBase database.
The seventh article of this series briefly walks through the index build process of OceanBase and explains the relevant code.
First of all, in a general database, what are the semantics of an index table?
An index table is independent of the primary table (also called the data table). It consists of redundant, ordered data and is created to speed up certain queries. It can accelerate queries because it is sorted by index key. If a query condition matches a prefix of the index key, the corresponding rows can be found quickly through binary search. The rows of the index table include the primary key information, so the complete rows of the primary table can then be found quickly through the primary keys. This process is also called going back to the table.
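The lookup described above can be illustrated with a minimal sketch. It is only a toy model, not OceanBase code: the table contents and the helper names `build_index` and `lookup` are assumptions made for the example, and the index is a plain sorted list of (index key, primary key) pairs.

```python
from bisect import bisect_left

# Toy primary table: primary key -> full row (c1 is the primary key).
primary = {
    1: {"c1": 1, "c2": "pear", "c3": 10},
    2: {"c1": 2, "c2": "apple", "c3": 20},
    3: {"c1": 3, "c2": "pear", "c3": 30},
    4: {"c1": 4, "c2": "fig", "c3": 40},
}

def build_index(table, column):
    """Return index rows (index key, primary key), sorted by index key."""
    return sorted((row[column], pk) for pk, row in table.items())

def lookup(index, table, value):
    """Binary-search the index, then go back to the primary table by primary key."""
    pos = bisect_left(index, (value,))  # first index row whose key is >= value
    rows = []
    while pos < len(index) and index[pos][0] == value:
        rows.append(table[index[pos][1]])  # back to the table
        pos += 1
    return rows

i1 = build_index(primary, "c2")
matches = lookup(i1, primary, "pear")
```

Here `matches` holds the full rows whose `c2` equals "pear", found without scanning the whole primary table.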
After knowing the semantics of the index table, how do we create an index?
The index table (like the primary table) has its schema, memory structure, and persistent data (usually stored on the local disk). It may also have location information in distributed scenarios. Creating an index means creating the schema of the index table, and then creating its memory structure and persistent data at a certain location.
During index creation, we do not want to affect normal reads and writes on the primary table, because the business stays online during the index build. Different vendors implement online indexing in different ways. This article introduces how OceanBase implements it. Readers interested in other solutions can refer to the relevant documents.
First, let's look at the process of index build from the user's perspective. For example, a user sends an index build statement "create index i1 on t1(c2)" on a session, and the user waits on the current session until the index construction succeeds or fails.
What does the process look like from the observer's perspective? First, the text of this statement is sent to a randomly chosen observer (obs). The obs that receives the statement is called the central obs. As with other statements, the parser and resolver first identify the index build statement as a DDL statement and then parse it into a structure like ObCreateIndexArg. OceanBase sends DDL statements to RootService (RS) for processing, so the central obs sends an index build RPC request to RS. This RPC request carries
After receiving the request, RS processes it through the ObRootService::create_index function. After RS completes the necessary synchronous processing, it returns the RPC response to the central obs, but the index build is not yet complete. RS advances the index build through asynchronous tasks. After receiving the response from RS, the central obs keeps querying the schema status of the index table to obtain the result of the index build. If the build succeeds, it reports success to the client. If the build fails, it sends the error code to the client.
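The waiting behavior of the central obs can be sketched as a simple polling loop. This is a hedged illustration rather than the actual implementation: `wait_for_index`, `get_status`, the poll interval, and the poll limit are all hypothetical, and only the three status names are taken from this article.

```python
import time

UNAVAILABLE = "INDEX_STATUS_UNAVAILABLE"
AVAILABLE = "INDEX_STATUS_AVAILABLE"
INDEX_ERROR = "INDEX_STATUS_INDEX_ERROR"

def wait_for_index(get_status, poll_interval=0.0, max_polls=1000):
    """Poll the index schema status until the build succeeds or fails.

    get_status is a hypothetical callable returning the current status.
    Returns True on success and False on failure.
    """
    for _ in range(max_polls):
        status = get_status()
        if status == AVAILABLE:
            return True
        if status == INDEX_ERROR:
            return False
        time.sleep(poll_interval)  # still UNAVAILABLE: the build is in progress
    raise TimeoutError("index build did not reach a final status")

# Usage: a fake status source that becomes available on the third poll.
_statuses = iter([UNAVAILABLE, UNAVAILABLE, AVAILABLE])
build_ok = wait_for_index(lambda: next(_statuses))
```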
As mentioned above, RS completes some synchronization processing before sending the feedback to central obs. In this section, we take a look at the specific process of this part.
ObRootService::create_index
-> ObIndexBuilder::create_index
-> ObIndexBuilder::do_create_index
-> ObIndexBuilder::do_create_local_index
-> ObIndexBuilder::do_create_global_index
The call path above performs some defensive checks. For example, OceanBase does not support building indexes on system tables or on tables in the recycle bin, and indexing is not allowed if the number of indexes on a table exceeds the upper limit. After the checks, the local or global index build process is selected based on the index type.
What is the difference between a global index and a local index? The main difference is that local indexes are partition-level: partitions of the index table correspond one to one with partitions of the primary table. Global indexes are table-level: partitions of the global index table have no corresponding relationship with the partitions of the primary table.
In a word, local is at the partition level, and global is at the table level. For example, table t1 has two hash partitions. If a local index i1 is created, it must also have two partitions. The first partition of i1 is the index of the first partition of t1, and the second partition of i1 is the index of the second partition of t1. If you create a global index i2 on t1, i2 can have one partition or multiple partitions, and its partitions do not correspond to those of the primary table.
As the partitions of a local index correspond one to one with the partitions of the primary table, OceanBase binds them closely together. This way, the location information of the primary table partitions and the index table partitions is consistent (they are on the same machine), which avoids cross-machine distributed transactions. Therefore, when selecting the index build path above, there is an optimization for the global index: if both the primary table and the index table of a global index are non-partitioned, the global index can follow the local index build process.
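The difference in partition correspondence can be made concrete with a small model. The sketch below is an assumption-laden illustration, not OceanBase code: the partitioning functions, the range bounds, and the sample rows are all made up, and the global index is range-partitioned by its own key purely for demonstration.

```python
# Primary table t1 rows as (c1, c2); c1 is the primary key.
rows = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]

def hash_partition(rows, n):
    """Split rows into n hash partitions by primary key."""
    parts = [[] for _ in range(n)]
    for c1, c2 in rows:
        parts[c1 % n].append((c1, c2))
    return parts

def local_index(primary_parts):
    """One index partition per primary partition, covering only that partition's rows."""
    return [sorted((c2, c1) for c1, c2 in part) for part in primary_parts]

def global_index(rows, bounds):
    """Index partitioned by its own key ranges, independent of the primary table."""
    parts = [[] for _ in range(len(bounds) + 1)]
    for c1, c2 in rows:
        slot = sum(c2 >= b for b in bounds)  # number of bounds at or below c2
        parts[slot].append((c2, c1))
    return [sorted(p) for p in parts]

t1 = hash_partition(rows, 2)
i1 = local_index(t1)                       # two partitions, aligned with t1
i2 = global_index(rows, bounds=["b", "c"])  # three partitions, unrelated to t1
```

In this model each partition of `i1` references only primary keys from the matching partition of `t1`, while a single partition of `i2` can reference rows from both partitions of `t1`.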
Call path of key functions:
ObIndexBuilder::do_create_global_index
-> ObIndexBuilder::generate_schema
-> ObDDLService::create_global_index
-> ObDDLService::generate_global_index_locality_and_primary_zone
-> ObDDLService::create_user_table
-> ObDDLService::create_table_in_trans
-> ObDDLOperator::create_table
-> ObDDLService::create_table_partitions
-> ObDDLService::publish_schema
-> ObIndexBuilder::submit_build_global_index_task
-> ObGlobalIndexBuilder::submit_build_global_index_task
ObIndexBuilder::generate_schema is responsible for generating the basic information of the index table schema. Most of the other information comes from the primary table, so the main concern for the index table is its column information. Normal index tables only include index columns and primary key columns, with duplicate primary key columns omitted. A unique index must contain index columns, hidden key columns, and primary key columns. What is the hidden key? The hidden key solves the comparison problem that arises when the value of an index column is null.
In SQL semantics, null and null are not equal, but when compared in code, two null values are equal. If there is a null in the index columns, the actual value of the primary key is stored in the hidden key column. If the index columns are not null, null is stored in the hidden key column. With the hidden primary key column taking part in index key comparison, the semantics (null and null are not equal) are preserved. Note that generate_schema only generates the in-memory schema object of the index table. At this point, the index table is not available, so the state of the index table is set to INDEX_STATUS_UNAVAILABLE in the schema.
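The hidden key rule can be shown with a short sketch. It models a unique index on a single column `c2` and uses Python's `None` for SQL null (two `None` values compare equal in code, just as described above). The function names are hypothetical and this is not the OceanBase implementation.

```python
def unique_index_key(c2, pk):
    """Build the comparison key (index column, hidden key) for a unique index.

    If the index column is null, the hidden key carries the primary key,
    so two null rows never compare equal. Otherwise the hidden key is null.
    """
    hidden = pk if c2 is None else None
    return (c2, hidden)

def violates_unique(rows):
    """rows is a list of (pk, c2). Return True if two rows share an index key."""
    seen = set()
    for pk, c2 in rows:
        key = unique_index_key(c2, pk)
        if key in seen:
            return True
        seen.add(key)
    return False

nulls_ok = violates_unique([(1, None), (2, None), (3, "x")])  # nulls do not conflict
dup_value = violates_unique([(1, "x"), (2, "x")])             # real duplicate
```

Without the hidden key, the two null rows would produce the same key `(None,)` and would be reported as a conflict, which contradicts SQL semantics.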
After the index table schema is generated, you need to write the schema to the internal table. This step is done by
Then, the memory structure of the index table must also be created on the relevant machines. The location information of the index table is generated through ObDDLService::generate_global_index_locality_and_primary_zone, and RPCs are sent to the target machines through ObDDLService::create_table_partitions to notify them to create the memory structure of each partition of the index table, including the memtable, the table_store, and the mapping from partition_key to table_store. Then, other machines are notified to refresh the schema through
After the schema and memory structure of the index table are created, the control task for global index data completion is submitted to the queue through ObGlobalIndexBuilder::submit_build_global_index_task. Later, this control task advances the data completion process of the global index.
When the control task is submitted, submit_build_global_index_task creates a task record in __all_index_build_stat, and the status of the control task is updated to the
The global index control task is executed by ObGlobalIndexBuilder. This thread pool has only one thread, and the queue length is limited by memory (no memory upper limit is set). The entry point for task execution is ObGlobalIndexBuilder::run3 -> ObGlobalIndexBuilder::try_drive.
Call path of key functions:
ObIndexBuilder::do_create_local_index
-> ObIndexBuilder::generate_schema
-> ObDDLService::create_user_table
-> ObDDLService::create_table_in_trans
-> ObDDLOperator::create_table
-> ObDDLService::create_table_partitions
-> ObDDLService::publish_schema
-> ObIndexBuilder::submit_build_local_index_task
-> ObRSBuildIndexScheduler::push_task
The procedures for generating schema and creating memory objects for local indexes are almost the same as for global indexes. The only difference is that local indexes do not need to generate index table location information. Other processes are not described here.
After the schema and memory structure of the index table are created, the control task ObRSBuildIndexTask of the local index is put into the queue through ObRSBuildIndexScheduler::push_task. At the same time, the internal table __all_index_build_stat is updated.
ObDDLTaskExecutor is responsible for executing the control task of the local index. This executor has only one thread, and the queue length is limited by memory (the upper limit of memory is 1GB). The entry point for task execution is ObDDLTaskExecutor::run1 -> ObRSBuildIndexTask::process.
The control task of the global index, ObGlobalIndexTask, uses a simple state machine that executes the corresponding function for each task state. The overall idea is to build the baseline data on one replica of the index table first, copy the baseline data to the other replicas, perform the necessary consistency and uniqueness checks, and then let the index take effect.
process_function                            task_status
----------------------------------------------------------------------------
ObGlobalIndexBuilder::try_drive
  -> try_build_single_replica               GIBS_BUILD_SINGLE_REPLICA
  -> try_copy_multi_replica                 GIBS_MULTI_REPLICA_COPY
  -> try_unique_index_calc_checksum         GIBS_UNIQUE_INDEX_CALC_CHECKSUM
  -> try_unique_index_check                 GIBS_UNIQUE_INDEX_CHECK
  -> try_handle_index_build_take_effect     GIBS_INDEX_BUILD_TAKE_EFFECT
  -> try_handle_index_build_failed          GIBS_INDEX_BUILD_FAILED
  -> try_handle_index_build_finish          GIBS_INDEX_BUILD_FINISH
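The state advancement can be sketched as a dispatch on the task status. The status names below come from the table above, but the transition order, the handling of failures, and the helper names `try_drive`, `run`, and `step_ok` are assumptions made for this illustration, not a description of the real ObGlobalIndexBuilder code.

```python
from enum import Enum

class Status(Enum):
    GIBS_BUILD_SINGLE_REPLICA = 1
    GIBS_MULTI_REPLICA_COPY = 2
    GIBS_UNIQUE_INDEX_CALC_CHECKSUM = 3
    GIBS_UNIQUE_INDEX_CHECK = 4
    GIBS_INDEX_BUILD_TAKE_EFFECT = 5
    GIBS_INDEX_BUILD_FAILED = 6
    GIBS_INDEX_BUILD_FINISH = 7

# Assumed happy-path order of the working states.
NEXT_ON_SUCCESS = {
    Status.GIBS_BUILD_SINGLE_REPLICA: Status.GIBS_MULTI_REPLICA_COPY,
    Status.GIBS_MULTI_REPLICA_COPY: Status.GIBS_UNIQUE_INDEX_CALC_CHECKSUM,
    Status.GIBS_UNIQUE_INDEX_CALC_CHECKSUM: Status.GIBS_UNIQUE_INDEX_CHECK,
    Status.GIBS_UNIQUE_INDEX_CHECK: Status.GIBS_INDEX_BUILD_TAKE_EFFECT,
}

def try_drive(status, step_ok):
    """Advance one step. step_ok(status) stands in for the real work of a state."""
    if status in NEXT_ON_SUCCESS:
        if step_ok(status):
            return NEXT_ON_SUCCESS[status]
        return Status.GIBS_INDEX_BUILD_FAILED
    # Take-effect, failed, and finish all end in the cleanup state.
    return Status.GIBS_INDEX_BUILD_FINISH

def run(step_ok):
    """Drive a task from its first state to the finish state and record the trace."""
    status = Status.GIBS_BUILD_SINGLE_REPLICA
    trace = [status]
    while status is not Status.GIBS_INDEX_BUILD_FINISH:
        status = try_drive(status, step_ok)
        trace.append(status)
    return trace

success_trace = run(lambda s: True)
failure_trace = run(lambda s: s is not Status.GIBS_UNIQUE_INDEX_CHECK)
```

Both traces end in GIBS_INDEX_BUILD_FINISH, which matches the point made later in this article that cleanup runs whether the build succeeds or fails.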
Single-replica build refers to completing the index table baseline data on one replica. According to the LSM-Tree structure of OceanBase, the baseline data here refers to the major SSTable of the index table.
ObGlobalIndexBuilder::try_build_single_replica
-> launch_new_build_single_replica
-> get_global_index_build_snapshot
-> do_get_associated_snapshot
-> hold_snapshot
-> update_task_global_index_build_snapshot
-> do_build_single_replica
-> ObRootService::submit_index_sstable_build_task
-> drive_this_build_single_replica
-> ObIndexChecksumOperator::check_column_checksum
To build a single replica, you must select a snapshot point such that every DML operation (incremental data) on the primary table after the snapshot point can see the index table. This means DML operations after the snapshot point also modify the index table, even though the index table is not yet available for queries. This write-only behavior is the key to how OceanBase implements an online index build.
After the baseline data (existing data) is built based on this snapshot point, LSM-Tree queries fuse the multiple layers of data, so idempotence guarantees the integrity of the index table data. Let's assume the schema_version is v1 when the index table is created. To get the snapshot point, you need to wait until all transactions that depend on schema_version <= v1 are completed. The do_get_associated_snapshot function sends RPCs to the leaders of the primary table partitions to ask whether these transactions are complete. The obs that receives the request processes it through the ObService::check_schema_version_elapsed interface. do_get_associated_snapshot waits for all RPCs to return through wait_all. Note: The RPCs here are sent in batches and are synchronous, so a very large number of partitions may block the thread that drives index tasks.
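Why the write-only phase plus a snapshot-based baseline yields a complete index can be seen in a small model. The sketch below is a simplification under stated assumptions: timestamps are plain integers, the primary table is a log of `(ts, pk, c2)` writes where `c2=None` means delete, and every function name is hypothetical. It shows that replaying index-side changes on top of the baseline is idempotent even when some changes are already contained in the baseline.

```python
def primary_state(log, upto):
    """Replay (ts, pk, c2) writes with ts <= upto; c2=None means delete."""
    state = {}
    for ts, pk, c2 in sorted(log):
        if ts > upto:
            break
        if c2 is None:
            state.pop(pk, None)
        else:
            state[pk] = c2
    return state

def build_baseline(log, snapshot):
    """Index rows (c2, pk) built from the primary table as of the snapshot."""
    return sorted((c2, pk) for pk, c2 in primary_state(log, snapshot).items())

def index_deltas(log, since):
    """Index-side changes written by DML with ts > since (the write-only phase)."""
    state = primary_state(log, since)
    deltas = []  # (index key, present) in timestamp order
    for ts, pk, c2 in sorted(log):
        if ts <= since:
            continue
        old = state.get(pk)
        if old is not None:
            deltas.append(((old, pk), False))  # remove the old index row
        if c2 is None:
            state.pop(pk, None)
        else:
            state[pk] = c2
            deltas.append(((c2, pk), True))    # add the new index row
    return deltas

def fused_index(baseline, deltas):
    """Fuse the baseline with newer changes; the last change per key wins."""
    merged = {key: True for key in baseline}
    for key, present in deltas:
        merged[key] = present
    return sorted(key for key, present in merged.items() if present)

log = [(1, 1, "a"), (2, 2, "b"), (3, 1, "c"), (4, 3, "a"), (5, 2, None)]
write_only_since, snapshot = 2, 3  # index writes start before the snapshot
fused = fused_index(build_baseline(log, snapshot), index_deltas(log, write_only_since))
expected = build_baseline(log, 5)  # index of the final primary table state
```

In this run the change at timestamp 3 is present both in the baseline and in the replayed changes, and `fused` still equals `expected`.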
You need to hold the snapshot point to ensure the selected snapshot point is not released while the single replica is being built. If the snapshot is held for too long, the number of table_store may explode. Then, the selected snapshot point is recorded in the internal table __all_index_build_stat. Finally, a build task, ObIndexSSTableBuildTask, is submitted for the baseline data of the index table.
After the baseline data completion task is submitted, its status is checked through drive_this_build_single_replica. If the baseline data has been built, the data consistency of the primary table and index table is checked through checksums.
ObIndexSSTableBuildTask is executed by the IdxBuild thread pool. The task queue length is 4096, and the number of threads is 16.
The execution process and code path of ObIndexSSTableBuildTask are as follows:
ObIndexSSTableBuildTask::process
-> ObIndexSSTableBuilder::init
-> ObIndexSSTableBuilder::build
-> ObCommonSqlProxy::execute
-> ObInnerSQLConnection::execute
-> ObInnerSQLConnection::query
-> ObInnerSQLConnection::do_query
-> ObIndexSSTableBuilder::ObBuildExecutor::execute
-> ObIndexSSTableBuilder::build
-> ObIndexSSTableBuilder::ObBuildExecutor::process_result
-> ObResultSet::get_next_row
-> ObGlobalIndexBuilder::on_build_single_replica_reply
ObIndexSSTableBuilder::build is executed synchronously, so a maximum of 16 baseline completion tasks run in the system at the same time. After the execution is completed, the status of the baseline completion task is changed by
The code path above seems complicated, but in the end a physical execution plan is constructed through ObIndexSSTableBuilder::build and executed through ObResultSet::get_next_row. The following code path shows how the physical execution plan is generated. The constants starting with PHY are the types of physical operators.
ObIndexSSTableBuilder::build
-> generate_build_param
-> split_ranges
-> store_build_param
-> gen_data_scan          PHY_TABLE_SCAN_WITH_CHECKSUM PHY_UK_ROW_TRANSFORM
-> gen_data_exchange      PHY_DETERMINATE_TASK_TRANSMIT PHY_TASK_ORDER_RECEIVE
-> gen_build_macro        PHY_SORT PHY_APPEND_LOCAL_SORT_DATA
-> gen_macro_exchange     PHY_DETERMINATE_TASK_TRANSMIT PHY_TASK_ORDER_RECEIVE
-> gen_build_sstable      PHY_APPEND_SSTABLE
-> gen_sstable_exchange   PHY_DETERMINATE_TASK_TRANSMIT PHY_TASK_ORDER_RECEIVE
The final physical execution plan is shown below:
coordinator              | ObTaskOrderReceive
transmit                 | ObDeterminateTaskTransmit
append_sstable           | ObTableAppendSSTable
receive                  | ObTaskOrderReceive
transmit_macro_block     | ObDeterminateTaskTransmit
append_local_sort_data   | ObTableAppendLocalSortData
sort                     | ObSort
receive                  | ObTaskOrderReceive
transmit_by_range        | ObDeterminateTaskTransmit
table_scan_with_checksum | ObTableScanWithChecksum
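Read from bottom to top, the plan scans the primary table, routes rows by index key range, sorts each range, and appends the sorted data into the final SSTable. The following sketch mirrors that data flow in a single process. It is a loose model only: the function names, the range bounds, and the in-memory lists are assumptions, and it ignores checksums, macro blocks, and the distributed exchange between operators.

```python
from bisect import bisect_right

def scan_primary(primary):
    """Scan the primary table and emit index rows (c2, pk)."""
    for pk, c2 in primary.items():
        yield (c2, pk)

def transmit_by_range(index_rows, bounds):
    """Route each index row to the bucket that owns its key range."""
    buckets = [[] for _ in range(len(bounds) + 1)]
    for row in index_rows:
        buckets[bisect_right(bounds, row[0])].append(row)
    return buckets

def build_index_sstable(primary, bounds):
    """Sort every range independently, then append the ranges in order."""
    buckets = transmit_by_range(scan_primary(primary), bounds)
    sorted_runs = [sorted(bucket) for bucket in buckets]  # one sort per range
    return [row for run in sorted_runs for row in run]    # append in range order

primary = {1: "pear", 2: "apple", 3: "kiwi", 4: "fig", 5: "grape", 6: "plum"}
sstable = build_index_sstable(primary, bounds=["g", "p"])
```

Because the ranges are disjoint and ordered, concatenating the per-range sorted runs gives a globally sorted result without a final merge step.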
ObGlobalIndexBuilder::try_copy_multi_replica
-> launch_new_copy_multi_replica
-> build_task_partition_sstable_stat
-> generate_task_partition_sstable_array
-> drive_this_copy_multi_replica
-> check_partition_copy_replica_stat
-> build_replica_sstable_copy_task
-> ObCopySSTableTask::build
-> ObRebalanceTaskMgr::add_task
Multi-replica copy is the process of copying the baseline data built in the single-replica build to the other replicas. The actual data copy is done by ObCopySSTableTask. The task is executed by the ObRebalanceTaskMgr of RS. The entry point is ObCopySSTableTask::execute, which sends the copy_sstable_batch RPC. The entry point on the obs that receives the RPC is ObService::copy_sstable_batch. After the baseline data copy task is completed, the obs reports the result to RS. RS executes the callback ObGlobalIndexBuilder::on_copy_multi_replica_reply and updates the status of the multi-replica copy task.
For a unique index, you need to check the uniqueness of the data in index columns. You do not need to perform this check for a non-unique index.
ObGlobalIndexBuilder::try_unique_index_calc_checksum
-> launch_new_unique_index_calc_checksum
-> get_checksum_calculation_snapshot
-> do_get_associated_snapshot
-> do_checksum_calculation
-> build_task_partition_col_checksum_stat
-> send_checksum_calculation_request
-> send_col_checksum_calc_rpc
-> drive_this_unique_index_calc_checksum
Checking uniqueness also requires a snapshot point. After this snapshot point, all DML operations (incremental data) on the primary table can see the baseline of the index table, so uniqueness is checked as part of the DML process. For the data before this snapshot point (existing data), the column checksums of the primary table and the index table are calculated at this snapshot point, and uniqueness is checked by comparing the checksums. The snapshot point must therefore be one at which new transactions on all replicas can see the baseline data. Let's assume the maximum timestamp at which the replicas start to see the baseline data is sstable_ts. You need to wait until the context creation timestamps of all transactions pass sstable_ts. The function get_checksum_calculation_snapshot completes the preceding operations and checks whether the transaction context creation timestamps have passed sstable_ts through the entry:
After the snapshot point is available, an RPC is sent to ask the leaders of the primary table and the index table to calculate the column checksums at the snapshot point. The entry point on the obs that receives the RPC is ObService::calc_column_checksum_request. After the calculation is completed, the obs records the column checksums in the internal table __all_index_checksum and notifies RS through RPC. RS executes the callback ObGlobalIndexBuilder::on_col_checksum_calculation_reply to update the status of the checksum calculation task. drive_this_unique_index_calc_checksum continuously checks the status of the checksum calculation task. If all checksum calculations are completed, the checksum comparison is executed by ObGlobalIndexBuilder::try_unique_index_check -> ObIndexChecksumOperator::check_column_checksum.
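One way to see why comparing column checksums can reveal a uniqueness violation is the toy model below. It rests on an assumption that this article does not state explicitly: rows with the same unique index key collapse into a single index row, so the index table ends up with fewer values than the primary table. The checksum function (a sum of per-value CRC32) and all names are illustrative choices, not the OceanBase algorithm.

```python
import zlib

def column_checksum(values):
    """Order-independent checksum: sum of per-value CRC32, truncated to 64 bits."""
    total = sum(zlib.crc32(repr(v).encode("utf-8")) for v in values)
    return total & 0xFFFFFFFFFFFFFFFF

def build_unique_index(primary):
    """Index rows keyed by c2; a duplicate c2 overwrites the earlier row (assumed)."""
    index = {}
    for pk, c2 in primary.items():
        index[c2] = pk
    return index

def checksums_match(primary, index):
    """Compare the checksums of the c2 column and of the primary key column."""
    c2_ok = column_checksum(primary.values()) == column_checksum(index.keys())
    pk_ok = column_checksum(primary.keys()) == column_checksum(index.values())
    return c2_ok and pk_ok

unique_rows = {1: "a", 2: "b", 3: "c"}
duplicate_rows = {1: "a", 2: "a", 3: "c"}
unique_ok = checksums_match(unique_rows, build_unique_index(unique_rows))
duplicate_ok = checksums_match(duplicate_rows, build_unique_index(duplicate_rows))
```

With unique values the two tables hold the same set of column values and the checksums agree. With a duplicate, one row is missing from the index side and the comparison fails.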
If all the preceding steps are completed, the ObGlobalIndexBuilder::try_handle_index_build_take_effect function makes the index take effect. The schema status of the index table is changed to INDEX_STATUS_AVAILABLE. After the central obs detects this status, it returns a build success to the client session.
If any of the preceding steps fails, the try_handle_index_build_failed function changes the index table status to INDEX_STATUS_INDEX_ERROR. After the central obs detects this status, it returns an index build failure to the client session.
After the index build process ends, the intermediate state cleanup must be performed whether the result is successful or failed, including clearing the intermediate result of SQL execution, releasing snapshots, and cleaning up internal tables.
ObGlobalIndexBuilder::try_handle_index_build_finish -> clear_intermediate_result -> ObIndexSSTableBuilder::clear_interm_result -> release_snapshot
The RS control process of local indexes is relatively simple because most of the work does not happen on the RS side.
ObRSBuildIndexTask::process
-> wait_trans_end
-> ObIndexWaitTransStatus::get_wait_trans_status
-> calc_snapshot_version
-> acquire_snapshot
-> wait_build_index_end
-> report_index_status
-> report_index_status
-> release_snapshot
As the partitions of the local index are bound one to one to the partitions of the primary table, most of the local index build work happens on the obs where the partitions of the primary table are located. An obs triggers the local index build task by monitoring the DDL changes of each tenant. After the schema of the index table is published, the obs (where the primary table is located) refreshes the schema and initiates the local index build task.
ObTenantDDLCheckSchemaTask::process
-> process_schedule_build_index_task
-> get_candidate_tables
-> find_build_index_partitions
-> generate_schedule_index_task
-> ObBuildIndexScheduler::push_task(ObBuildIndexScheduleTask)
ObTenantDDLCheckSchemaTask finds the partition_key for which an index needs to be built, generates an ObBuildIndexScheduleTask, and puts it into the ObDDLTaskExecutor of ObBuildIndexScheduler for execution. This executor has four threads, and the queue length is limited by memory. The maximum memory of the task queue is 1GB.
How does this monitoring task come about? When the core service partition_service of an obs starts, the sub-service ObBuildIndexScheduler is started. ObBuildIndexScheduler has a scheduled task, ObCheckTenantSchemaTask, which continuously generates the ObTenantDDLCheckSchemaTask of each tenant. These tasks are also executed in ObBuildIndexScheduler. Please see ObCheckTenantSchemaTask::runTimerTask for more information.
ObBuildIndexScheduleTask::process
-> check_partition_need_build_index
-> wait_trans_end
-> check_trans_end
-> ObPartitionService::check_schema_version_elapsed
-> report_trans_status
-> wait_snapshot_ready
-> get_snapshot_version
-> check_rs_snapshot_elapsed
-> ObTsMgr::wait_gts_elapse
-> ObPartitionService::check_ctx_create_timestamp_elapsed
-> choose_build_index_replica
-> get_candidate_source_replica
-> check_need_choose_replica
-> ObIndexTaskTableOperator::generate_new_build_index_record
-> wait_choose_or_build_index_end
-> get_candidate_source_replica
-> check_need_schedule_dag
-> schedule_dag
-> ObPartitionStorage::get_build_index_param
-> ObPartitionStorage::get_build_index_context
-> ObBuildIndexDag::init
-> alloc_index_prepare_task
-> ObIndexPrepareTask::init
-> ObIDag::add_task
-> ObDagScheduler::add_dag
-> copy_build_index_data
-> send_copy_replica_rpc
-> ObPartitionService::check_single_replica_major_sstable_exist
-> unique_index_checking
-> ObUniqueCheckingDag::init
-> ObUniqueCheckingDag::alloc_local_index_task_callback
-> ObUniqueCheckingDag::alloc_unique_checking_prepare_task
-> ObUniqueCheckingPrepareTask::init
-> ObIDag::add_task
-> ObDagScheduler::add_dag
-> wait_report_status
-> check_all_replica_report_build_index_end
The overall process of building a local index is similar to that of a global index. After the relevant transactions are completed and the snapshot point is available, one replica is selected for the single-replica build. After the single replica is built and the baseline data is copied to the other replicas, uniqueness checks are performed before the index takes effect. The baseline data is built through ObBuildIndexDag, and the uniqueness check is completed by