
An Interpretation of the Source Code of OceanBase (3): Life of Partition

This article introduces the storage layer of OceanBase and the life of a partition.

By Zhuweng, OceanBase R&D Director


The source code is the most fundamental part of OceanBase. This series of articles walks through the source code to help users understand how the database really works. The second article of this series (Interpretation of the Source Code of OceanBase (2): Life of SQL) followed the main execution path of an SQL statement in OceanBase, from receiving the request and processing it to returning the result to the client, and introduced the modules of the OceanBase SQL engine.

The third article of this series introduces the storage layer of OceanBase.


Let’s assume a partition is a first-class citizen in OceanBase. A table consists of one or more partitions. Logically, a partition is the unit of horizontal scaling; physically, it is the basic unit of data organization in OceanBase. A partition is self-contained: each one has its own election and leader, its own transaction log, and its own data storage and indexes. A partition can be large or small, but its size is usually measured in gigabytes. RootService (RS) can distribute partitions across multiple nodes to achieve load balancing (LB).

Each partition has a unique identifier, OBPartitionKey, which consists of the tenant id, table id, and partition id. On each node, the storage layer stores and organizes partition replicas by this key. A single node can hold tens of thousands of replicas. Replicas come in several types. For example, the Paxos protocol is not used between a read-only replica and the primary replica. The two types of replicas do not exchange messages in step with each other; what is kept in step between them is the transaction log.
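To make the role of the key concrete, here is a minimal sketch of a three-part key used to look up replicas on a node. It is not OceanBase code: PartitionKey, PartitionKeyHash, and ReplicaRegistry are hypothetical names, the field types are assumptions, and the replica type is reduced to a free-form label.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for OBPartitionKey: only the three parts named in
// the article. Field types are assumptions.
struct PartitionKey {
  uint64_t tenant_id;
  uint64_t table_id;
  int64_t partition_id;

  bool operator==(const PartitionKey &other) const {
    return tenant_id == other.tenant_id && table_id == other.table_id &&
           partition_id == other.partition_id;
  }
};

// Mixes all three fields so the key can index a hash map.
struct PartitionKeyHash {
  std::size_t operator()(const PartitionKey &key) const {
    std::size_t h = std::hash<uint64_t>()(key.tenant_id);
    h = h * 31 + std::hash<uint64_t>()(key.table_id);
    h = h * 31 + std::hash<int64_t>()(key.partition_id);
    return h;
  }
};

// Hypothetical per-node registry: key -> replica type label.
class ReplicaRegistry {
 public:
  // Returns false if this node already holds a replica for the key.
  bool add(const PartitionKey &key, const std::string &replica_type) {
    return replicas_.emplace(key, replica_type).second;
  }

  // Returns nullptr when the node holds no replica for the key.
  const std::string *find(const PartitionKey &key) const {
    auto it = replicas_.find(key);
    return it == replicas_.end() ? nullptr : &it->second;
  }

  std::size_t size() const { return replicas_.size(); }

 private:
  std::unordered_map<PartitionKey, std::string, PartitionKeyHash> replicas_;
};
```

A hash map is only one plausible choice here; it is used because a node that holds tens of thousands of replicas needs lookups by key to stay cheap.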


storage/ob_partition_service.h is the main entry point to the storage layer. It exposes all of the storage layer's Remote Procedure Call (RPC) services, such as creating or deleting replicas of a partition. It is also the entry point for every partition-level interface on a node, including the transaction control interfaces and the interfaces for reading and writing a partition.

As mentioned earlier, RS executes Data Definition Language (DDL) statements such as creating a table or adding a partition. After it has selected the target nodes, RS calls the create_xxx interfaces here over RPC.

If you drill down layer by layer from this entry point, you can reach every storage structure. Be aware that the storage layer contains a large number of similar-looking interfaces, which can be confusing. Each partition is an index-organized table whose index structure is a multi-layer log-structured merge-tree (LSM-Tree). The number of layers can be adjusted through parameters. The two main layers are the memtable in memory (see storage/memtable) and the major sstable on disk (see storage/blocksstable). Each major compaction merges the data in the memtable and the minor sstables with the previous major sstable to generate a new version of the major sstable. While the merge is running, multiple versions of the major sstable can serve requests at the same time. Together, the multi-layer storage structures of a partition replica make up the class OBPartitionStore.
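The layering and the merge step can be sketched in a few lines. This is a simplified model under stated assumptions, not the OBPartitionStore implementation: every layer is an in-memory std::map, deletes are recorded as tombstones, there is no multi-versioning or concurrency, and all names (Cell, PartitionStoreSketch, freeze_to_minor, and so on) are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// One row version; deleted == true marks a tombstone.
struct Cell {
  bool deleted;
  std::string value;
};

// Sorted by primary key, like an index-organized table.
using Table = std::map<int64_t, Cell>;

class PartitionStoreSketch {
 public:
  void put(int64_t key, const std::string &value) {
    memtable_[key] = Cell{false, value};
  }
  void remove(int64_t key) { memtable_[key] = Cell{true, ""}; }

  // Turns the current memtable into a new minor sstable.
  void freeze_to_minor() {
    if (memtable_.empty()) return;
    minor_sstables_.push_back(memtable_);
    memtable_.clear();
  }

  // Reads newest to oldest: memtable, minor sstables, then major sstable.
  bool get(int64_t key, std::string *out) const {
    const Cell *cell = lookup(memtable_, key);
    for (auto it = minor_sstables_.rbegin();
         cell == nullptr && it != minor_sstables_.rend(); ++it) {
      cell = lookup(*it, key);
    }
    if (cell == nullptr) cell = lookup(major_sstable_, key);
    if (cell == nullptr || cell->deleted) return false;
    *out = cell->value;
    return true;
  }

  // Merges minor sstables and the memtable into the previous major
  // sstable, producing the next major version.
  void major_compaction() {
    Table merged = major_sstable_;
    for (const Table &minor : minor_sstables_) apply(minor, &merged);
    apply(memtable_, &merged);
    major_sstable_.swap(merged);
    minor_sstables_.clear();
    memtable_.clear();
    ++major_version_;
  }

  int major_version() const { return major_version_; }
  std::size_t major_row_count() const { return major_sstable_.size(); }

 private:
  static const Cell *lookup(const Table &table, int64_t key) {
    Table::const_iterator it = table.find(key);
    return it == table.end() ? nullptr : &it->second;
  }

  // Newer data overwrites older data; tombstones remove the row.
  static void apply(const Table &newer, Table *merged) {
    for (const auto &entry : newer) {
      if (entry.second.deleted) {
        merged->erase(entry.first);
      } else {
        (*merged)[entry.first] = entry.second;
      }
    }
  }

  Table memtable_;
  std::vector<Table> minor_sstables_;  // oldest first
  Table major_sstable_;
  int major_version_ = 0;
};
```

The sketch builds the new major sstable in a separate table and swaps it in at the end. That mirrors, very loosely, the idea that the old version stays readable while the merge is in progress.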

Now, do you still think the partition is the first-class citizen of OceanBase? In fact, the Partition Group (PG) is. A PG is a group of closely related partitions that share the same transaction log stream and memtable. This feature is generally not used externally, so a PG usually contains just one partition. This is why the storage structures and operation objects here carry the odd-looking PG prefix, as in OBPGKey.

For a clearer picture, see the class diagram I drew of the storage structure:

[Figure: class diagram of the storage structure]

A diagram speaks louder than words.


sql/engine/table contains the operator that performs table scans in the SQL physical execution plan. It obtains an iterator through the table_scan interface of ob_partition_service.h, and must call revert_scan_iter when the iteration ends. This is the data access path provided by the storage layer. The DML operators (sql/engine/dml) additionally use the insert_rows, delete_rows, update_rows, and lock_rows interfaces.
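The acquire-then-revert pattern is easy to get wrong on error paths, so it may help to see it in miniature. The sketch below borrows only the names table_scan and revert_scan_iter from the article; the signatures, the RowIterator type, get_next_row, and the ScanGuard wrapper are all assumptions made for illustration and do not reflect the real interface.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical row iterator handed out by the storage layer.
class RowIterator {
 public:
  explicit RowIterator(const std::vector<std::string> *rows)
      : rows_(rows), pos_(0) {}

  // Returns false once the scan is exhausted.
  bool get_next_row(std::string *row) {
    if (pos_ >= rows_->size()) return false;
    *row = (*rows_)[pos_++];
    return true;
  }

 private:
  const std::vector<std::string> *rows_;
  std::size_t pos_;
};

// Hypothetical stand-in for the partition service. Signatures are assumed.
class PartitionServiceSketch {
 public:
  explicit PartitionServiceSketch(std::vector<std::string> rows)
      : rows_(std::move(rows)), open_iters_(0) {}

  RowIterator *table_scan() {
    ++open_iters_;
    return new RowIterator(&rows_);
  }

  // Every iterator from table_scan must be handed back exactly once.
  void revert_scan_iter(RowIterator *iter) {
    delete iter;
    --open_iters_;
  }

  int open_iterators() const { return open_iters_; }

 private:
  std::vector<std::string> rows_;
  int open_iters_;
};

// RAII guard (an addition of this sketch): reverts the iterator on every
// exit path, including early returns and exceptions.
class ScanGuard {
 public:
  explicit ScanGuard(PartitionServiceSketch *svc)
      : svc_(svc), iter_(svc->table_scan()) {}
  ~ScanGuard() { svc_->revert_scan_iter(iter_); }
  ScanGuard(const ScanGuard &) = delete;
  ScanGuard &operator=(const ScanGuard &) = delete;

  RowIterator *operator->() { return iter_; }

 private:
  PartitionServiceSketch *svc_;
  RowIterator *iter_;
};

// Plays the role of a table scan operator: drains the iterator.
std::vector<std::string> scan_all(PartitionServiceSketch *svc) {
  std::vector<std::string> out;
  ScanGuard scan(svc);
  std::string row;
  while (scan->get_next_row(&row)) out.push_back(row);
  return out;
}
```

Wrapping the pair in a guard is a design choice of this sketch, not something the article attributes to OceanBase; it simply makes the "revert at the end" rule hard to forget.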

When a partition or a table is deleted, the OBPGPartition and all of its storage structures are deleted as well. However, the resources are not released immediately. They are reclaimed only after every reference to them has been dropped and nothing is using them any longer.
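Deferred release of this kind is commonly built on reference counting. The following sketch shows the idea with std::shared_ptr, which is a swapped-in standard-library technique and not a claim about how OceanBase tracks references; PartitionStorage and PartitionMapSketch are hypothetical names.

```cpp
#include <cstdint>
#include <map>
#include <memory>

// Hypothetical storage object. The flag lets a caller observe the moment
// the resources are actually released.
struct PartitionStorage {
  explicit PartitionStorage(bool *released) : released_(released) {}
  ~PartitionStorage() { *released_ = true; }
  bool *released_;
};

// Hypothetical per-node map of live partitions.
class PartitionMapSketch {
 public:
  void create(int64_t id, bool *released_flag) {
    partitions_[id] = std::make_shared<PartitionStorage>(released_flag);
  }

  // Hands out a counted reference; returns nullptr if the partition has
  // already been dropped.
  std::shared_ptr<PartitionStorage> get(int64_t id) const {
    auto it = partitions_.find(id);
    return it == partitions_.end() ? nullptr : it->second;
  }

  // Removes the partition from the map. The storage itself is freed only
  // when the last outstanding reference goes away.
  bool drop(int64_t id) { return partitions_.erase(id) > 0; }

 private:
  std::map<int64_t, std::shared_ptr<PartitionStorage>> partitions_;
};
```

In this model a reader that obtained a reference before the drop can finish its work safely, while new lookups for the dropped partition fail right away.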

This is the life of a partition. The fourth article of this series will analyze the external interfaces of database transactions in OceanBase.
