By Zongzhi Chen, Manager of Alibaba Cloud RDS Team
In the previous article, we provided an overview of the DuckDB file format and metadata storage, omitting the details of the table storage format due to its complexity. This article provides a detailed analysis of the table storage format based on the DuckDB v1.3.1 source code. A TL;DR summary is available at the end.
Before analyzing table data storage, it is essential to understand the data structures related to the catalog. The catalog is the primary entry point for accessing a DuckDB file, and it manages various CatalogEntry objects in memory. The hierarchy is as follows:
• An AttachedDatabase (corresponding to a database file) contains a Catalog.
• A Catalog (corresponding to the data dictionary) contains multiple Schemas.
• A Schema (similar to a database in MySQL) contains multiple tables, indexes, functions, and other objects.
• The focus of this article, the table, is represented in the catalog as a DuckTableEntry object. It contains a crucial member variable: a pointer to a DataTable class. The following sections will revolve around the DataTable class to explain the table's data structures and its storage format in the file.

From a high-level perspective, table storage is organized into a four-layer hierarchy:
• A Table is horizontally partitioned by rows into multiple Row Groups.
• A Row Group is vertically partitioned by columns into multiple Column Data objects.
• A Column Data object is further horizontally partitioned by rows into multiple Column Segments.
• A Column Segment represents the actual stored data. It typically corresponds to a single 256-KB Data Block but may also share a block with other segments.

Note: The visual ordering of columns in diagrams is for clarity only. In practice, all column data is stored in row order, and each row is identified by a Row ID. This makes it possible to precisely locate all data for a given row, a key dependency for the Segment Tree discussed later.
Mapping these concepts to the source code reveals the following data structures:
• A Table corresponds to a DataTable object, which contains a pointer to a RowGroupCollection. This collection uses a RowGroupSegmentTree to manage the table's RowGroup objects (horizontal partitioning).
• A RowGroup contains an array of ColumnData objects, one for each column in the table (vertical partitioning).
• ColumnData has several subclasses, such as StandardColumnData, ValidityColumnData, and ListColumnData. A StandardColumnData object also contains a ValidityColumnData object. This "validity column" stores the NULL status for the main data column and has its own separate Column Data and Column Segments.
• A ColumnData object uses a ColumnSegmentTree to manage its ColumnSegment objects (horizontal partitioning).
• A (block_id, offset) pair specifies the location of a ColumnSegment within a data block.
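To make the containment relationships concrete, here is a minimal sketch that models the hierarchy as plain structs. The names and fields are illustrative only, not DuckDB's actual class definitions:

/* Minimal sketch of the storage hierarchy (illustrative names and fields,
   not DuckDB's actual class definitions). */
#include <cstdint>
#include <vector>

using idx_t = uint64_t;

struct SketchColumnSegment {     /* leaf level: where the values actually live */
    int64_t block_id;            /* the 256-KB data block holding the data */
    uint32_t offset;             /* byte offset of the segment within that block */
    idx_t row_start;             /* first Row ID covered by this segment */
    idx_t count;                 /* number of rows in this segment */
};

struct SketchColumnData {        /* one column within one row group */
    std::vector<SketchColumnSegment> segments;  /* horizontal partition by rows */
};

struct SketchRowGroup {          /* horizontal slice of the table */
    idx_t row_start;
    idx_t count;
    std::vector<SketchColumnData> columns;      /* vertical partition by columns */
};

struct SketchDataTable {
    std::vector<SketchRowGroup> row_groups;
};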

As shown, both horizontal partitioning steps use a Segment Tree to manage the subsequent level of the hierarchy. RowGroup and ColumnSegment both inherit from SegmentBase<T>, allowing them to function as nodes within a SegmentTree. The SegmentBase class contains two key fields: start, the first Row ID covered by the segment, and count, the number of rows it holds.
Each node in a SegmentTree represents a range of rows. The SegmentTree itself is not actually a tree but an ordered array, enabling a fast binary search on Row ID to locate the correct segment node.

bool TryGetSegmentIndex(SegmentLock &l, idx_t row_number, idx_t &result) {
    ... /* In lazy loading mode, loads SegmentNodes until row_number is covered. */
    idx_t lower = 0;
    idx_t upper = nodes.size() - 1;
    /* Binary search to locate the SegmentNode containing the row_number. */
    while (lower <= upper) {
        idx_t index = (lower + upper) / 2;
        auto &entry = nodes[index];
        if (row_number < entry.row_start) {
            upper = index - 1;
        } else if (row_number >= entry.row_start + entry.node->count) {
            lower = index + 1;
        } else {
            result = index;
            return true;
        }
    }
    return false;
}
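To see the lookup in action, the following self-contained sketch runs the same binary search over three hypothetical segments of 2048 rows each. The Node struct and segment boundaries are made up for illustration; only the search logic mirrors the function above:

/* Standalone demo of the range binary search over made-up segment data.
   Each node covers [row_start, row_start + count). */
#include <cstdint>
#include <cstdio>
#include <vector>

using idx_t = uint64_t;

struct Node { idx_t row_start; idx_t count; };

static bool TryGetSegmentIndex(const std::vector<Node> &nodes, idx_t row, idx_t &result) {
    if (nodes.empty()) {   /* guard for the standalone demo */
        return false;
    }
    idx_t lower = 0, upper = nodes.size() - 1;
    while (lower <= upper) {
        idx_t index = (lower + upper) / 2;
        const auto &entry = nodes[index];
        if (row < entry.row_start) {
            upper = index - 1;  /* as in the original, assumes the first node starts at row 0 */
        } else if (row >= entry.row_start + entry.count) {
            lower = index + 1;
        } else {
            result = index;
            return true;
        }
    }
    return false;
}

int main() {
    /* Three segments covering rows [0,2048), [2048,4096), [4096,6144). */
    std::vector<Node> nodes = {{0, 2048}, {2048, 2048}, {4096, 2048}};
    idx_t index;
    if (TryGetSegmentIndex(nodes, 5000, index)) {
        std::printf("row 5000 -> segment %llu\n", (unsigned long long)index); /* prints segment 2 */
    }
    return 0;
}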
The primary difference between RowGroupSegmentTree and ColumnSegmentTree is that the former supports lazy loading of metadata, while the latter does not. This behavior is controlled by a template parameter. "Loading" here refers to metadata (pointers), not the data blocks themselves.
template <class T, bool SUPPORTS_LAZY_LOADING = false>
class SegmentTree {
    /* With SUPPORTS_LAZY_LOADING, lazy loading is triggered. */
    /* RowGroupSegmentTree implements its own LoadSegment method. */
    bool LoadNextSegment(SegmentLock &l) {
        if (!SUPPORTS_LAZY_LOADING) {
            return false;
        }
        if (finished_loading) {
            return false;
        }
        auto result = LoadSegment();
        if (result) {
            AppendSegmentInternal(l, std::move(result));
            return true;
        }
        return false;
    }
};

/* SUPPORTS_LAZY_LOADING=true: allows lazy loading. */
class RowGroupSegmentTree : public SegmentTree<RowGroup, true> {
    ...
    unique_ptr<RowGroup> LoadSegment() override;
    ...
};

/* SUPPORTS_LAZY_LOADING=false: all ColumnSegment metadata is loaded on initialization:
   -> RowGroup::GetColumn()
   --> ColumnData::Deserialize()
   ---> ColumnData::InitializeColumn()
   ----> ColumnData::AppendSegment() */
class ColumnSegmentTree : public SegmentTree<ColumnSegment> {};

A table's data is physically stored across four distinct locations:
• 256-KB data blocks, which hold the actual column segment data.
• The table metadata meta block list (written by table_metadata_writer), which holds the Column Data metadata (DataPointer arrays), the RowGroupPointers, and the table statistics.
• Separate meta block lists holding each row group's version info (deleted rows), referenced by deletes_pointers.
• The catalog meta block list (written by metadata_writer), which holds the serialized TableCatalogEntry, including the table_pointer into the table metadata.
During a checkpoint, two MetadataWriter instances are used: metadata_writer for catalog entries and table_metadata_writer for detailed table metadata. Note that the similar names can be a point of confusion.
void SingleFileCheckpointWriter::CreateCheckpoint() {
    ...
    /* 1. Create two MetadataWriter objects for the catalog and table metadata. */
    metadata_writer = make_uniq<MetadataWriter>(metadata_manager);
    table_metadata_writer = make_uniq<MetadataWriter>(metadata_manager);
    /* 2. Scan to get all Catalog Entries. */
    ...
    /* 3. Serialize all Catalog Entries using the metadata_writer. */
    BinarySerializer serializer(*metadata_writer, SerializationOptions(db));
    serializer.Begin();
    serializer.WriteList(100, "catalog_entries", catalog_entries.size(), [&](Serializer::List &list, idx_t i) {
        auto &entry = catalog_entries[i];
        list.WriteObject([&](Serializer &obj) { WriteEntry(entry.get(), obj); });
    });
    serializer.End();
    ...
}
As the code below shows, table_metadata_writer is passed during the initialization of both SingleFileTableDataWriter and SingleFileRowGroupWriter. Therefore, table_metadata_writer will be used to serialize table metadata in both RowGroup::Checkpoint and SingleFileTableDataWriter::FinalizeTable.
unique_ptr<TableDataWriter> SingleFileCheckpointWriter::GetTableDataWriter(TableCatalogEntry &table) {
    return make_uniq<SingleFileTableDataWriter>(*this, table, *table_metadata_writer);
}

unique_ptr<RowGroupWriter> SingleFileTableDataWriter::GetRowGroupWriter(RowGroup &row_group) {
    return make_uniq<SingleFileRowGroupWriter>(table, checkpoint_manager.partial_block_manager, *this,
                                               table_data_writer);
}
SingleFileCheckpointWriter::WriteTable
├── TableDataWriter::WriteTableData
│   └── DataTable::Checkpoint
│       ├── RowGroupCollection::Checkpoint
│       │   ├── CheckpointTask::ExecuteTask
│       │   │   └── RowGroup::WriteToDisk
│       │   │       └── StandardColumnData::Checkpoint
│       │   │           ├── ColumnDataCheckpointer::Checkpoint // Checks for modifications, returns if none.
│       │   │           │   └── ColumnDataCheckpointer::WriteToDisk
│       │   │           │       ├── ColumnDataCheckpointer::DropSegments
│       │   │           │       ├── ColumnDataCheckpointer::ScanSegments
│       │   │           │       └── UncompressedFunctions::FinalizeCompress
│       │   │           └── ColumnDataCheckpointer::FinalizeCheckpoint
│       │   └── RowGroup::Checkpoint
│       └── SingleFileTableDataWriter::FinalizeTable
└── PartialBlockManager::FlushPartialBlocks
The call stack for writing table data is top-down, but the data flow is bottom-up, as lower-level components must return pointers to the upper levels.
• ColumnDataCheckpointer::WriteToDisk handles data block operations. It scans existing column segments, applies compression, merges segments, and writes new ones. It generates DataPointer objects and adds them to ColumnCheckpointState::data_pointers; these constitute the Column Data's metadata.
• RowGroup::Checkpoint serializes the data_pointers for all its columns into the table meta block list and serializes version info into a separate list. It constructs a RowGroupPointer object containing pointers to this metadata.
• RowGroupCollection::Checkpoint collects the RowGroupPointer from each row group.
• SingleFileTableDataWriter::FinalizeTable serializes all RowGroupPointers and table statistics. Finally, it writes a pointer to this metadata block into the TableCatalogEntry.
To pass serialization results (pointers) from the bottom to the top, DuckDB uses a series of XXXCheckpointState classes to store context. The UML diagram below shows the relationship between these classes.
• CollectionCheckpointState corresponds to a Row Group Collection and contains multiple RowGroupWriteData objects.
• RowGroupWriteData corresponds to a Row Group and contains multiple ColumnCheckpointState objects.
• ColumnCheckpointState corresponds to a Column Data and contains multiple DataPointer objects. Each DataPointer represents a ColumnSegment and points to its storage location in a data block.
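The containment described above can be sketched as plain structs. The names follow the article; the fields are simplified stand-ins, not DuckDB's actual members:

/* Illustrative sketch of the checkpoint-state containment (simplified fields,
   not DuckDB's actual definitions). Pointers flow bottom-up through these states. */
#include <cstdint>
#include <vector>

using idx_t = uint64_t;

struct SketchDataPointer {          /* one ColumnSegment's location and extent */
    int64_t block_id;
    uint32_t offset_in_block;
    idx_t row_start;
    idx_t tuple_count;
};

struct SketchColumnCheckpointState {       /* one Column Data */
    std::vector<SketchDataPointer> data_pointers;
};

struct SketchRowGroupWriteData {           /* one Row Group */
    std::vector<SketchColumnCheckpointState> states;  /* one per column */
};

struct SketchCollectionCheckpointState {   /* one Row Group Collection */
    std::vector<SketchRowGroupWriteData> write_data;  /* one per row group */
};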

This stage generates DataPointer objects, each representing a ColumnSegment. The process starts at ColumnDataCheckpointer::WriteToDisk. The call stack is as follows:
ColumnDataCheckpointer::WriteToDisk
├── ColumnDataCheckpointer::DropSegments
│   └── ColumnSegment::CommitDropSegment
│       └── SingleFileBlockManager::MarkBlockAsModified // Marks old data blocks as modified.
├── ColumnDataCheckpointer::ScanSegments
│   ├── ColumnData::CheckpointScan // Scans in chunks of 2048 rows.
│   │   └── UpdateSegment::FetchCommittedRange
│   │       └── GetFetchCommittedRangeFunction
│   │           └── TemplatedFetchCommittedRange
│   │               └── MergeUpdateInfoRange // Applies committed updates.
│   └── callback -> UncompressedFunctions::Compress // Loads data into a new Column Segment.
│       ├── UncompressedCompressState::FlushSegment // Flushes a full segment to a disk-backed data block.
│       │   └── ColumnCheckpointState::FlushSegmentInternal
│       │       ├── PartialBlockManager::GetBlockAllocation
│       │       │   └── PartialBlockManager::AllocateBlock
│       │       └── PartialBlockManager::RegisterPartialBlock
│       └── UncompressedCompressState::CreateEmptySegment // Allocates a new segment and a 256-KB in-memory block.
└── UncompressedFunctions::FinalizeCompress // Finalizes the last segment, essentially calling FlushSegment.
    └── UncompressedCompressState::Finalize
        └── UncompressedCompressState::FlushSegment
The entry function ColumnDataCheckpointer::WriteToDisk is straightforward. Taking UncompressedFunctions::Compress as an example, the compress callback is a simple append: when a segment is full, a new one is allocated with a temporary 256-KB in-memory data block. UncompressedCompressState::Finalize then essentially calls FlushSegment on the last segment.
void ColumnDataCheckpointer::WriteToDisk() {
    /* 1. Mark all data blocks of old column segments as modified. */
    /* These blocks will be added to the free list and discarded. */
    DropSegments();
    /* 2. Detect and determine the compression algorithm. We will assume uncompressed. */
    ...
    /* 3. Iterate over existing column segments, reading in chunks, applying updates, */
    /* and then calling a callback to append data to new column segments. */
    ScanSegments([&](Vector &scan_vector, idx_t count) {
        for (idx_t i = 0; i < checkpoint_states.size(); i++) {
            if (!has_changes[i]) {
                continue;
            }
            auto &function = analyze_result[i].function;
            auto &compression_state = compression_states[i];
            function->compress(*compression_state, scan_vector, count);
        }
    });
    /* 4. Finalize the last column segment. */
    for (idx_t i = 0; i < checkpoint_states.size(); i++) {
        if (!has_changes[i]) {
            continue;
        }
        auto &function = analyze_result[i].function;
        auto &compression_state = compression_states[i];
        function->compress_finalize(*compression_state);
    }
}
When a column segment is full (typically 256 KB) or at the end of the process, ColumnCheckpointState::FlushSegmentInternal is called:
1. Merge statistics: merge the segment's statistics into the parent Column Data's statistics.
2. Check for constants: if a segment contains only constant values, only its statistics, not the data, are stored. This occurs in two cases:
• A validity column that is either all NULL or all not NULL.
• A numeric column where all values are identical.
3. For non-constant segments, write the data to a data block: if the segment occupies more than 80% of a 256-KB block, it is given a new block of its own; otherwise, it is copied into an existing partial block.
4. If a data block becomes full and will not be reused, it is flushed to the file immediately. Otherwise, it is registered with the PartialBlockManager and flushed later during PartialBlockManager::FlushPartialBlocks.
5. Construct a DataPointer that points to the data. This represents the ColumnSegment and includes statistics, a block pointer (Block ID, offset), Start Row ID, and Count. This pointer is then added to the data_pointers array, forming the metadata for the Column Data.
void ColumnCheckpointState::FlushSegmentInternal(unique_ptr<ColumnSegment> segment, idx_t segment_size) {
    ...
    /* 1. Merge the segment's statistics into the column data's global statistics. */
    global_stats->Merge(segment->stats.statistics);
    ...
    unique_lock<mutex> partial_block_lock;
    /* 2. Check if the segment stores a constant value:
       - A validity column (all NULLs or all non-NULLs).
       - A numeric column (all values are the same). */
    if (!segment->stats.statistics.IsConstant()) {
        partial_block_lock = partial_block_manager.GetLock();
        /* 3. For non-constant segments, decide whether to allocate a new data block based on size.
           If the segment size is greater than 80% of 256 KB, a new block is required.
           Otherwise, try to fit it into a partial block. */
        PartialBlockAllocation allocation =
            partial_block_manager.GetBlockAllocation(NumericCast<uint32_t>(segment_size));
        block_id = allocation.state.block_id;
        offset_in_block = allocation.state.offset;
        if (allocation.partial_block) { /* Reuse an existing partial block */
            auto &pstate = allocation.partial_block->Cast<PartialBlockForCheckpoint>();
            auto old_handle = buffer_manager.Pin(segment->block);
            auto new_handle = buffer_manager.Pin(pstate.block_handle);
            /* Copy the content */
            memcpy(new_handle.Ptr() + offset_in_block, old_handle.Ptr(), segment_size);
            pstate.AddSegmentToTail(column_data, *segment, offset_in_block);
        } else { /* Allocate a new disk-backed data block */
            if (segment->SegmentSize() != block_size) {
                segment->Resize(block_size);
            }
            allocation.partial_block = make_uniq<PartialBlockForCheckpoint>(column_data, *segment, allocation.state,
                                                                            *allocation.block_manager);
        }
        /* 4. If the block is full, flush it. Otherwise, register it with the PartialBlockManager
           to be flushed later in PartialBlockManager::FlushPartialBlocks. */
        partial_block_manager.RegisterPartialBlock(std::move(allocation));
    } else {
        /* For constant segments, no data block is needed. */
        segment->ConvertToPersistent(nullptr, INVALID_BLOCK);
    }
    /* 5. Construct a DataPointer to the column segment, including stats, block pointer,
       start row, and count, and add it to the data_pointers array. */
    DataPointer data_pointer(segment->stats.statistics.Copy());
    data_pointer.block_pointer.block_id = block_id;
    data_pointer.block_pointer.offset = offset_in_block;
    data_pointer.row_start = row_group.start;
    if (!data_pointers.empty()) {
        auto &last_pointer = data_pointers.back();
        data_pointer.row_start = last_pointer.row_start + last_pointer.tuple_count;
    }
    data_pointer.tuple_count = tuple_count;
    ...
    new_tree.AppendSegment(std::move(segment));
    data_pointers.push_back(std::move(data_pointer));
}
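The 80% rule in step 3 can be illustrated in isolation. The sketch below is a simplified model of that decision only; the constants mirror the article's description, and the function is hypothetical, not DuckDB's actual PartialBlockManager logic:

/* Simplified model of the partial-block decision (assumed constants;
   not DuckDB's actual PartialBlockManager implementation). */
#include <cstdint>
#include <cstdio>

constexpr uint32_t BLOCK_SIZE = 256 * 1024;            /* 256-KB data block */
constexpr uint32_t PARTIAL_LIMIT = BLOCK_SIZE / 5 * 4; /* 80% threshold */

/* A segment larger than 80% of a block gets its own block;
   smaller segments are candidates for sharing a partial block. */
static bool NeedsOwnBlock(uint32_t segment_size) {
    return segment_size > PARTIAL_LIMIT;
}

int main() {
    std::printf("100 KB segment -> own block? %s\n", NeedsOwnBlock(100 * 1024) ? "yes" : "no"); /* no  */
    std::printf("220 KB segment -> own block? %s\n", NeedsOwnBlock(220 * 1024) ? "yes" : "no"); /* yes */
    return 0;
}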
The metadata for Column Data is persisted within RowGroup::Checkpoint. During serialization, pointers to the serialized content are generated and added to a RowGroupPointer, which is then returned to the upper layer. The function is simple:
1. Merge the statistics of all Column Data into the table's global statistics.
2. For each ColumnCheckpointState corresponding to a Column Data: record the current meta block pointer in row_group_pointer.data_pointers, then serialize the Column Data metadata into the table_metadata_writer's meta block list.
3. Serialize the row group's version info (which tracks deleted rows) into a new, separate meta block list, and store pointers to all of its meta blocks in row_group_pointer.deletes_pointers.
RowGroupPointer RowGroup::Checkpoint(RowGroupWriteData write_data, RowGroupWriter &writer,
                                     TableStatistics &global_stats) {
    RowGroupPointer row_group_pointer;
    auto lock = global_stats.GetLock();
    /* 1. Merge all column data statistics into the table's global statistics. */
    for (idx_t column_idx = 0; column_idx < GetColumnCount(); column_idx++) {
        global_stats.GetStats(*lock, column_idx).Statistics().Merge(write_data.statistics[column_idx]);
    }
    row_group_pointer.row_start = start;
    row_group_pointer.tuple_count = count;
    /* 2. Iterate over all ColumnCheckpointState objects. */
    for (auto &state : write_data.states) {
        /* This is the table_metadata_writer. */
        auto &data_writer = writer.GetPayloadWriter();
        auto pointer = data_writer.GetMetaBlockPointer();
        /* 2.i Add the meta block pointer to row_group_pointer.data_pointers. */
        row_group_pointer.data_pointers.push_back(pointer);
        /* 2.ii Serialize the column data metadata into the table_metadata_writer's meta block list. */
        auto persistent_data = state->ToPersistentData();
        BinarySerializer serializer(data_writer);
        serializer.Begin();
        persistent_data.Serialize(serializer);
        serializer.End();
    }
    /* 3. Serialize the row group's version info into a new meta block list and
       store the pointers in row_group_pointer.deletes_pointers. */
    row_group_pointer.deletes_pointers = CheckpointDeletes(writer.GetPayloadWriter().GetManager());
    Verify();
    return row_group_pointer;
}
The specific content of the Column Data metadata consists of two parts:
• The data_pointers array, where each DataPointer object represents a ColumnSegment and contains its start Row ID, tuple count, block pointer (Block ID, offset), compression type, statistics, and optional segment state.
• The metadata for the validity column, which is also a DataPointer array with the same structure.
void PersistentColumnData::Serialize(Serializer &serializer) const {
    serializer.WritePropertyWithDefault(100, "data_pointers", pointers);
    if (child_columns.empty()) {
        return;
    }
    serializer.WriteProperty(101, "validity", child_columns[0]);
    ... /* For ARRAY and LIST types */
}

void DataPointer::Serialize(Serializer &serializer) const {
    serializer.WritePropertyWithDefault<uint64_t>(100, "row_start", row_start);
    serializer.WritePropertyWithDefault<uint64_t>(101, "tuple_count", tuple_count);
    serializer.WriteProperty<BlockPointer>(102, "block_pointer", block_pointer);
    serializer.WriteProperty<CompressionType>(103, "compression_type", compression_type);
    serializer.WriteProperty<BaseStatistics>(104, "statistics", statistics);
    serializer.WritePropertyWithDefault<unique_ptr<ColumnSegmentState>>(105, "segment_state", segment_state);
}
SingleFileTableDataWriter::FinalizeTable can be seen as the final step in writing table metadata. It includes persisting row group metadata and serializing the final TableCatalogEntry. Note that steps 2-3 and step 4 write to different meta block lists:
void SingleFileTableDataWriter::FinalizeTable(const TableStatistics &global_stats, DataTableInfo *info,
                                              Serializer &serializer) {
    /* 1. Get the current meta block pointer from table_data_writer. */
    auto pointer = table_data_writer.GetMetaBlockPointer();
    /* 2. Serialize the table's global statistics into table_data_writer's meta block list. */
    BinarySerializer stats_serializer(table_data_writer, serializer.GetOptions());
    stats_serializer.Begin();
    global_stats.Serialize(stats_serializer);
    stats_serializer.End();
    /* 3. Serialize the row group count and all RowGroupPointers into
       table_data_writer's meta block list. */
    table_data_writer.Write<uint64_t>(row_group_pointers.size());
    idx_t total_rows = 0;
    for (auto &row_group_pointer : row_group_pointers) {
        auto row_group_count = row_group_pointer.row_start + row_group_pointer.tuple_count;
        if (row_group_count > total_rows) {
            total_rows = row_group_count;
        }
        BinarySerializer row_group_serializer(table_data_writer, serializer.GetOptions());
        row_group_serializer.Begin();
        RowGroup::Serialize(row_group_pointer, row_group_serializer);
        row_group_serializer.End();
    }
    /* 4. Serialize the meta block pointer from step 1, total row count, and index info
       into metadata_writer's meta block list (the Catalog). */
    serializer.WriteProperty(101, "table_pointer", pointer);
    serializer.WriteProperty(102, "total_rows", total_rows);
    ...
    auto index_storage_infos = info->GetIndexes().GetStorageInfos(options);
    vector<BlockPointer> compat_block_pointers; /* Serialize an empty array for backward compatibility */
    serializer.WriteProperty(103, "index_pointers", compat_block_pointers);
    serializer.WritePropertyWithDefault(104, "index_storage_infos", index_storage_infos);
}
The specific content of the Row Group metadata consists of four parts:
• Start Row ID
• Row count
• An array of meta block pointers to all Column Data metadata.
• An array of meta block pointers to all Version Info blocks.
void RowGroup::Serialize(RowGroupPointer &pointer, Serializer &serializer) {
    serializer.WriteProperty(100, "row_start", pointer.row_start);
    serializer.WriteProperty(101, "tuple_count", pointer.tuple_count);
    serializer.WriteProperty(102, "data_pointers", pointer.data_pointers);
    serializer.WriteProperty(103, "delete_pointers", pointer.deletes_pointers);
}
CheckpointWriter::WriteEntry
└── SingleFileCheckpointWriter::WriteTable
    ├── CatalogEntry::Serialize // Deep call stack, actual method is shown below:
    │   ├── TableCatalogEntry::GetInfo
    │   └── CreateTableInfo::Serialize
    └── TableDataWriter::WriteTableData
        └── DataTable::Checkpoint
            └── SingleFileTableDataWriter::FinalizeTable
The process of writing a TableCatalogEntry to the Catalog is distributed. The code snippets below are combined to show the full picture of its components:
1. The Catalog Entry type, which is TABLE_ENTRY.
2. The serialized DuckTableEntry object, which includes: (i) common CatalogEntry properties, such as the schema name; and (ii) the table name, column definitions, constraints, and the query (for tables created via CREATE TABLE ... AS SELECT).
3. After the row group metadata is written, the pointer to it is recorded as table_pointer, along with the total row count and index information.
void CheckpointWriter::WriteEntry(...) {
    /* 1. Catalog Entry type, such as TABLE_ENTRY */
    serializer.WriteProperty(99, "catalog_type", entry.type);
    ...
}

void SingleFileCheckpointWriter::WriteTable(...) {
    /* 2. Serialize the DuckTableEntry object. */
    serializer.WriteProperty(100, "table", &table);
    ...
}

void CreateTableInfo::Serialize(Serializer &serializer) const {
    /* 2.i Common CatalogEntry properties, such as schema name. */
    CreateInfo::Serialize(serializer);
    /* 2.ii Table name, column definitions, constraints, ... */
    serializer.WritePropertyWithDefault<string>(200, "table", table);
    serializer.WriteProperty<ColumnList>(201, "columns", columns);
    serializer.WritePropertyWithDefault<vector<unique_ptr<Constraint>>>(202, "constraints", constraints);
    serializer.WritePropertyWithDefault<unique_ptr<SelectStatement>>(203, "query", query);
}

void SingleFileTableDataWriter::FinalizeTable(...) {
    ...
    /* 3. After writing row group metadata, record its pointer as table_pointer, along with row count and index info. */
    serializer.WriteProperty(101, "table_pointer", pointer);
    serializer.WriteProperty(102, "total_rows", total_rows);
    ...
    auto index_storage_infos = info->GetIndexes().GetStorageInfos(options);
    vector<BlockPointer> compat_block_pointers; /* Serialize an empty array for backward compatibility */
    serializer.WriteProperty(103, "index_pointers", compat_block_pointers);
    serializer.WritePropertyWithDefault(104, "index_storage_infos", index_storage_infos);
}
This article focused on the table storage format. The key takeaways include:
• A table's storage is organized into a four-level hierarchy: Table → Row Group → Column Data → Column Segment.
• A table's data is stored in four distinct locations on disk: 256-KB data blocks (the column segment data), the table metadata meta block list (table statistics, RowGroupPointers, and per-column DataPointer arrays), separate meta block lists for each row group's version info, and the catalog meta block list (the serialized TableCatalogEntry with its table_pointer).
• Table data is persisted during a checkpoint, with the process integrated into the serialization of its TableCatalogEntry.
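As a closing illustration, the pointer chain a reader follows can be sketched as plain data structures. The types below are simplified stand-ins that follow the article's terminology, not DuckDB's actual deserialization code:

/* Conceptual sketch of the on-disk pointer chain (simplified types;
   not DuckDB's actual definitions). */
#include <cstdint>
#include <vector>

using idx_t = uint64_t;

struct SketchMetaBlockPointer { idx_t block_id; uint32_t offset; };

struct SketchRowGroupPointer {                         /* one Row Group */
    idx_t row_start;
    idx_t tuple_count;
    std::vector<SketchMetaBlockPointer> data_pointers;    /* -> per-column metadata */
    std::vector<SketchMetaBlockPointer> deletes_pointers; /* -> version info */
};

struct SketchTableEntryOnDisk {                        /* catalog side */
    SketchMetaBlockPointer table_pointer;              /* -> stats + RowGroupPointers */
    idx_t total_rows;
};

/* Read path: catalog entry -> table_pointer -> RowGroupPointers
   -> per-column DataPointer arrays -> 256-KB data blocks. */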