
ApsaraDB for SelectDB: Data models

Last Updated: Mar 28, 2026

ApsaraDB for SelectDB supports three data models. Choose the right one before creating a table — the model cannot be changed afterward.

| Model | Best for | Key limitation |
| --- | --- | --- |
| Aggregate key model | Fixed-pattern report queries, pre-aggregated metrics | COUNT(*) is expensive; each value column is bound to one aggregation type |
| Unique key model | Relational data requiring primary key uniqueness (orders, user profiles) | Cannot use ROLLUP pre-aggregation |
| Duplicate key model | Ad hoc queries, log analysis, raw data storage | No pre-aggregation; no automatic deduplication |

Background

In ApsaraDB for SelectDB, tables consist of rows and columns. Columns fall into two categories:

  • Key columns — correspond to dimension columns. Defined after the DUPLICATE KEY, AGGREGATE KEY, or UNIQUE KEY keywords in a CREATE TABLE statement.

  • Value columns — correspond to metric columns. All columns not listed as key columns.

These two categories map directly to the three data models.

Aggregate key model

The Aggregate key model pre-aggregates data at write time. When you import rows that share identical values in all key columns, SelectDB collapses them into a single row. Each value column is aggregated using the function you specify in the CREATE TABLE statement.

Use this model for report queries with fixed aggregation patterns — for example, daily active users or revenue by region. If your queries vary in aggregation logic, or if you need COUNT(*) to be fast, use the Duplicate key model or the Unique key model with MoW instead. See Limitations of the Aggregate key model for details.

Aggregation types

| Aggregation type | Description |
| --- | --- |
| SUM | Calculates the sum across rows. Applicable to numeric values. |
| MIN | Retains the minimum value. Applicable to numeric values. |
| MAX | Retains the maximum value. Applicable to numeric values. |
| REPLACE | Replaces the previous value with the newly imported value. For rows with the same key columns, values are replaced in import order. |
| REPLACE_IF_NOT_NULL | Same as REPLACE, but skips null values. Specify null (not an empty string) as the column default; otherwise, empty strings are overwritten. |
| HLL_UNION | Aggregates columns of the HyperLogLog (HLL) type by using the HLL algorithm. |
| BITMAP_UNION | Aggregates BITMAP columns by using union aggregation. |
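
HLL_UNION and BITMAP_UNION are typically used to pre-aggregate distinct-count metrics such as unique visitors. The following is a minimal sketch, assuming illustrative table and column names and the hll_hash, to_bitmap, HLL_UNION_AGG, and BITMAP_UNION_COUNT functions from open-source Apache Doris:

CREATE TABLE IF NOT EXISTS test.example_uv
(
    `dt` DATE NOT NULL COMMENT "The date",
    `page` VARCHAR(100) NOT NULL COMMENT "The page ID",
    `visitors_hll` HLL HLL_UNION COMMENT "Approximate distinct visitors",
    `visitors_bmp` BITMAP BITMAP_UNION COMMENT "Exact distinct visitors"
)
AGGREGATE KEY(`dt`, `page`)
DISTRIBUTED BY HASH(`dt`) BUCKETS 16;

-- Convert raw user IDs to HLL and BITMAP values at import time.
INSERT INTO test.example_uv VALUES
("2017-10-01", "home", hll_hash(10000), to_bitmap(10000)),
("2017-10-01", "home", hll_hash(10001), to_bitmap(10001));

-- Distinct visitors per day: approximate (HLL) and exact (BITMAP).
SELECT dt, HLL_UNION_AGG(visitors_hll), BITMAP_UNION_COUNT(visitors_bmp)
FROM test.example_uv
GROUP BY dt;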

Example 1: Basic aggregation on import

The example_tbl1 table records user visit data. Key columns (user_id, date, city, age, sex) identify unique records; value columns store metrics.

| Column name | Type | Aggregation type | Comment |
| --- | --- | --- | --- |
| user_id | LARGEINT | N/A | The user ID. |
| date | DATE | N/A | The date on which data is written to the table. |
| city | VARCHAR(20) | N/A | The city in which the user resides. |
| age | SMALLINT | N/A | The age of the user. |
| sex | TINYINT | N/A | The gender of the user. |
| last_visit_date | DATETIME | REPLACE | The last time when the user paid a visit. |
| cost | BIGINT | SUM | The amount of money that the user spends. |
| max_dwell_time | INT | MAX | The maximum dwell time of the user. |
| min_dwell_time | INT | MIN | The minimum dwell time of the user. |

CREATE TABLE IF NOT EXISTS test.example_tbl1
(
    `user_id` LARGEINT NOT NULL COMMENT "The user ID",
    `date` DATE NOT NULL COMMENT "The date on which data is written to the table",
    `city` VARCHAR(20) COMMENT "The city in which the user resides",
    `age` SMALLINT COMMENT "The age of the user",
    `sex` TINYINT COMMENT "The gender of the user",
    `last_visit_date` DATETIME REPLACE DEFAULT "1970-01-01 00:00:00" COMMENT "The last time when the user paid a visit",
    `cost` BIGINT SUM DEFAULT "0" COMMENT "The amount of money that the user spends",
    `max_dwell_time` INT MAX DEFAULT "0" COMMENT "The maximum dwell time of the user",
    `min_dwell_time` INT MIN DEFAULT "99999" COMMENT "The minimum dwell time of the user"
)
AGGREGATE KEY(`user_id`, `date`, `city`, `age`, `sex`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 16;

Insert the following rows:

| user_id | date | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 10000 | 2017-10-01 | Beijing | 20 | 0 | 2017-10-01 06:00:00 | 20 | 10 | 10 |
| 10000 | 2017-10-01 | Beijing | 20 | 0 | 2017-10-01 07:00:00 | 15 | 2 | 2 |
| 10001 | 2017-10-01 | Beijing | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 |
| 10002 | 2017-10-02 | Shanghai | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 |
| 10003 | 2017-10-02 | Guangzhou | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 |
| 10004 | 2017-10-01 | Shenzhen | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 |
| 10004 | 2017-10-03 | Shenzhen | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6 |

INSERT INTO test.example_tbl1 VALUES
(10000,"2017-10-01","Beijing",20,0,"2017-10-01 06:00:00",20,10,10),
(10000,"2017-10-01","Beijing",20,0,"2017-10-01 07:00:00",15,2,2),
(10001,"2017-10-01","Beijing",30,1,"2017-10-01 17:05:45",2,22,22),
(10002,"2017-10-02","Shanghai",20,1,"2017-10-02 12:59:12",200,5,5),
(10003,"2017-10-02","Guangzhou",32,0,"2017-10-02 11:20:00",30,11,11),
(10004,"2017-10-01","Shenzhen",35,0,"2017-10-01 10:00:15",100,3,3),
(10004,"2017-10-03","Shenzhen",35,0,"2017-10-03 10:20:22",11,6,6);

After the import, SelectDB stores one aggregated row for user 10000 because the first two rows share identical key columns. All other rows have unique keys and are stored unchanged:

| user_id | date | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 10000 | 2017-10-01 | Beijing | 20 | 0 | 2017-10-01 07:00:00 | 35 | 10 | 2 |
| 10001 | 2017-10-01 | Beijing | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 |
| 10002 | 2017-10-02 | Shanghai | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 |
| 10003 | 2017-10-02 | Guangzhou | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 |
| 10004 | 2017-10-01 | Shenzhen | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 |
| 10004 | 2017-10-03 | Shenzhen | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6 |

How user 10000's row was aggregated:

  • last_visit_date (REPLACE): 2017-10-01 06:00:00 is replaced by 2017-10-01 07:00:00.

    When the REPLACE type is applied within the same import batch, the replacement order is not guaranteed. The stored value could be either timestamp. Across different batches, the later batch always wins.
  • cost (SUM): 20 + 15 = 35.

  • max_dwell_time (MAX): max(10, 2) = 10.

  • min_dwell_time (MIN): min(10, 2) = 2.

After aggregation, the original raw rows no longer exist in storage.
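
You can confirm this by querying the table. Assuming the import above, the following query returns only the single aggregated row for user 10000:

SELECT * FROM test.example_tbl1 WHERE user_id = 10000;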

Example 2: Aggregating new data with existing data

This example shows how SelectDB aggregates a second import batch against already-stored data.

Create example_tbl2 (same schema as example_tbl1):

CREATE TABLE IF NOT EXISTS test.example_tbl2
(
    `user_id` LARGEINT NOT NULL COMMENT "The user ID",
    `date` DATE NOT NULL COMMENT "The date on which data is written to the table",
    `city` VARCHAR(20) COMMENT "The city in which the user resides",
    `age` SMALLINT COMMENT "The age of the user",
    `sex` TINYINT COMMENT "The gender of the user",
    `last_visit_date` DATETIME REPLACE DEFAULT "1970-01-01 00:00:00" COMMENT "The last time when the user paid a visit",
    `cost` BIGINT SUM DEFAULT "0" COMMENT "The amount of money that the user spends",
    `max_dwell_time` INT MAX DEFAULT "0" COMMENT "The maximum dwell time of the user",
    `min_dwell_time` INT MIN DEFAULT "99999" COMMENT "The minimum dwell time of the user"
)
AGGREGATE KEY(`user_id`, `date`, `city`, `age`, `sex`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 16;

Import batch 1:

INSERT INTO test.example_tbl2 VALUES
(10000,"2017-10-01","Beijing",20,0,"2017-10-01 06:00:00",20,10,10),
(10000,"2017-10-01","Beijing",20,0,"2017-10-01 07:00:00",15,2,2),
(10001,"2017-10-01","Beijing",30,1,"2017-10-01 17:05:45",2,22,22),
(10002,"2017-10-02","Shanghai",20,1,"2017-10-02 12:59:12",200,5,5),
(10003,"2017-10-02","Guangzhou",32,0,"2017-10-02 11:20:00",30,11,11),
(10004,"2017-10-01","Shenzhen",35,0,"2017-10-01 10:00:15",100,3,3),
(10004,"2017-10-03","Shenzhen",35,0,"2017-10-03 10:20:22",11,6,6);

Import batch 2 (adds new data for user 10004 and introduces user 10005):

INSERT INTO test.example_tbl2 VALUES
(10004,"2017-10-03","Shenzhen",35,0,"2017-10-03 11:22:00",44,19,19),
(10005,"2017-10-03","Changsha",29,1,"2017-10-03 18:11:02",3,1,1);

Final stored state:

| user_id | date | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 10000 | 2017-10-01 | Beijing | 20 | 0 | 2017-10-01 07:00:00 | 35 | 10 | 2 |
| 10001 | 2017-10-01 | Beijing | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 |
| 10002 | 2017-10-02 | Shanghai | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 |
| 10003 | 2017-10-02 | Guangzhou | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 |
| 10004 | 2017-10-01 | Shenzhen | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 |
| 10004 | 2017-10-03 | Shenzhen | 35 | 0 | 2017-10-03 11:22:00 | 55 | 19 | 6 |
| 10005 | 2017-10-03 | Changsha | 29 | 1 | 2017-10-03 18:11:02 | 3 | 1 | 1 |

User 10004's row for 2017-10-03 is aggregated across both batches (cost: 11 + 44 = 55; max_dwell_time: max(6, 19) = 19; min_dwell_time: min(6, 19) = 6). User 10005 is new, so the row is inserted as-is.

Aggregation in ApsaraDB for SelectDB happens at three stages:

  1. ETL (extract, transform, and load) stage — each import batch is internally aggregated before being written.

  2. Data compaction stage — the compute cluster merges data across different import batches in the background.

  3. Query stage — any remaining unaggregated data is aggregated at query time.

How far the data has been aggregated at any point in time is not exposed to users. Always write queries as if the data were fully aggregated.
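
In practice, this means an aggregate query returns the same totals no matter how much background compaction has completed. For example, the following query against example_tbl2 always reflects the fully aggregated state shown above:

SELECT user_id, `date`, SUM(cost) AS total_cost
FROM test.example_tbl2
GROUP BY user_id, `date`;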

Example 3: Retaining detailed data using the Aggregate key model

Adding a high-cardinality column to the key columns prevents rows from sharing the same key — effectively retaining all raw data while still using the Aggregate key model.

Add a timestamp column (accurate to the second) to the key:

CREATE TABLE IF NOT EXISTS test.example_tbl3
(
    `user_id` LARGEINT NOT NULL COMMENT "The user ID",
    `date` DATE NOT NULL COMMENT "The date on which data is written to the table",
    `timestamp` DATETIME NOT NULL COMMENT "The time when data is written to the table, which is accurate to seconds",
    `city` VARCHAR(20) COMMENT "The city in which the user resides",
    `age` SMALLINT COMMENT "The age of the user",
    `sex` TINYINT COMMENT "The gender of the user",
    `last_visit_date` DATETIME REPLACE DEFAULT "1970-01-01 00:00:00" COMMENT "The last time when the user paid a visit",
    `cost` BIGINT SUM DEFAULT "0" COMMENT "The amount of money that the user spends",
    `max_dwell_time` INT MAX DEFAULT "0" COMMENT "The maximum dwell time of the user",
    `min_dwell_time` INT MIN DEFAULT "99999" COMMENT "The minimum dwell time of the user"
)
AGGREGATE KEY(`user_id`, `date`, `timestamp`, `city`, `age`, `sex`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 16;

Because every row has a unique timestamp, no two rows ever share the same key columns — so no aggregation occurs. All raw data is retained.
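
For example, importing two rows for user 10000 that differ only in the timestamp column (values are illustrative) keeps both rows:

INSERT INTO test.example_tbl3 VALUES
(10000,"2017-10-01","2017-10-01 06:00:00","Beijing",20,0,"2017-10-01 06:00:00",20,10,10),
(10000,"2017-10-01","2017-10-01 07:00:00","Beijing",20,0,"2017-10-01 07:00:00",15,2,2);

-- Both rows are stored as-is because their `timestamp` key values differ.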

Limitations of the Aggregate key model

Because the Aggregate key model always presents fully aggregated data, some queries produce unexpected results or incur high cost.

COUNT(*) overhead

Consider a table example_tbl8 with the following schema:

| Column name | Type | Aggregation type | Comment |
| --- | --- | --- | --- |
| user_id | LARGEINT | N/A | The user ID. |
| date | DATE | N/A | The date on which data is written to the table. |
| cost | BIGINT | SUM | The amount of money that the user spends. |

Import two batches:

Batch 1:

| user_id | date | cost |
| --- | --- | --- |
| 10001 | 2017-11-20 | 50 |
| 10002 | 2017-11-21 | 39 |

Batch 2:

| user_id | date | cost |
| --- | --- | --- |
| 10001 | 2017-11-20 | 1 |
| 10001 | 2017-11-21 | 5 |
| 10003 | 2017-11-22 | 22 |

The five original rows may still exist in underlying storage before compaction. Running COUNT(*):

SELECT COUNT(*) FROM example_tbl8;

Expected result: 4 (four unique key combinations). But if SelectDB scans only the user_id column and aggregates on it, it returns 3 (three distinct user IDs); if it counts all uncompacted rows without aggregating, it returns 5. Both results are wrong.

To return the correct result, the query engine must scan all key columns (user_id and date) and aggregate them — a full-key scan on every COUNT(*) call.

Workaround: Add a count column with a fixed value of 1 and aggregation type SUM:

| Column name | Type | Aggregation type | Comment |
| --- | --- | --- | --- |
| user_id | BIGINT | N/A | The user ID. |
| date | DATE | N/A | The date on which data is written to the table. |
| cost | BIGINT | SUM | The amount of money that the user spends. |
| count | BIGINT | SUM | Row counter. |

SELECT SUM(count) FROM table; is equivalent to SELECT COUNT(*) FROM table; and runs significantly faster. Note that this approach does not support re-importing rows with the same key columns — doing so inflates the counter.
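
The following is a minimal sketch of this workaround; the table name is illustrative:

CREATE TABLE IF NOT EXISTS test.example_tbl8_count
(
    `user_id` BIGINT NOT NULL COMMENT "The user ID",
    `date` DATE NOT NULL COMMENT "The date on which data is written to the table",
    `cost` BIGINT SUM DEFAULT "0" COMMENT "The amount of money that the user spends",
    `count` BIGINT SUM DEFAULT "1" COMMENT "Row counter; import 1 for every row"
)
AGGREGATE KEY(`user_id`, `date`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 16;

-- Equivalent to SELECT COUNT(*), but scans and sums only one column.
SELECT SUM(`count`) FROM test.example_tbl8_count;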

Alternatively, set the count column's aggregation type to REPLACE with a fixed value of 1. This produces the same result as COUNT(*) and allows re-importing duplicate keys.

Aggregate query semantics

Each value column is bound to one aggregation type at table creation. Running a different aggregate function on a value column may return semantically incorrect results. Design the schema around the aggregation types you need.

Unique key model

The Unique key model differs from the Aggregate key model in that it enforces primary key uniqueness rather than aggregating values. When new rows arrive with the same key columns as existing rows, only the latest imported row is retained.

Use this model for relational data with uniqueness requirements, such as orders or user profiles. It does not support ROLLUP pre-aggregation.

ApsaraDB for SelectDB provides two implementation methods:

  • Merge on Write (MoW) — removes duplicates during the import stage. Recommended for best query performance.

  • Merge on Read (MoR) — applies deduplication at query time.

MoW (recommended)

MoW marks overwritten rows as deleted during import and writes the latest data directly. At query time, deleted rows are filtered out without any merge operation. This eliminates the aggregation overhead of MoR and enables predicate pushdown in most scenarios, resulting in faster aggregate queries.

MoW is disabled by default in ApsaraDB for SelectDB V3.0. Enable it by setting "enable_unique_key_merge_on_write" = "true" when creating the table.

CREATE TABLE IF NOT EXISTS test.example_tbl6
(
    `user_id` LARGEINT NOT NULL COMMENT "The user ID",
    `username` VARCHAR(50) NOT NULL COMMENT "The nickname of the user",
    `city` VARCHAR(20) COMMENT "The city in which the user resides",
    `age` SMALLINT COMMENT "The age of the user",
    `sex` TINYINT COMMENT "The gender of the user",
    `phone` LARGEINT COMMENT "The phone number of the user",
    `address` VARCHAR(500) COMMENT "The address of the user",
    `register_time` DATETIME COMMENT "The time when the user is registered"
)
UNIQUE KEY(`user_id`, `username`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 16
PROPERTIES (
    "enable_unique_key_merge_on_write" = "true"
);

MoR cannot be upgraded in place to MoW because the two methods organize data differently. To migrate, create a new MoW table and run INSERT INTO new_mow_table SELECT * FROM source_table to import the existing data. The deletion sign and sequence column of the Unique key model still work when MoW is enabled.
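
For reference, a sequence column is designated through a table property so that, among rows with the same key columns, the row with the largest sequence value wins regardless of import order. The following is a minimal sketch, assuming the property name used by open-source Apache Doris and illustrative column names:

CREATE TABLE IF NOT EXISTS test.example_tbl6_seq
(
    `user_id` LARGEINT NOT NULL COMMENT "The user ID",
    `username` VARCHAR(50) NOT NULL COMMENT "The nickname of the user",
    `city` VARCHAR(20) COMMENT "The city in which the user resides",
    `modify_time` DATETIME COMMENT "Used as the sequence column"
)
UNIQUE KEY(`user_id`, `username`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 16
PROPERTIES (
    "enable_unique_key_merge_on_write" = "true",
    "function_column.sequence_col" = "modify_time"
);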

MoR

MoR applies deduplication logic at query time by merging data on read. It is equivalent to the Aggregate key model with all value columns set to REPLACE.

CREATE TABLE IF NOT EXISTS test.example_tbl4
(
    `user_id` LARGEINT NOT NULL COMMENT "The user ID",
    `username` VARCHAR(50) NOT NULL COMMENT "The nickname of the user",
    `city` VARCHAR(20) COMMENT "The city in which the user resides",
    `age` SMALLINT COMMENT "The age of the user",
    `sex` TINYINT COMMENT "The gender of the user",
    `phone` LARGEINT COMMENT "The phone number of the user",
    `address` VARCHAR(500) COMMENT "The address of the user",
    `register_time` DATETIME COMMENT "The time when the user is registered"
)
UNIQUE KEY(`user_id`, `username`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 16;
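
For example, if two rows with the same key columns are imported one after another (values are illustrative), only the later row remains visible to queries:

INSERT INTO test.example_tbl4 (user_id, username, city, age) VALUES
(10000, "zhangsan", "Beijing", 20);
INSERT INTO test.example_tbl4 (user_id, username, city, age) VALUES
(10000, "zhangsan", "Shanghai", 21);

-- Returns a single row with city = "Shanghai" and age = 21.
SELECT * FROM test.example_tbl4 WHERE user_id = 10000;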

The MoR Unique key model above is internally identical to the following Aggregate key model with all value columns using REPLACE:

CREATE TABLE IF NOT EXISTS test.example_tbl5
(
    `user_id` LARGEINT NOT NULL COMMENT "The user ID",
    `username` VARCHAR(50) NOT NULL COMMENT "The nickname of the user",
    `city` VARCHAR(20) REPLACE COMMENT "The city in which the user resides",
    `age` SMALLINT REPLACE COMMENT "The age of the user",
    `sex` TINYINT REPLACE COMMENT "The gender of the user",
    `phone` LARGEINT REPLACE COMMENT "The phone number of the user",
    `address` VARCHAR(500) REPLACE COMMENT "The address of the user",
    `register_time` DATETIME REPLACE COMMENT "The time when the user is registered"
)
AGGREGATE KEY(`user_id`, `username`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 16;

Duplicate key model

The Duplicate key model differs from the other two in that it stores all rows exactly as imported, with no aggregation and no uniqueness enforcement. Multiple rows with the same key column values coexist without affecting each other.

Use this model for ad hoc queries on raw data such as logs, where you need the full detail and do not need deduplication or pre-aggregation.

The key columns specified in the CREATE TABLE statement act as sort keys only, not unique identifiers. We recommend that you select the first two to four columns as the duplicate key.

CREATE TABLE IF NOT EXISTS test.example_tbl7
(
    `timestamp` DATETIME NOT NULL COMMENT "The time when the log was generated",
    `type` INT NOT NULL COMMENT "The type of the log",
    `error_code` INT COMMENT "The error code",
    `error_msg` VARCHAR(1024) COMMENT "The error message",
    `op_id` BIGINT COMMENT "The owner ID",
    `op_time` DATETIME COMMENT "The time when the error was handled"
)
DUPLICATE KEY(`timestamp`, `type`, `error_code`)
DISTRIBUTED BY HASH(`type`) BUCKETS 16;

| Column | Type | Sort key | Comment |
| --- | --- | --- | --- |
| timestamp | DATETIME | Yes | The time when the log was generated. |
| type | INT | Yes | The type of the log. |
| error_code | INT | Yes | The error code. |
| error_msg | VARCHAR(1024) | No | The error message. |
| op_id | BIGINT | No | The owner ID. |
| op_time | DATETIME | No | The time when the error was handled. |
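
For example, importing the same log row twice (values are illustrative) stores two identical rows:

INSERT INTO test.example_tbl7 VALUES
("2017-10-01 08:00:05", 1, 404, "not found", 101, "2017-10-01 08:00:06"),
("2017-10-01 08:00:05", 1, 404, "not found", 101, "2017-10-01 08:00:06");

-- Both rows are retained, so this returns 2.
SELECT COUNT(*) FROM test.example_tbl7;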

Key columns across models

Key columns serve different roles depending on the model:

| Model | Role of key columns |
| --- | --- |
| Duplicate key model | Sort keys only; not unique identifiers |
| Aggregate key model | Sort keys and unique identifiers |
| Unique key model | Sort keys and unique identifiers |

Choose a data model

| Scenario | Recommended model |
| --- | --- |
| Fixed-pattern aggregation reports (for example, daily active users, revenue by region) | Aggregate key model |
| Relational data with primary key constraints (for example, orders, user profiles) | Unique key model (MoW) |
| High-performance aggregate queries on primary-key data | Unique key model with MoW enabled |
| Raw data analysis, log queries, ad hoc exploration | Duplicate key model |
| Partial column updates | Unique key model; see Partial Update |

When not to use the Aggregate key model: Avoid it when your COUNT(*) queries must be fast, or when you need to run aggregate functions that differ from the aggregation types defined at table creation. The Aggregate key model is only efficient when rows collapse significantly — if most rows have unique keys, the pre-aggregation overhead outweighs the benefit.

How MoW improves COUNT(*) performance

MoW uses a delete bitmap to track overwritten rows rather than aggregating at query time. Using the same two-batch scenario from the Aggregate key model limitation above:

After batch 1:

| user_id | date | cost | delete bit |
| --- | --- | --- | --- |
| 10001 | 2017-11-20 | 50 | false |
| 10002 | 2017-11-21 | 39 | false |

After batch 2, the duplicate row in batch 1's rowset is marked deleted:

| user_id | date | cost | delete bit |
| --- | --- | --- | --- |
| 10001 | 2017-11-20 | 50 | true |
| 10002 | 2017-11-21 | 39 | false |

Batch 2's rowset is written as-is:

| user_id | date | cost | delete bit |
| --- | --- | --- | --- |
| 10001 | 2017-11-20 | 1 | false |
| 10001 | 2017-11-21 | 5 | false |
| 10003 | 2017-11-22 | 22 | false |

COUNT(*) skips rows whose delete bit is true and needs to scan only one column, so it returns the correct result of 4 with minimal overhead. In the test environment, COUNT(*) queries that use the MoW implementation of the Unique key model deliver 10 times higher performance than those that use the Aggregate key model.