ApsaraDB for SelectDB supports three data models. Choose the right one before creating a table — the model cannot be changed afterward.
| Model | Best for | Key limitation |
|---|---|---|
| Aggregate key model | Fixed-pattern report queries, pre-aggregated metrics | COUNT(*) is expensive; each value column is bound to one aggregation type |
| Unique key model | Relational data requiring primary key uniqueness (orders, user profiles) | Cannot use ROLLUP pre-aggregation |
| Duplicate key model | Ad hoc queries, log analysis, raw data storage | No pre-aggregation; no automatic deduplication |
Background
In ApsaraDB for SelectDB, tables consist of rows and columns. Columns fall into two categories:
- Key columns — correspond to dimension columns. Defined after the DUPLICATE KEY, AGGREGATE KEY, or UNIQUE KEY keywords in a CREATE TABLE statement.
- Value columns — correspond to metric columns. All columns not listed as key columns.
These two categories map directly to the three data models.
Aggregate key model
The Aggregate key model pre-aggregates data at write time. When you import rows that share identical values in all key columns, SelectDB collapses them into a single row. Each value column is aggregated using the function you specify in the CREATE TABLE statement.
Use this model for report queries with fixed aggregation patterns — for example, daily active users or revenue by region. If your queries vary in aggregation logic, or if you need COUNT(*) to be fast, use the Duplicate key model or the Unique key model with MoW instead. See Limitations of the Aggregate key model for details.
Aggregation types
| Aggregation type | Description |
|---|---|
| SUM | Calculates the sum across rows. Applicable to numeric values. |
| MIN | Retains the minimum value. Applicable to numeric values. |
| MAX | Retains the maximum value. Applicable to numeric values. |
| REPLACE | Replaces the previous value with the newly imported value. For rows with the same key columns, values are replaced in import order. |
| REPLACE_IF_NOT_NULL | Same as REPLACE, but null values are not replaced. Set the column default to null (not an empty string); an empty string is treated as a real value and overwrites existing data. |
| HLL_UNION | Aggregates columns of the HyperLogLog (HLL) type using the HLL algorithm. |
| BITMAP_UNION | Aggregates BITMAP columns using union aggregation. |
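To make the REPLACE_IF_NOT_NULL note concrete, here is a minimal sketch of a column definition that uses it (the table and column names are illustrative, not from this document). The NULL default is what allows later imports to leave the column untouched when they carry no value:

```sql
-- Illustrative table: last_visit_date is updated only by non-null imports.
CREATE TABLE IF NOT EXISTS test.replace_if_not_null_demo
(
    `user_id` LARGEINT NOT NULL COMMENT "The user ID",
    `last_visit_date` DATETIME REPLACE_IF_NOT_NULL NULL DEFAULT NULL
        COMMENT "Updated only when the imported value is not null"
)
AGGREGATE KEY(`user_id`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 16;
```

If the default were an empty string instead of NULL, an import carrying an empty string would count as a real value and overwrite the stored timestamp.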
Example 1: Basic aggregation on import
The example_tbl1 table records user visit data. Key columns (user_id, date, city, age, sex) identify unique records; value columns store metrics.
| Column name | Type | Aggregation type | Comment |
|---|---|---|---|
| user_id | LARGEINT | N/A | The user ID. |
| date | DATE | N/A | The date on which data is written to the table. |
| city | VARCHAR(20) | N/A | The city in which the user resides. |
| age | SMALLINT | N/A | The age of the user. |
| sex | TINYINT | N/A | The gender of the user. |
| last_visit_date | DATETIME | REPLACE | The last time when the user paid a visit. |
| cost | BIGINT | SUM | The amount of money that the user spends. |
| max_dwell_time | INT | MAX | The maximum dwell time of the user. |
| min_dwell_time | INT | MIN | The minimum dwell time of the user. |
```sql
CREATE TABLE IF NOT EXISTS test.example_tbl1
(
    `user_id` LARGEINT NOT NULL COMMENT "The user ID",
    `date` DATE NOT NULL COMMENT "The date on which data is written to the table",
    `city` VARCHAR(20) COMMENT "The city in which the user resides",
    `age` SMALLINT COMMENT "The age of the user",
    `sex` TINYINT COMMENT "The gender of the user",
    `last_visit_date` DATETIME REPLACE DEFAULT "1970-01-01 00:00:00" COMMENT "The last time when the user paid a visit",
    `cost` BIGINT SUM DEFAULT "0" COMMENT "The amount of money that the user spends",
    `max_dwell_time` INT MAX DEFAULT "0" COMMENT "The maximum dwell time of the user",
    `min_dwell_time` INT MIN DEFAULT "99999" COMMENT "The minimum dwell time of the user"
)
AGGREGATE KEY(`user_id`, `date`, `city`, `age`, `sex`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 16;
```

Insert the following rows:
| user_id | date | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
|---|---|---|---|---|---|---|---|---|
| 10000 | 2017-10-01 | Beijing | 20 | 0 | 2017-10-01 06:00:00 | 20 | 10 | 10 |
| 10000 | 2017-10-01 | Beijing | 20 | 0 | 2017-10-01 07:00:00 | 15 | 2 | 2 |
| 10001 | 2017-10-01 | Beijing | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 |
| 10002 | 2017-10-02 | Shanghai | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 |
| 10003 | 2017-10-02 | Guangzhou | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 |
| 10004 | 2017-10-01 | Shenzhen | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 |
| 10004 | 2017-10-03 | Shenzhen | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6 |
```sql
INSERT INTO test.example_tbl1 VALUES
(10000,"2017-10-01","Beijing",20,0,"2017-10-01 06:00:00",20,10,10),
(10000,"2017-10-01","Beijing",20,0,"2017-10-01 07:00:00",15,2,2),
(10001,"2017-10-01","Beijing",30,1,"2017-10-01 17:05:45",2,22,22),
(10002,"2017-10-02","Shanghai",20,1,"2017-10-02 12:59:12",200,5,5),
(10003,"2017-10-02","Guangzhou",32,0,"2017-10-02 11:20:00",30,11,11),
(10004,"2017-10-01","Shenzhen",35,0,"2017-10-01 10:00:15",100,3,3),
(10004,"2017-10-03","Shenzhen",35,0,"2017-10-03 10:20:22",11,6,6);
```

After the import, SelectDB stores one aggregated row for user 10000 (rows 1 and 2 share the same key columns). All other users had unique keys, so their rows are stored unchanged:
| user_id | date | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
|---|---|---|---|---|---|---|---|---|
| 10000 | 2017-10-01 | Beijing | 20 | 0 | 2017-10-01 07:00:00 | 35 | 10 | 2 |
| 10001 | 2017-10-01 | Beijing | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 |
| 10002 | 2017-10-02 | Shanghai | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 |
| 10003 | 2017-10-02 | Guangzhou | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 |
| 10004 | 2017-10-01 | Shenzhen | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 |
| 10004 | 2017-10-03 | Shenzhen | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6 |
How user 10000's row was aggregated:

- last_visit_date (REPLACE): 2017-10-01 06:00:00 is replaced by 2017-10-01 07:00:00. When REPLACE is applied within the same import batch, the replacement order is not guaranteed, so the stored value could be either timestamp. Across different batches, the later batch always wins.
- cost (SUM): 20 + 15 = 35.
- max_dwell_time (MAX): max(10, 2) = 10.
- min_dwell_time (MIN): min(10, 2) = 2.
After aggregation, the original raw rows no longer exist in storage.
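A query against the table therefore operates on the collapsed rows only. As a sketch against the example_tbl1 table created above:

```sql
-- Returns one row per key combination; user 10000's cost is already 35,
-- because the two raw rows were merged at import time.
SELECT user_id, date, SUM(cost) AS total_cost
FROM test.example_tbl1
GROUP BY user_id, date;
```

There is no way to recover the original per-visit rows (20 and 15) from this table once they have been aggregated.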
Example 2: Aggregating new data with existing data
This example shows how SelectDB aggregates a second import batch against already-stored data.
Create example_tbl2 (same schema as example_tbl1):
```sql
CREATE TABLE IF NOT EXISTS test.example_tbl2
(
    `user_id` LARGEINT NOT NULL COMMENT "The user ID",
    `date` DATE NOT NULL COMMENT "The date on which data is written to the table",
    `city` VARCHAR(20) COMMENT "The city in which the user resides",
    `age` SMALLINT COMMENT "The age of the user",
    `sex` TINYINT COMMENT "The gender of the user",
    `last_visit_date` DATETIME REPLACE DEFAULT "1970-01-01 00:00:00" COMMENT "The last time when the user paid a visit",
    `cost` BIGINT SUM DEFAULT "0" COMMENT "The amount of money that the user spends",
    `max_dwell_time` INT MAX DEFAULT "0" COMMENT "The maximum dwell time of the user",
    `min_dwell_time` INT MIN DEFAULT "99999" COMMENT "The minimum dwell time of the user"
)
AGGREGATE KEY(`user_id`, `date`, `city`, `age`, `sex`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 16;
```

Import batch 1:
```sql
INSERT INTO test.example_tbl2 VALUES
(10000,"2017-10-01","Beijing",20,0,"2017-10-01 06:00:00",20,10,10),
(10000,"2017-10-01","Beijing",20,0,"2017-10-01 07:00:00",15,2,2),
(10001,"2017-10-01","Beijing",30,1,"2017-10-01 17:05:45",2,22,22),
(10002,"2017-10-02","Shanghai",20,1,"2017-10-02 12:59:12",200,5,5),
(10003,"2017-10-02","Guangzhou",32,0,"2017-10-02 11:20:00",30,11,11),
(10004,"2017-10-01","Shenzhen",35,0,"2017-10-01 10:00:15",100,3,3),
(10004,"2017-10-03","Shenzhen",35,0,"2017-10-03 10:20:22",11,6,6);
```

Import batch 2 (adds new data for user 10004 and introduces user 10005):
```sql
INSERT INTO test.example_tbl2 VALUES
(10004,"2017-10-03","Shenzhen",35,0,"2017-10-03 11:22:00",44,19,19),
(10005,"2017-10-03","Changsha",29,1,"2017-10-03 18:11:02",3,1,1);
```

Final stored state:
| user_id | date | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
|---|---|---|---|---|---|---|---|---|
| 10000 | 2017-10-01 | Beijing | 20 | 0 | 2017-10-01 07:00:00 | 35 | 10 | 2 |
| 10001 | 2017-10-01 | Beijing | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 |
| 10002 | 2017-10-02 | Shanghai | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 |
| 10003 | 2017-10-02 | Guangzhou | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 |
| 10004 | 2017-10-01 | Shenzhen | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 |
| 10004 | 2017-10-03 | Shenzhen | 35 | 0 | 2017-10-03 11:22:00 | 55 | 19 | 6 |
| 10005 | 2017-10-03 | Changsha | 29 | 1 | 2017-10-03 18:11:02 | 3 | 1 | 1 |
User 10004's row for 2017-10-03 is aggregated across both batches (cost: 11 + 44 = 55; max_dwell_time: max(6, 19) = 19; min_dwell_time: min(6, 19) = 6). User 10005 is new, so the row is inserted as-is.
Aggregation in ApsaraDB for SelectDB happens at three stages:

1. ETL (extract, transform, and load) stage — each import batch is internally aggregated before being written.
2. Data compaction stage — the compute cluster merges data across different import batches in the background.
3. Query stage — any remaining unaggregated data is aggregated at query time.

The degree of aggregation at any given moment is not visible to users. Always assume that queried data is fully aggregated.
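The practical consequence: a query returns the fully aggregated value regardless of how many batches background compaction has merged so far. As a sketch against the example_tbl2 table above:

```sql
-- Returns cost = 55 for this key even if the two import batches
-- have not yet been merged by background compaction.
SELECT cost
FROM test.example_tbl2
WHERE user_id = 10004 AND date = '2017-10-03';
```

Application code never needs to account for partially aggregated storage; that is an internal detail of the compaction process.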
Example 3: Retaining detailed data using the Aggregate key model
Adding a high-cardinality column to the key columns prevents rows from sharing the same key — effectively retaining all raw data while still using the Aggregate key model.
Add a timestamp column (accurate to the second) to the key:
```sql
CREATE TABLE IF NOT EXISTS test.example_tbl3
(
    `user_id` LARGEINT NOT NULL COMMENT "The user ID",
    `date` DATE NOT NULL COMMENT "The date on which data is written to the table",
    `timestamp` DATETIME NOT NULL COMMENT "The time when data is written to the table, which is accurate to seconds",
    `city` VARCHAR(20) COMMENT "The city in which the user resides",
    `age` SMALLINT COMMENT "The age of the user",
    `sex` TINYINT COMMENT "The gender of the user",
    `last_visit_date` DATETIME REPLACE DEFAULT "1970-01-01 00:00:00" COMMENT "The last time when the user paid a visit",
    `cost` BIGINT SUM DEFAULT "0" COMMENT "The amount of money that the user spends",
    `max_dwell_time` INT MAX DEFAULT "0" COMMENT "The maximum dwell time of the user",
    `min_dwell_time` INT MIN DEFAULT "99999" COMMENT "The minimum dwell time of the user"
)
AGGREGATE KEY(`user_id`, `date`, `timestamp`, `city`, `age`, `sex`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 16;
```

Because every row has a unique timestamp, no two rows ever share the same key columns — so no aggregation occurs and all raw data is retained.
Limitations of the Aggregate key model
Because the Aggregate key model always presents fully aggregated data, some queries produce unexpected results or incur high cost.
COUNT(*) overhead
Consider a table example_tbl8 with the following schema:
| Column name | Type | Aggregation type | Comment |
|---|---|---|---|
| user_id | LARGEINT | N/A | The user ID. |
| date | DATE | N/A | The date on which data is written to the table. |
| cost | BIGINT | SUM | The amount of money that the user spends. |
Import two batches:
Batch 1:
| user_id | date | cost |
|---|---|---|
| 10001 | 2017-11-20 | 50 |
| 10002 | 2017-11-21 | 39 |
Batch 2:
| user_id | date | cost |
|---|---|---|
| 10001 | 2017-11-20 | 1 |
| 10001 | 2017-11-21 | 5 |
| 10003 | 2017-11-22 | 22 |
The five original rows may still exist in underlying storage before compaction. Running COUNT(*):

```sql
SELECT COUNT(*) FROM example_tbl8;
```

Expected result: 4 (four unique key combinations). But if SelectDB scans only the user_id column and skips aggregation, it returns 3. If it counts all uncompacted rows, it returns 5. Both are wrong.
To return the correct result, the query engine must scan all key columns (user_id and date) and aggregate them — a full-key scan on every COUNT(*) call.
Workaround: Add a count column with a fixed value of 1 and aggregation type SUM:
| Column name | Type | Aggregation type | Comment |
|---|---|---|---|
| user_id | BIGINT | N/A | The user ID. |
| date | DATE | N/A | The date on which data is written to the table. |
| cost | BIGINT | SUM | The amount of money that the user spends. |
| count | BIGINT | SUM | Row counter. |
SELECT SUM(count) FROM table; is equivalent to SELECT COUNT(*) FROM table; and runs significantly faster. Note that this approach does not support re-importing rows with the same key columns — doing so inflates the counter.
Alternatively, set the count column's aggregation type to REPLACE with a fixed value of 1. This produces the same result as COUNT(*) and allows re-importing duplicate keys.
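The workaround can be sketched as follows (the table name is illustrative). Each raw row is imported with count = 1, so the SUM aggregation makes the column equal to the number of imported rows:

```sql
CREATE TABLE IF NOT EXISTS test.example_tbl8_count
(
    `user_id` BIGINT NOT NULL COMMENT "The user ID",
    `date` DATE NOT NULL COMMENT "The date on which data is written to the table",
    `cost` BIGINT SUM DEFAULT "0" COMMENT "The amount of money that the user spends",
    `count` BIGINT SUM DEFAULT "1" COMMENT "Row counter; always import the value 1"
)
AGGREGATE KEY(`user_id`, `date`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 16;

-- Equivalent to COUNT(*) but only reads a single pre-summed column:
SELECT SUM(`count`) FROM test.example_tbl8_count;
```

With aggregation type REPLACE instead of SUM, `count` stays at 1 per key, so SUM(`count`) counts distinct keys even when the same rows are re-imported.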
Aggregate query semantics
Each value column is bound to one aggregation type at table creation. Running a different aggregate function on a value column may return semantically incorrect results. Design the schema around the aggregation types you need.
Unique key model
The Unique key model differs from the Aggregate key model in that it enforces primary key uniqueness rather than aggregating values. When new rows arrive with the same key columns as existing rows, only the latest imported row is retained.
Use this model for relational data with uniqueness requirements, such as orders or user profiles. It does not support ROLLUP pre-aggregation.
ApsaraDB for SelectDB provides two implementation methods:
Merge on Write (MoW) — removes duplicates during the import stage. Recommended for best query performance.
Merge on Read (MoR) — applies deduplication at query time.
MoW (recommended)
MoW marks overwritten rows as deleted during import and writes the latest data directly. At query time, deleted rows are filtered out without any merge operation. This eliminates the aggregation overhead of MoR and enables predicate pushdown in most scenarios, resulting in faster aggregate queries.
MoW is disabled by default in ApsaraDB for SelectDB V3.0. Enable it by setting "enable_unique_key_merge_on_write" = "true" when creating the table:

```sql
CREATE TABLE IF NOT EXISTS test.example_tbl6
(
    `user_id` LARGEINT NOT NULL COMMENT "The user ID",
    `username` VARCHAR(50) NOT NULL COMMENT "The nickname of the user",
    `city` VARCHAR(20) COMMENT "The city in which the user resides",
    `age` SMALLINT COMMENT "The age of the user",
    `sex` TINYINT COMMENT "The gender of the user",
    `phone` LARGEINT COMMENT "The phone number of the user",
    `address` VARCHAR(500) COMMENT "The address of the user",
    `register_time` DATETIME COMMENT "The time when the user is registered"
)
UNIQUE KEY(`user_id`, `username`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 16
PROPERTIES (
    "enable_unique_key_merge_on_write" = "true"
);
```

MoR cannot be upgraded seamlessly to MoW because the two methods organize data differently. To migrate, create a new MoW table and run INSERT INTO unique-mow-table SELECT * FROM source_table to import the existing data.

The unique deletion sign and the sequence column still work with MoW enabled.
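For reference, a sequence column lets "latest row wins" be decided by a business column rather than by import order. A sketch, assuming the Doris-style function_column.sequence_col table property is available in your SelectDB version (the table name here is illustrative):

```sql
-- Among rows with the same key, the row with the larger register_time
-- is retained, regardless of the order in which rows were imported.
CREATE TABLE IF NOT EXISTS test.example_tbl6_seq
(
    `user_id` LARGEINT NOT NULL COMMENT "The user ID",
    `username` VARCHAR(50) NOT NULL COMMENT "The nickname of the user",
    `register_time` DATETIME COMMENT "The time when the user is registered"
)
UNIQUE KEY(`user_id`, `username`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 16
PROPERTIES (
    "enable_unique_key_merge_on_write" = "true",
    "function_column.sequence_col" = "register_time"
);
```

Check your product version's documentation for the exact property name before relying on this.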
MoR
MoR applies deduplication logic at query time by merging data on read. It is equivalent to the Aggregate key model with all value columns set to REPLACE.
```sql
CREATE TABLE IF NOT EXISTS test.example_tbl4
(
    `user_id` LARGEINT NOT NULL COMMENT "The user ID",
    `username` VARCHAR(50) NOT NULL COMMENT "The nickname of the user",
    `city` VARCHAR(20) COMMENT "The city in which the user resides",
    `age` SMALLINT COMMENT "The age of the user",
    `sex` TINYINT COMMENT "The gender of the user",
    `phone` LARGEINT COMMENT "The phone number of the user",
    `address` VARCHAR(500) COMMENT "The address of the user",
    `register_time` DATETIME COMMENT "The time when the user is registered"
)
UNIQUE KEY(`user_id`, `username`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 16;
```

The MoR Unique key model above is internally identical to the following Aggregate key model with all value columns using REPLACE:
```sql
CREATE TABLE IF NOT EXISTS test.example_tbl5
(
    `user_id` LARGEINT NOT NULL COMMENT "The user ID",
    `username` VARCHAR(50) NOT NULL COMMENT "The nickname of the user",
    `city` VARCHAR(20) REPLACE COMMENT "The city in which the user resides",
    `age` SMALLINT REPLACE COMMENT "The age of the user",
    `sex` TINYINT REPLACE COMMENT "The gender of the user",
    `phone` LARGEINT REPLACE COMMENT "The phone number of the user",
    `address` VARCHAR(500) REPLACE COMMENT "The address of the user",
    `register_time` DATETIME REPLACE COMMENT "The time when the user is registered"
)
AGGREGATE KEY(`user_id`, `username`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 16;
```

Duplicate key model
The Duplicate key model differs from the other two in that it stores all rows exactly as imported, with no aggregation and no uniqueness enforcement. Multiple rows with the same key column values coexist without affecting each other.
Use this model for ad hoc queries on raw data such as logs, where you need the full detail and do not need deduplication or pre-aggregation.
The key columns specified in the CREATE TABLE statement act as sort keys only, not unique identifiers. We recommend that you select the first two to four columns as the duplicate key.
```sql
CREATE TABLE IF NOT EXISTS test.example_tbl7
(
    `timestamp` DATETIME NOT NULL COMMENT "The time when the log was generated",
    `type` INT NOT NULL COMMENT "The type of the log",
    `error_code` INT COMMENT "The error code",
    `error_msg` VARCHAR(1024) COMMENT "The error message",
    `op_id` BIGINT COMMENT "The owner ID",
    `op_time` DATETIME COMMENT "The time when the error was handled"
)
DUPLICATE KEY(`timestamp`, `type`, `error_code`)
DISTRIBUTED BY HASH(`type`) BUCKETS 16;
```

| Column | Type | Sort key | Comment |
|---|---|---|---|
| timestamp | DATETIME | Yes | The time when the log was generated. |
| type | INT | Yes | The type of the log. |
| error_code | INT | Yes | The error code. |
| error_msg | VARCHAR(1024) | No | The error message. |
| op_id | BIGINT | No | The owner ID. |
| op_time | DATETIME | No | The time when the error was handled. |
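Because the duplicate key is a sort key, queries that filter on a prefix of (timestamp, type, error_code) let the engine skip large ranges of sorted data. A sketch against example_tbl7 (the date range and type value are illustrative):

```sql
-- Filters on the leading sort-key columns, which supports efficient pruning.
SELECT error_code, error_msg
FROM test.example_tbl7
WHERE timestamp >= '2024-01-01 00:00:00'
  AND timestamp <  '2024-01-02 00:00:00'
  AND type = 1;
```

A filter on op_id alone, by contrast, cannot use the sort order and scans more data.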
Key columns across models
Key columns serve different roles depending on the model:
| Model | Role of key columns |
|---|---|
| Duplicate key model | Sort keys only — not unique identifiers |
| Aggregate key model | Sort keys and unique identifiers |
| Unique key model | Sort keys and unique identifiers |
Choose a data model
| Scenario | Recommended model |
|---|---|
| Fixed-pattern aggregation reports (for example, daily active users, revenue by region) | Aggregate key model |
| Relational data with primary key constraints (for example, orders, user profiles) | Unique key model (MoW) |
| High-performance aggregate queries on primary-key data | Unique key model with MoW enabled |
| Raw data analysis, log queries, ad hoc exploration | Duplicate key model |
| Partial column updates | Unique key model — see Partial Update |
When not to use the Aggregate key model: Avoid it when your COUNT(*) queries must be fast, or when you need to run aggregate functions that differ from the aggregation types defined at table creation. The Aggregate key model is only efficient when rows collapse significantly — if most rows have unique keys, the pre-aggregation overhead outweighs the benefit.
How MoW improves COUNT(*) performance
MoW uses a delete bitmap to track overwritten rows rather than aggregating at query time. Using the same two-batch scenario from the Aggregate key model limitation above:
After batch 1:
| user_id | date | cost | delete bit |
|---|---|---|---|
| 10001 | 2017-11-20 | 50 | false |
| 10002 | 2017-11-21 | 39 | false |
After batch 2 (batch 1's duplicate row is marked deleted):
| user_id | date | cost | delete bit |
|---|---|---|---|
| 10001 | 2017-11-20 | 50 | true |
| 10002 | 2017-11-21 | 39 | false |
Batch 2's rows are written as new data:
| user_id | date | cost | delete bit |
|---|---|---|---|
| 10001 | 2017-11-20 | 1 | false |
| 10001 | 2017-11-21 | 5 | false |
| 10003 | 2017-11-22 | 22 | false |
COUNT(*) skips rows whose delete bit is true and scans only one column, correctly returning 4 with minimal overhead. In the test environment, COUNT(*) queries on a Unique key table with MoW enabled deliver 10 times higher performance than on an Aggregate key table.