ApsaraDB for SelectDB supports three data models. Choose the right one before creating a table — the model cannot be changed afterward.
| Model | Best for | Key limitation |
|---|---|---|
| Aggregate key model | Fixed-pattern report queries, pre-aggregated metrics | COUNT(*) is expensive; each value column is bound to one aggregation type |
| Unique key model | Relational data requiring primary key uniqueness (orders, user profiles) | Cannot use ROLLUP pre-aggregation |
| Duplicate key model | Ad hoc queries, log analysis, raw data storage | No pre-aggregation; no automatic deduplication |
Background
In ApsaraDB for SelectDB, tables consist of rows and columns. Columns fall into two categories:
- Key columns — correspond to dimension columns. Defined after the DUPLICATE KEY, AGGREGATE KEY, or UNIQUE KEY keywords in a CREATE TABLE statement.
- Value columns — correspond to metric columns. All columns not listed as key columns.
These two categories map directly to the three data models.
Aggregate key model
The Aggregate key model pre-aggregates data at write time. When you import rows that share identical values in all key columns, SelectDB collapses them into a single row. Each value column is aggregated using the function you specify in the CREATE TABLE statement.
Use this model for report queries with fixed aggregation patterns — for example, daily active users or revenue by region. If your queries vary in aggregation logic, or if you need COUNT(*) to be fast, use the Duplicate key model or the Unique key model with MoW instead. See Limitations of the Aggregate key model for details.
Aggregation types
| Aggregation type | Description |
|---|---|
| SUM | Calculates the sum across rows. Applicable to numeric values. |
| MIN | Retains the minimum value. Applicable to numeric values. |
| MAX | Retains the maximum value. Applicable to numeric values. |
| REPLACE | Replaces the previous value with the newly imported value. For rows with the same key columns, values are replaced in import order. |
| REPLACE_IF_NOT_NULL | Same as REPLACE, but null values are not replaced. Set the column default to null (not an empty string); an empty string is treated as a real value and overwrites existing data. |
| HLL_UNION | Aggregates columns of the HyperLogLog (HLL) type using the HLL algorithm. |
| BITMAP_UNION | Aggregates BITMAP columns using union aggregation. |
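To make the REPLACE_IF_NOT_NULL note concrete, here is a minimal sketch of a column definition that uses it (the table and column names are illustrative, not from this document). The NULL default is what allows later imports to leave the column untouched when they carry no value:

```sql
-- Illustrative table: last_visit_date is updated only by non-null imports.
CREATE TABLE IF NOT EXISTS test.replace_if_not_null_demo
(
    `user_id` LARGEINT NOT NULL COMMENT "The user ID",
    `last_visit_date` DATETIME REPLACE_IF_NOT_NULL NULL DEFAULT NULL
        COMMENT "Updated only when the imported value is not null"
)
AGGREGATE KEY(`user_id`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 16;
```

If the default were an empty string instead of NULL, an import carrying an empty string would count as a real value and overwrite the stored timestamp.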
Example 1: Basic aggregation on import
The example_tbl1 table records user visit data. Key columns (user_id, date, city, age, sex) identify unique records; value columns store metrics.
| Column name | Type | Aggregation type | Comment |
|---|---|---|---|
| user_id | LARGEINT | N/A | The user ID. |
| date | DATE | N/A | The date on which data is written to the table. |
| city | VARCHAR(20) | N/A | The city in which the user resides. |
| age | SMALLINT | N/A | The age of the user. |
| sex | TINYINT | N/A | The gender of the user. |
| last_visit_date | DATETIME | REPLACE | The last time when the user paid a visit. |
| cost | BIGINT | SUM | The amount of money that the user spends. |
| max_dwell_time | INT | MAX | The maximum dwell time of the user. |
| min_dwell_time | INT | MIN | The minimum dwell time of the user. |
```sql
CREATE TABLE IF NOT EXISTS test.example_tbl1
(
    `user_id` LARGEINT NOT NULL COMMENT "The user ID",
    `date` DATE NOT NULL COMMENT "The date on which data is written to the table",
    `city` VARCHAR(20) COMMENT "The city in which the user resides",
    `age` SMALLINT COMMENT "The age of the user",
    `sex` TINYINT COMMENT "The gender of the user",
    `last_visit_date` DATETIME REPLACE DEFAULT "1970-01-01 00:00:00" COMMENT "The last time when the user paid a visit",
    `cost` BIGINT SUM DEFAULT "0" COMMENT "The amount of money that the user spends",
    `max_dwell_time` INT MAX DEFAULT "0" COMMENT "The maximum dwell time of the user",
    `min_dwell_time` INT MIN DEFAULT "99999" COMMENT "The minimum dwell time of the user"
)
AGGREGATE KEY(`user_id`, `date`, `city`, `age`, `sex`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 16;
```

Insert the following rows:
| user_id | date | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
|---|---|---|---|---|---|---|---|---|
| 10000 | 2017-10-01 | Beijing | 20 | 0 | 2017-10-01 06:00:00 | 20 | 10 | 10 |
| 10000 | 2017-10-01 | Beijing | 20 | 0 | 2017-10-01 07:00:00 | 15 | 2 | 2 |
| 10001 | 2017-10-01 | Beijing | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 |
| 10002 | 2017-10-02 | Shanghai | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 |
| 10003 | 2017-10-02 | Guangzhou | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 |
| 10004 | 2017-10-01 | Shenzhen | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 |
| 10004 | 2017-10-03 | Shenzhen | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6 |
```sql
INSERT INTO test.example_tbl1 VALUES
(10000,"2017-10-01","Beijing",20,0,"2017-10-01 06:00:00",20,10,10),
(10000,"2017-10-01","Beijing",20,0,"2017-10-01 07:00:00",15,2,2),
(10001,"2017-10-01","Beijing",30,1,"2017-10-01 17:05:45",2,22,22),
(10002,"2017-10-02","Shanghai",20,1,"2017-10-02 12:59:12",200,5,5),
(10003,"2017-10-02","Guangzhou",32,0,"2017-10-02 11:20:00",30,11,11),
(10004,"2017-10-01","Shenzhen",35,0,"2017-10-01 10:00:15",100,3,3),
(10004,"2017-10-03","Shenzhen",35,0,"2017-10-03 10:20:22",11,6,6);
```

After the import, SelectDB stores one aggregated row for user 10000 (rows 1 and 2 share the same key columns). All other users had unique keys, so their rows are stored unchanged:
| user_id | date | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
|---|---|---|---|---|---|---|---|---|
| 10000 | 2017-10-01 | Beijing | 20 | 0 | 2017-10-01 07:00:00 | 35 | 10 | 2 |
| 10001 | 2017-10-01 | Beijing | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 |
| 10002 | 2017-10-02 | Shanghai | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 |
| 10003 | 2017-10-02 | Guangzhou | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 |
| 10004 | 2017-10-01 | Shenzhen | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 |
| 10004 | 2017-10-03 | Shenzhen | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6 |
How user 10000's row was aggregated:

- last_visit_date (REPLACE): 2017-10-01 06:00:00 is replaced by 2017-10-01 07:00:00. When REPLACE is applied within the same import batch, the replacement order is not guaranteed, so the stored value could be either timestamp. Across different batches, the later batch always wins.
- cost (SUM): 20 + 15 = 35.
- max_dwell_time (MAX): max(10, 2) = 10.
- min_dwell_time (MIN): min(10, 2) = 2.
After aggregation, the original raw rows no longer exist in storage.
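A query against the table therefore operates on the collapsed rows only. As a sketch against the example_tbl1 table created above:

```sql
-- Returns one row per key combination; user 10000's cost is already 35,
-- because the two raw rows were merged at import time.
SELECT user_id, date, SUM(cost) AS total_cost
FROM test.example_tbl1
GROUP BY user_id, date;
```

There is no way to recover the original per-visit rows (20 and 15) from this table once they have been aggregated.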
Example 2: Aggregating new data with existing data
This example shows how SelectDB aggregates a second import batch against already-stored data.
Create example_tbl2 (same schema as example_tbl1):
```sql
CREATE TABLE IF NOT EXISTS test.example_tbl2
(
    `user_id` LARGEINT NOT NULL COMMENT "The user ID",
    `date` DATE NOT NULL COMMENT "The date on which data is written to the table",
    `city` VARCHAR(20) COMMENT "The city in which the user resides",
    `age` SMALLINT COMMENT "The age of the user",
    `sex` TINYINT COMMENT "The gender of the user",
    `last_visit_date` DATETIME REPLACE DEFAULT "1970-01-01 00:00:00" COMMENT "The last time when the user paid a visit",
    `cost` BIGINT SUM DEFAULT "0" COMMENT "The amount of money that the user spends",
    `max_dwell_time` INT MAX DEFAULT "0" COMMENT "The maximum dwell time of the user",
    `min_dwell_time` INT MIN DEFAULT "99999" COMMENT "The minimum dwell time of the user"
)
AGGREGATE KEY(`user_id`, `date`, `city`, `age`, `sex`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 16;
```

Import batch 1:
```sql
INSERT INTO test.example_tbl2 VALUES
(10000,"2017-10-01","Beijing",20,0,"2017-10-01 06:00:00",20,10,10),
(10000,"2017-10-01","Beijing",20,0,"2017-10-01 07:00:00",15,2,2),
(10001,"2017-10-01","Beijing",30,1,"2017-10-01 17:05:45",2,22,22),
(10002,"2017-10-02","Shanghai",20,1,"2017-10-02 12:59:12",200,5,5),
(10003,"2017-10-02","Guangzhou",32,0,"2017-10-02 11:20:00",30,11,11),
(10004,"2017-10-01","Shenzhen",35,0,"2017-10-01 10:00:15",100,3,3),
(10004,"2017-10-03","Shenzhen",35,0,"2017-10-03 10:20:22",11,6,6);
```

Import batch 2 (adds new data for user 10004 and introduces user 10005):
```sql
INSERT INTO test.example_tbl2 VALUES
(10004,"2017-10-03","Shenzhen",35,0,"2017-10-03 11:22:00",44,19,19),
(10005,"2017-10-03","Changsha",29,1,"2017-10-03 18:11:02",3,1,1);
```

Final stored state:
| user_id | date | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
|---|---|---|---|---|---|---|---|---|
| 10000 | 2017-10-01 | Beijing | 20 | 0 | 2017-10-01 07:00:00 | 35 | 10 | 2 |
| 10001 | 2017-10-01 | Beijing | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 |
| 10002 | 2017-10-02 | Shanghai | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 |
| 10003 | 2017-10-02 | Guangzhou | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 |
| 10004 | 2017-10-01 | Shenzhen | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 |
| 10004 | 2017-10-03 | Shenzhen | 35 | 0 | 2017-10-03 11:22:00 | 55 | 19 | 6 |
| 10005 | 2017-10-03 | Changsha | 29 | 1 | 2017-10-03 18:11:02 | 3 | 1 | 1 |
User 10004's row for 2017-10-03 is aggregated across both batches (cost: 11 + 44 = 55; max_dwell_time: max(6, 19) = 19; min_dwell_time: min(6, 19) = 6). User 10005 is new, so the row is inserted as-is.
Aggregation in ApsaraDB for SelectDB happens at three stages:

1. ETL (extract, transform, and load) stage — each import batch is internally aggregated before being written.
2. Data compaction stage — the compute cluster merges data across different import batches in the background.
3. Query stage — any remaining unaggregated data is aggregated at query time.

The degree of aggregation at any given moment is not visible to users. Always assume that queried data is fully aggregated.
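The practical consequence: a query returns the fully aggregated value regardless of how many batches background compaction has merged so far. As a sketch against the example_tbl2 table above:

```sql
-- Returns cost = 55 for this key even if the two import batches
-- have not yet been merged by background compaction.
SELECT cost
FROM test.example_tbl2
WHERE user_id = 10004 AND date = '2017-10-03';
```

Application code never needs to account for partially aggregated storage; that is an internal detail of the compaction process.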
Example 3: Retaining detailed data using the Aggregate key model
Adding a high-cardinality column to the key columns prevents rows from sharing the same key — effectively retaining all raw data while still using the Aggregate key model.
Add a timestamp column (accurate to the second) to the key:
```sql
CREATE TABLE IF NOT EXISTS test.example_tbl3
(
    `user_id` LARGEINT NOT NULL COMMENT "The user ID",
    `date` DATE NOT NULL COMMENT "The date on which data is written to the table",
    `timestamp` DATETIME NOT NULL COMMENT "The time when data is written to the table, which is accurate to seconds",
    `city` VARCHAR(20) COMMENT "The city in which the user resides",
    `age` SMALLINT COMMENT "The age of the user",
    `sex` TINYINT COMMENT "The gender of the user",
    `last_visit_date` DATETIME REPLACE DEFAULT "1970-01-01 00:00:00" COMMENT "The last time when the user paid a visit",
    `cost` BIGINT SUM DEFAULT "0" COMMENT "The amount of money that the user spends",
    `max_dwell_time` INT MAX DEFAULT "0" COMMENT "The maximum dwell time of the user",
    `min_dwell_time` INT MIN DEFAULT "99999" COMMENT "The minimum dwell time of the user"
)
AGGREGATE KEY(`user_id`, `date`, `timestamp`, `city`, `age`, `sex`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 16;
```

Because every row has a unique timestamp, no two rows ever share the same key columns — so no aggregation occurs and all raw data is retained.
Limitations of the Aggregate key model
Because the Aggregate key model always presents fully aggregated data, some queries produce unexpected results or incur high cost.
COUNT(*) overhead
Consider a table example_tbl8 with the following schema:
| Column name | Type | Aggregation type | Comment |
|---|---|---|---|
| user_id | LARGEINT | N/A | The user ID. |
| date | DATE | N/A | The date on which data is written to the table. |
| cost | BIGINT | SUM | The amount of money that the user spends. |
Import two batches:
Batch 1:
| user_id | date | cost |
|---|---|---|
| 10001 | 2017-11-20 | 50 |
| 10002 | 2017-11-21 | 39 |
Batch 2:
| user_id | date | cost |
|---|---|---|
| 10001 | 2017-11-20 | 1 |
| 10001 | 2017-11-21 | 5 |
| 10003 | 2017-11-22 | 22 |
The five original rows may still exist in underlying storage before compaction. Running COUNT(*):

```sql
SELECT COUNT(*) FROM example_tbl8;
```

Expected result: 4 (four unique key combinations). But if SelectDB scans only the user_id column and skips aggregation, it returns 3. If it counts all uncompacted rows, it returns 5. Both are wrong.
To return the correct result, the query engine must scan all key columns (user_id and date) and aggregate them — a full-key scan on every COUNT(*) call.
Workaround: Add a count column with a fixed value of 1 and aggregation type SUM:
| Column name | Type | Aggregation type | Comment |
|---|---|---|---|
| user_id | BIGINT | N/A | The user ID. |
| date | DATE | N/A | The date on which data is written to the table. |
| cost | BIGINT | SUM | The amount of money that the user spends. |
| count | BIGINT | SUM | Row counter. |
SELECT SUM(count) FROM table; is equivalent to SELECT COUNT(*) FROM table; and runs significantly faster. Note that this approach does not support re-importing rows with the same key columns — doing so inflates the counter.
Alternatively, set the count column's aggregation type to REPLACE with a fixed value of 1. This produces the same result as COUNT(*) and allows re-importing duplicate keys.
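The workaround can be sketched as follows (the table name is illustrative). Each raw row is imported with count = 1, so the SUM aggregation makes the column equal to the number of imported rows:

```sql
CREATE TABLE IF NOT EXISTS test.example_tbl8_count
(
    `user_id` BIGINT NOT NULL COMMENT "The user ID",
    `date` DATE NOT NULL COMMENT "The date on which data is written to the table",
    `cost` BIGINT SUM DEFAULT "0" COMMENT "The amount of money that the user spends",
    `count` BIGINT SUM DEFAULT "1" COMMENT "Row counter; always import the value 1"
)
AGGREGATE KEY(`user_id`, `date`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 16;

-- Equivalent to COUNT(*) but only reads a single pre-summed column:
SELECT SUM(`count`) FROM test.example_tbl8_count;
```

With aggregation type REPLACE instead of SUM, `count` stays at 1 per key, so SUM(`count`) counts distinct keys even when the same rows are re-imported.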
Aggregate query semantics
Each value column is bound to one aggregation type at table creation. Running a different aggregate function on a value column may return semantically incorrect results. Design the schema around the aggregation types you need.
Unique key model
The Unique key model differs from the Aggregate key model in that it enforces primary key uniqueness rather than aggregating values. When new rows arrive with the same key columns as existing rows, only the latest imported row is retained.
Use this model for relational data with uniqueness requirements, such as orders or user profiles. It does not support ROLLUP pre-aggregation.
ApsaraDB for SelectDB provides two implementation methods:
Merge on Write (MoW) — removes duplicates during the import stage. Recommended for best query performance.
Merge on Read (MoR) — applies deduplication at query time.
MoW (recommended)
MoW marks overwritten rows as deleted during import and writes the latest data directly. At query time, deleted rows are filtered out without any merge operation. This eliminates the aggregation overhead of MoR and enables predicate pushdown in most scenarios, resulting in faster aggregate queries.
MoW is disabled by default in ApsaraDB for SelectDB V3.0. Enable it by setting "enable_unique_key_merge_on_write" = "true" when creating the table:

```sql
CREATE TABLE IF NOT EXISTS test.example_tbl6
(
    `user_id` LARGEINT NOT NULL COMMENT "The user ID",
    `username` VARCHAR(50) NOT NULL COMMENT "The nickname of the user",
    `city` VARCHAR(20) COMMENT "The city in which the user resides",
    `age` SMALLINT COMMENT "The age of the user",
    `sex` TINYINT COMMENT "The gender of the user",
    `phone` LARGEINT COMMENT "The phone number of the user",
    `address` VARCHAR(500) COMMENT "The address of the user",
    `register_time` DATETIME COMMENT "The time when the user is registered"
)
UNIQUE KEY(`user_id`, `username`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 16
PROPERTIES (
    "enable_unique_key_merge_on_write" = "true"
);
```

MoR cannot be upgraded seamlessly to MoW because the two methods organize data differently. To migrate, create a new MoW table and run INSERT INTO unique-mow-table SELECT * FROM source_table to import the existing data.

The unique deletion sign and the sequence column still work with MoW enabled.
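For reference, a sequence column lets "latest row wins" be decided by a business column rather than by import order. A sketch, assuming the Doris-style function_column.sequence_col table property is available in your SelectDB version (the table name here is illustrative):

```sql
-- Among rows with the same key, the row with the larger register_time
-- is retained, regardless of the order in which rows were imported.
CREATE TABLE IF NOT EXISTS test.example_tbl6_seq
(
    `user_id` LARGEINT NOT NULL COMMENT "The user ID",
    `username` VARCHAR(50) NOT NULL COMMENT "The nickname of the user",
    `register_time` DATETIME COMMENT "The time when the user is registered"
)
UNIQUE KEY(`user_id`, `username`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 16
PROPERTIES (
    "enable_unique_key_merge_on_write" = "true",
    "function_column.sequence_col" = "register_time"
);
```

Check your product version's documentation for the exact property name before relying on this.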
MoR
MoR applies deduplication logic at query time by merging data on read. It is equivalent to the Aggregate key model with all value columns set to REPLACE.
```sql
CREATE TABLE IF NOT EXISTS test.example_tbl4
(
    `user_id` LARGEINT NOT NULL COMMENT "The user ID",
    `username` VARCHAR(50) NOT NULL COMMENT "The nickname of the user",
    `city` VARCHAR(20) COMMENT "The city in which the user resides",
    `age` SMALLINT COMMENT "The age of the user",
    `sex` TINYINT COMMENT "The gender of the user",
    `phone` LARGEINT COMMENT "The phone number of the user",
    `address` VARCHAR(500) COMMENT "The address of the user",
    `register_time` DATETIME COMMENT "The time when the user is registered"
)
UNIQUE KEY(`user_id`, `username`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 16;
```

The MoR Unique key model above is internally identical to the following Aggregate key model with all value columns using REPLACE:
```sql
CREATE TABLE IF NOT EXISTS test.example_tbl5
(
    `user_id` LARGEINT NOT NULL COMMENT "The user ID",
    `username` VARCHAR(50) NOT NULL COMMENT "The nickname of the user",
    `city` VARCHAR(20) REPLACE COMMENT "The city in which the user resides",
    `age` SMALLINT REPLACE COMMENT "The age of the user",
    `sex` TINYINT REPLACE COMMENT "The gender of the user",
    `phone` LARGEINT REPLACE COMMENT "The phone number of the user",
    `address` VARCHAR(500) REPLACE COMMENT "The address of the user",
    `register_time` DATETIME REPLACE COMMENT "The time when the user is registered"
)
AGGREGATE KEY(`user_id`, `username`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 16;
```

Duplicate key model
The Duplicate key model differs from the other two in that it stores all rows exactly as imported, with no aggregation and no uniqueness enforcement. Multiple rows with the same key column values coexist without affecting each other.
Use this model for ad hoc queries on raw data such as logs, where you need the full detail and do not need deduplication or pre-aggregation.
The key columns specified in the CREATE TABLE statement act as sort keys only, not unique identifiers. We recommend that you select the first two to four columns as the duplicate key.
```sql
CREATE TABLE IF NOT EXISTS test.example_tbl7
(
    `timestamp` DATETIME NOT NULL COMMENT "The time when the log was generated",
    `type` INT NOT NULL COMMENT "The type of the log",
    `error_code` INT COMMENT "The error code",
    `error_msg` VARCHAR(1024) COMMENT "The error message",
    `op_id` BIGINT COMMENT "The owner ID",
    `op_time` DATETIME COMMENT "The time when the error was handled"
)
DUPLICATE KEY(`timestamp`, `type`, `error_code`)
DISTRIBUTED BY HASH(`type`) BUCKETS 16;
```

| Column | Type | Sort key | Comment |
|---|---|---|---|
| timestamp | DATETIME | Yes | The time when the log was generated. |
| type | INT | Yes | The type of the log. |
| error_code | INT | Yes | The error code. |
| error_msg | VARCHAR(1024) | No | The error message. |
| op_id | BIGINT | No | The owner ID. |
| op_time | DATETIME | No | The time when the error was handled. |
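Because the duplicate key is a sort key, queries that filter on a prefix of (timestamp, type, error_code) let the engine skip large ranges of sorted data. A sketch against example_tbl7 (the date range and type value are illustrative):

```sql
-- Filters on the leading sort-key columns, which supports efficient pruning.
SELECT error_code, error_msg
FROM test.example_tbl7
WHERE timestamp >= '2024-01-01 00:00:00'
  AND timestamp <  '2024-01-02 00:00:00'
  AND type = 1;
```

A filter on op_id alone, by contrast, cannot use the sort order and scans more data.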
Key columns across models
Key columns serve different roles depending on the model:
| Model | Role of key columns |
|---|---|
| Duplicate key model | Sort keys only — not unique identifiers |
| Aggregate key model | Sort keys and unique identifiers |
| Unique key model | Sort keys and unique identifiers |
Choose a data model
| Scenario | Recommended model |
|---|---|
| Fixed-pattern aggregation reports (for example, daily active users, revenue by region) | Aggregate key model |
| Relational data with primary key constraints (for example, orders, user profiles) | Unique key model (MoW) |
| High-performance aggregate queries on primary-key data | Unique key model with MoW enabled |
| Raw data analysis, log queries, ad hoc exploration | Duplicate key model |
| Partial column updates | Unique key model — see Partial Update |
When not to use the Aggregate key model: Avoid it when your COUNT(*) queries must be fast, or when you need to run aggregate functions that differ from the aggregation types defined at table creation. The Aggregate key model is only efficient when rows collapse significantly — if most rows have unique keys, the pre-aggregation overhead outweighs the benefit.
How MoW improves COUNT(*) performance
MoW uses a delete bitmap to track overwritten rows rather than aggregating at query time. Using the same two-batch scenario from the Aggregate key model limitation above:
After batch 1:
| user_id | date | cost | delete bit |
|---|---|---|---|
| 10001 | 2017-11-20 | 50 | false |
| 10002 | 2017-11-21 | 39 | false |
After batch 2 (batch 1's duplicate row is marked deleted):
| user_id | date | cost | delete bit |
|---|---|---|---|
| 10001 | 2017-11-20 | 50 | true |
| 10002 | 2017-11-21 | 39 | false |
Batch 2's rows are written as new data:
| user_id | date | cost | delete bit |
|---|---|---|---|
| 10001 | 2017-11-20 | 1 | false |
| 10001 | 2017-11-21 | 5 | false |
| 10003 | 2017-11-22 | 22 | false |
COUNT(*) skips rows whose delete bit is true and scans only one column, correctly returning 4 with minimal overhead. In the test environment, COUNT(*) queries on a Unique key table with MoW enabled deliver 10 times higher performance than on an Aggregate key table.