Hologres development standards

Learn the naming conventions, table design rules, field type standards, and SQL best practices for Hologres development.

Data domain standards

Data warehouse layers

A data warehouse is built in layers, isolated by schemas in Hologres. The CDM layer includes DWD, DWS, and DIM.
- Operational data store (ODS): The operational data layer.
- Common data model (CDM): The public dimension model layer.
  - Data warehouse detail (DWD): The detail data layer.
  - Data warehouse summary (DWS): The summary data layer.
  - Dimension (DIM): The dimension data layer.
- Application data service (ADS): The application data layer.

Choose the schema granularity based on your business complexity. For example, if your enterprise has multiple business units (BUs), use the BU abbreviation as the schema prefix and create a matching _dev schema for each layer:
```
create schema ${bu}_ads;
create schema ${bu}_ads_dev;
create schema ${bu}_dwd;
create schema ${bu}_dwd_dev;
create schema ${bu}_dws;
create schema ${bu}_dws_dev;
create schema ${bu}_dim;
create schema ${bu}_dim_dev;
create schema ${bu}_ods;
create schema ${bu}_ods_dev;
```

Data domain abbreviations

Define shared codes for data domains to establish a company-wide standard. Examples:

Data domain name	Abbreviation
Transaction domain	trd
Item domain	itm
Log domain	log
Member and store domain	mbr
Supply, sales, and inventory management domain	dst
Sales and customer service domain	crm
Credit and risk control domain	rsk
Tools and services domain	tls
Logistics and express delivery domain	lgt

Naming conventions

Job naming conventions

Naming rules differ for internal tasks and sync tasks:
- Internal SQL tasks (non-sync tasks): holo_{target_table_name}. This format distinguishes them from external table tasks.
- Data import to Hologres: {source}2holo_{target_table_name}.
- Data export from Hologres: holo2{target}_{target_table_name}.

Table naming convention

Layer name	Naming rule for tables in this layer	Example
DWD	`${bu}_dwd.data_domain_business_process_[custom_root]_suffix`	`taobao_dwd.trd_ord_flow`
DWS	`${bu}_dws.data_domain_data_granularity_abbreviation_business_process_[{custom_root}]_statistical_period`	`taobao_dws.trd_all_dtr`, `taobao_dws.log_slr_pv_dtr`
DIM	`${bu}_dim.{dimension_definition}[_{custom_root}]`	`taobao_dim.itm`
ADS	`${bu}_ads.business_domain_dimension_[{custom_root}]_{refresh_period_identifier}` Refresh period identifiers: d (refreshed daily), r (refreshed in real time), h (refreshed in near real time).	`taobao_ads.trd_cate_d`

Table Group naming convention

For multiple Table Groups, use this format: ${bu}_{data_warehouse_layer_name}_{business_definition}_tg.

View naming convention

Naming rules and example for persistent views:
- Rules
  - DWS: ${bu}_dws.data_domain_data_granularity_abbreviation_business_process_[{custom_root}]_statistical_period_v.
  - ADS: ${bu}_ads.business_domain_dimension_[{custom_root}]_{refresh_period_identifier}_v.
- Example
```
taobao_dws.trd_byr_itm_ord_cm_v
```

External table naming conventions

Append the _ext suffix to the MaxCompute table name. Example:
```
taobao_dim.camp_ext
```

Temporary table naming convention

Add the tmp_ prefix and a numeric suffix to the table name. Example:
```
taobao_dim.tmp_camp_01
```

Common abbreviations

Statistical period	Abbreviation
Last day	1d
Recent days	nd
Cumulative	td
Calendar week	cw
Calendar month	cm
Cumulative to date	dtr
Cumulative to the current hour	dhr

Table development standards

Internal table standards

Before creating a table, determine names per the data model standards, set the lifecycle, and add comments for the table and all fields.
- Strict standards (required for publication):
  - Every table and field must have a concise comment. This applies to all data development scenarios.
  - The table creation statement must specify the table lifecycle (time_to_live_in_seconds).
  - The table creation statement must specify a distribution key (distribution_key). Select the distribution key according to the following principle.
    
    Select a well-distributed field frequently used in JOIN or GROUP BY. For example, for a buyer-item table, you could set user_id and item_id as the distribution key. However, if user_id is the most common join key, set only user_id, not both user_id and item_id.
  - Tables that are joined in queries must be created in the same Table Group.
  - Use the same name and data type for entity IDs across all fact and dimension tables. For example, if the transaction table uses user_id, the dimension table must also use user_id, not uid. Consistent types reduce conversions.
  - By default, use ds as the partition field for all physical tables.
- Recommended standards:
  - The table creation statement should specify at least one of the following properties: bitmap_columns, segment_key, or clustering_key.
  - If a field's cardinality is unclear, do not list it in dictionary_encoding_columns. To clear the property for a table, run:
```
call set_table_property('table_name', 'dictionary_encoding_columns', '');
```
  - For the orientation (data storage format) table property, the column format is recommended. You can also set this property to row.
    
    Note
    Use row format only if all queries specify all primary key columns with the equals or in operator. Default: column format.
  - The bitmap_columns property enables bitmap-based filtering within storage files.
    - Set bitmap_columns to the fields that are used in filter conditions. By default, all TEXT fields are included.
    - Do not set high-cardinality fields such as user_id as bitmap_columns. Use low-cardinality fields such as activity ID instead.
  - Set the event_time_column table property to fields that track real-time writes, such as an event timestamp.
  - The clustering_key property sorts data by the specified columns, which accelerates range and filter queries on those columns. A table can have only one clustering key. It suits range filtering, such as GMV bucketing.
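
The strict and recommended standards above can be combined into a single table creation transaction. The following is a minimal sketch, not a definitive template. The table taobao_dws.trd_byr_itm_ord_1d, its columns, and every property value are hypothetical. The sketch also assumes that the taobao_dws schema already exists and that your Hologres version accepts the event_time_column property name.

```sql
BEGIN;

-- Hypothetical buyer-item summary table, partitioned by ds.
-- The distribution key (user_id) is part of the primary key.
CREATE TABLE taobao_dws.trd_byr_itm_ord_1d (
    user_id        int8        NOT NULL,
    item_id        int8        NOT NULL,
    activity_id    text,
    pay_ord_amt_1d numeric(18, 4),
    gmt_modified   timestamptz NOT NULL,
    ds             text        NOT NULL,
    PRIMARY KEY (user_id, item_id, ds)
)
PARTITION BY LIST (ds);

-- Column storage, with user_id as the most common join key.
CALL set_table_property('taobao_dws.trd_byr_itm_ord_1d', 'orientation', 'column');
CALL set_table_property('taobao_dws.trd_byr_itm_ord_1d', 'distribution_key', 'user_id');
-- Low-cardinality filter column only; user_id is deliberately excluded.
CALL set_table_property('taobao_dws.trd_byr_itm_ord_1d', 'bitmap_columns', 'activity_id');
CALL set_table_property('taobao_dws.trd_byr_itm_ord_1d', 'event_time_column', 'gmt_modified');
-- Lifecycle: 730 days * 86400 seconds (illustrative value).
CALL set_table_property('taobao_dws.trd_byr_itm_ord_1d', 'time_to_live_in_seconds', '63072000');

COMMENT ON TABLE taobao_dws.trd_byr_itm_ord_1d IS 'Buyer-item order summary, last day';
COMMENT ON COLUMN taobao_dws.trd_byr_itm_ord_1d.user_id IS 'Buyer ID';
COMMENT ON COLUMN taobao_dws.trd_byr_itm_ord_1d.item_id IS 'Item ID';
COMMENT ON COLUMN taobao_dws.trd_byr_itm_ord_1d.activity_id IS 'Activity ID';
COMMENT ON COLUMN taobao_dws.trd_byr_itm_ord_1d.pay_ord_amt_1d IS 'Paid order amount, last day';
COMMENT ON COLUMN taobao_dws.trd_byr_itm_ord_1d.gmt_modified IS 'Last modification time';
COMMENT ON COLUMN taobao_dws.trd_byr_itm_ord_1d.ds IS 'Partition date, YYYYMMDD';

COMMIT;

-- Child partition for one day of data.
CREATE TABLE taobao_dws.trd_byr_itm_ord_1d_20210120
    PARTITION OF taobao_dws.trd_byr_itm_ord_1d
    FOR VALUES IN ('20210120');
```

The set_table_property calls run in the same transaction as CREATE TABLE, so the table is never visible without its properties.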

MaxCompute foreign table standards

Hologres supports accelerated queries on MaxCompute through foreign tables. Avoid joining internal tables with foreign tables unless necessary. Follow these standards to manage foreign tables effectively.
- Strict standard: Follow the foreign table naming convention. Append the _ext suffix to the MaxCompute table name.
- Recommended standards:
  - Preserve the DDL of the table schema and place it under version control.
  - Do not join internal tables with foreign tables. Instead, synchronize data from the foreign table to an internal table.
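
As a rough illustration of these standards, the sketch below maps a MaxCompute table to a foreign table with the _ext suffix and then copies its data into an internal table for joins. The MaxCompute project my_mc_project, the source table camp, and the column list are hypothetical. The server name odps_server is assumed to be the MaxCompute foreign server available in your instance.

```sql
-- Foreign table: MaxCompute table name plus the _ext suffix.
CREATE FOREIGN TABLE taobao_dim.camp_ext (
    camp_id   int8,
    camp_name text
)
SERVER odps_server
OPTIONS (project_name 'my_mc_project', table_name 'camp');

-- Internal copy used for joins with other internal tables.
CREATE TABLE taobao_dim.camp (
    camp_id   int8 NOT NULL,
    camp_name text,
    PRIMARY KEY (camp_id)
);

-- Synchronize instead of joining against the foreign table directly.
INSERT INTO taobao_dim.camp (camp_id, camp_name)
SELECT camp_id, camp_name
FROM taobao_dim.camp_ext;
```

Keeping this DDL in a version-controlled file satisfies the first recommended standard as well.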

View standards

- Strict standard: You must strictly follow the view naming convention.
- Recommended standards:
  - Enable task scheduling to maintain job dependency chains.
  - Create separate views for different granularities to avoid excessive computation.
    
    For example, create separate views for cw, cm, nd, and 1d. For different clients, create views for pc, wap, and app. For different collection methods, separate into ut and non-ut.
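
One way to split views by granularity is sketched below. It assumes a hypothetical source table taobao_dws.trd_byr_itm_ord_1d with the columns user_id, item_id, pay_ord_amt_1d, and a text partition column ds in YYYYMMDD format. The view names follow the persistent view naming rule with the _v suffix.

```sql
-- Daily (1d) view: passes the detail rows through with explicit columns.
CREATE VIEW taobao_dws.trd_byr_itm_ord_1d_v AS
SELECT user_id,
       item_id,
       pay_ord_amt_1d,
       ds
FROM taobao_dws.trd_byr_itm_ord_1d;

-- Calendar-month (cm) view: aggregates the same source by month,
-- so monthly reports do not recompute from a shared, wider view.
CREATE VIEW taobao_dws.trd_byr_itm_ord_cm_v AS
SELECT user_id,
       item_id,
       substr(ds, 1, 6)    AS stat_month,
       sum(pay_ord_amt_1d) AS pay_ord_amt_cm
FROM taobao_dws.trd_byr_itm_ord_1d
GROUP BY user_id, item_id, substr(ds, 1, 6);
```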

Lifecycle (internal tables only) standards

Data warehouse layer	Corresponding lifecycle rule description
DWD	For daily incremental details, the recommended retention period is no more than 2 years.
DWS	For daily incremental details, the recommended retention period is no more than 2 years.
DIM	Large dimension tables: permanent retention after storage modeling. Small dimension tables: match the MaxCompute table lifecycle. The threshold between large and small dimension tables is 1 TB per partition.

Recommended standard:

For partitioned tables, write real-time data to the current day's partition and set TTL based on the data warehouse layer. Do not write to partitions past their TTL.
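
For instance, a two-year retention period can be expressed through the time_to_live_in_seconds property. This is a sketch only: the table name taobao_dwd.trd_ord_flow is taken from the naming examples and is assumed to exist, and the value is simply 730 days converted to seconds.

```sql
-- 730 days * 86400 seconds per day = 63072000 seconds.
CALL set_table_property('taobao_dwd.trd_ord_flow', 'time_to_live_in_seconds', '63072000');
```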

Table Group standards (optional)

Each database has a default Table Group and shard count. Create new Table Groups or adjust the shard count as needed for better performance.
- Do not create new Table Groups unless necessary.
- For tables with a large data volume, you can create a separate Table Group with a larger shard count.
- For many tables with small data volumes, you can create a Table Group with a smaller shard count.
- Tables that need to be joined in queries must be placed in the same Table Group.
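
If a separate Table Group is justified, it could be created and used roughly as follows. The Table Group name follows the ${bu}_{data_warehouse_layer_name}_{business_definition}_tg format. The shard count of 32, the table taobao_dws.trd_all_dtr, and its columns are hypothetical values chosen for illustration.

```sql
-- Create a Table Group with a larger shard count for high-volume tables.
CALL hg_create_table_group('taobao_dws_trd_tg', 32);

BEGIN;

CREATE TABLE taobao_dws.trd_all_dtr (
    user_id     int8 NOT NULL,
    pay_ord_amt numeric(18, 4),
    ds          text NOT NULL
);

-- Place the table in the new Table Group. Tables that are joined
-- with it should name the same Table Group.
CALL set_table_property('taobao_dws.trd_all_dtr', 'table_group', 'taobao_dws_trd_tg');
CALL set_table_property('taobao_dws.trd_all_dtr', 'distribution_key', 'user_id');

COMMIT;
```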

Field development standards

Field type standards

Create fields using the following type requirements.

Field or field suffix	Field comment	Example	Data type
user_id	Auto-incrementing member ID	`user_id=232442843`	`int8`
item_id	Item ID	`item_id=63283278784383`	`int8`
member_id	Registered member ID	`member_id=b2b-dsajk2343821b`	TEXT
amt	Amount type	`pay_ord_amt_1d_001=923.23`	NUMERIC
fee	Fee type	`post_fee=923.23`	NUMERIC
cnt	Quantity type	`pay_ord_byr_cnt_1d_001=923`	`int4/int8`
is_*	Boolean type	`is_pm=Y/is_pm=true`	TEXT/BOOL
ds	Partition	`ds=20210120`	YYYYMMDD
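
Applied to a table definition, the type requirements above might look like the following sketch. The table taobao_dwd.trd_ord_flow is reused from the naming examples, the column set is hypothetical, and the numeric(18, 4) precision and the text type for ds are assumptions rather than stated requirements.

```sql
CREATE TABLE taobao_dwd.trd_ord_flow (
    user_id     int8 NOT NULL,   -- auto-incrementing member ID
    item_id     int8,            -- item ID
    member_id   text,            -- registered member ID, such as b2b-...
    pay_ord_amt numeric(18, 4),  -- amt suffix: amount, stored unrounded
    post_fee    numeric(18, 4),  -- fee suffix: fee
    pay_ord_cnt int4,            -- cnt suffix: quantity
    is_pm       bool,            -- is_ prefix: Boolean flag
    ds          text NOT NULL    -- partition field, YYYYMMDD
)
PARTITION BY LIST (ds);
```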

Basic data type reference

Hologres data types are compatible with a subset of PostgreSQL types. Field types and their MaxCompute mappings are listed in Data type summary.

Currency unit and precision

The currency unit is USD. Unless otherwise specified in the model, do not round any amount-related data. This practice prevents inconsistencies in summary calculations that use different statistical methods.

SQL standards

Strict standards:
- Do not use select * in the outermost query or subqueries. Always specify column names explicitly.
- In WHERE clauses, use the COALESCE function to handle NULL values and empty strings.
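
A query that follows both strict standards could look like this sketch. The table taobao_dwd.trd_ord_flow and its columns are hypothetical.

```sql
-- Avoid: SELECT * FROM taobao_dwd.trd_ord_flow WHERE member_id <> '';
-- That form hides the column list and treats NULL and '' differently.

SELECT user_id,
       item_id,
       pay_ord_amt
FROM taobao_dwd.trd_ord_flow
WHERE ds = '20210120'
  -- COALESCE maps NULL to '', so NULL and empty member IDs are both excluded.
  AND coalesce(member_id, '') <> '';
```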

Recommended standards:

Use the field in a COUNT DISTINCT operation as the distribution_key. If a statement contains multiple COUNT DISTINCT operations, rewrite it manually. For example:

```
select count(distinct userid)
     , count(distinct case when stat_date = '20201111' then userid end)
from t group by cate_id;

-- Rewrite as follows
select count(1), sum(c) from
(
  select userid
       , cate_id
       -- 1 if the user has at least one row on the given date, otherwise 0
       , cast((count(case when stat_date = '20201111' then 1 end) > 0) as int) as c
  from t
  group by cate_id, userid
) t1
group by cate_id;
```

For offline scheduling tasks, run ANALYZE on partitioned tables.

For batch operations on historical partitions, use ATTACH PARTITION and DETACH PARTITION so that queries do not see drastic metric fluctuations while the data is being replaced.
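
The two recommendations above might be applied as in the following sketch. The parent table taobao_dwd.trd_ord_flow, the old child partition trd_ord_flow_20210120, and the rebuilt replacement table trd_ord_flow_20210120_new are hypothetical. The replacement table is assumed to have the same schema as the parent and to be fully loaded before the swap.

```sql
-- Refresh statistics after the scheduled load. On a partitioned table,
-- run ANALYZE against the parent table.
ANALYZE taobao_dwd.trd_ord_flow;

-- Swap a rebuilt historical partition in one transaction so that
-- queries see either the old data or the new data, never a partial load.
BEGIN;

ALTER TABLE taobao_dwd.trd_ord_flow
    DETACH PARTITION taobao_dwd.trd_ord_flow_20210120;

ALTER TABLE taobao_dwd.trd_ord_flow
    ATTACH PARTITION taobao_dwd.trd_ord_flow_20210120_new
    FOR VALUES IN ('20210120');

COMMIT;
```

After the swap, the detached table can be dropped or kept as a backup, depending on the retention rule for its layer.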