This topic describes the development standards for Hologres. Follow these standards to understand development requirements and prevent operational errors.
Data domain standards
Data warehouse layers
A data warehouse is typically built in layers. The main layers are listed below. The common data model (CDM) layer includes the data warehouse detail (DWD), data warehouse service (DWS), and dimension (DIM) layers. In Hologres, you can use schemas to isolate different layers.
Operation data store (ODS): The operational data layer.
Common data model (CDM): The public dimension model layer.
Data warehouse detail (DWD): The detail data layer.
DWS (Data Warehouse Summary): Data warehouse summary.
Dimension (DIM): The dimension data layer.
Application data service (ADS): The application data layer.
You can choose the appropriate level of granularity based on your business complexity. For example, if your enterprise has multiple business units (BUs), you can use the BU abbreviation as the schema prefix.
create schema ${bu}_ads; create schema ${bu}_ads_dev; create schema ${bu}_dwd; create schema ${bu}_dwd_dev; create schema ${bu}_dws; create schema ${bu}_dws_dev; create schema ${bu}_dim; create schema ${bu}_dim_dev; create schema ${bu}_ods; create schema ${bu}_ods_dev;Data domain abbreviations
Define shared codes for different data domains to establish a company-wide standard. The following table provides some examples.
Data domain name
Data Domain Abbreviations: Examples
Transaction domain
trd
Item domain
itm
Log domain
log
Member and store domain
mbr
Supply, sales, and inventory management domain
dst
Sales and customer service domain
crm
Credit and risk control domain
rsk
Tools and services domain
tls
Logistics and express delivery domain
lgt
Naming conventions
Job naming conventions
The naming rules differ for internal tasks and sync tasks. The rules are as follows.
Internal SQL tasks (non-sync tasks):
holo_{target_table_name}. This format distinguishes them from external table tasks.Data import to Hologres:
{source}2holo_{target_table_name}.Data export from Hologres:
holo2{target}_{target_table_name}.
Table naming convention
Layer name
Naming rule for tables in this layer
Example
DWD
${bu}_dwd.data_domain_business_process_[custom_root]_suffixtaobao_dwd.trd_ord_flowDWS
${bu}_dws.data_domain_data_granularity_abbreviation_business_process_[{custom_root}]_statistical_periodtaobao_dws.trd_all_dtr, taobao_dws.log_slr_pv_dtrDIM
${bu}_dim.{dimension_definition}[_{custom_root}]taobao_cdm.dim_itmADS
${bu}_ads.business_domain_dimension_[{custom_root}]_{refresh_period_identifier}NoteThe refresh period identifiers are as follows.
d: Refreshed daily.
r: Refreshed in real time.
h: Refreshed in near real-time.
taobao_ads.trd_cate_dTable Group naming convention
If your business requires multiple Table Groups, name them using the following format:
${bu}_{data_warehouse_layer_name}_{business_definition}_tg.View naming convention
The following section describes the naming rules and provides an example for persistent views.
Rules
DWS:
${bu}_dws.data_domain_data_granularity_abbreviation_business_process_[{custom_root}]_statistical_period_v.ADS:
${bu}_ads.business_domain_dimension_[{custom_root}]_{refresh_period_identifier}_v.
Example
taobao_dws.trd_byr_itm_ord_cm_v
External table naming conventions
Add the
extsuffix to the original MaxCompute table name. The following code provides an example.taobao_dim.camp_extTemporary table naming convention
Add the
tmpprefix and a numeric suffix to the original table name. The following code provides an example.taobao_dim.tmp_camp_01Common abbreviations
Statistical period
Abbreviation
Last day
1d
Recent days
nd
Cumulative
td
Calendar week
cw
Calendar month
cm
Cumulative to date
dtr
Cumulative to the current hour
dhr
Table development standards
Internal table standards
Before you create a table, determine the names for the table and its fields according to the data model standards. You must also confirm the table's lifecycle based on your requirements and add complete comments for the table and its fields. The relevant standards are as follows.
Strict standards (Publication is not allowed if these standards are not met):
The output table and its fields must include a comment. This rule applies to all data development scenarios on the platform. The table's comment must be concise and clear.
The table creation statement must specify the table lifecycle (time_to_live_in_seconds).
The table creation statement must specify a distribution key (distribute_key). The principles for selecting a distribution key are as follows.
Select a field that is sufficiently distributed and is most frequently used in JOIN or GROUP BY operations. For example, for a buyer-item table, you can set user_id and item_id as the distribution key. However, if user_id is the key that is most frequently used for joins, set the distribution key to user_id only, not to both user_id and item_id.
Tables that are joined in queries must be created in the same Table Group.
Use the same name and data type for the same entity ID across all fact tables and dimension tables. For example, if the user ID in a transaction table is user_id, it must also be user_id in the dimension table, not uid. Keep the data types consistent to reduce data type conversions.
By default, use
dsas the partition field for all physical tables.
Recommended standards:
The table creation statement should specify at least one of the following properties: bitmap_columns, segment_key, or cluster_key.
If the cardinality of a field is unclear, do not set the
dictionary_encoding_columns(dictionary index) table property. You can run the following statement to set the property to empty.call set_table_property('table_name', 'dictionary_encoding_columns','')For the orientation (data storage format) table property, the column format is recommended. You can also set this property to row.
NoteDo not use the row format unless you can ensure that all queries on the table always specify all primary key columns using the equals or in operator. If this property is not set, the column storage format is used by default.
The bitmap_columns (bit encoding) table property enables bitmap to quickly filter data within storage files.
Set bitmap_columns to the fields that are used in filter conditions. By default, all TEXT fields are included.
Do not set fields with high cardinality, such as user_id, as bitmap_columns. We recommend that you set metrics such as activity ID as bitmap_columns.
The event_time_column table property must be used for fields that are related to real-time writes, such as an event timestamp.
The clustering_key (cluster index) table property sorts data based on the specified cluster index. Creating a cluster index can accelerate range and filter queries on the index columns. You can set only one cluster index. This property is suitable for range filtering, such as when you bucket GMV.
MaxCompute foreign table standards
Hologres supports accelerated queries on MaxCompute through foreign tables, which can simplify the data synchronization process. To improve computing performance, do not join internal tables with foreign tables unless it is necessary. To better manage and maintain foreign tables, you must follow these standards.
Strict standard: You must strictly follow the foreign table naming convention by adding the
extsuffix to the original MaxCompute table name.Recommended standards:
Preserve the DDL of the table schema and place it under version control.
Do not join internal tables with foreign tables. Instead, synchronize data from the foreign table to an internal table.
View standards
Strict standard: You must strictly follow the view naming convention.
Recommended standards:
Enable task scheduling to ensure the integrity of subsequent development job dependency chains.
Create separate views for different granularities to avoid excessive computation for complex requests.
For example, you can create four separate views for cw, cm, nd, and 1d. If you have different clients, you can create views for pc, wap, and app. If you use different collection methods, you can separate them into ut and non-ut.
Lifecycle (internal tables only) standards
Data warehouse layer
Corresponding lifecycle rule description
DWD
For daily incremental details, the recommended retention period is no more than 2 years.
DWS
For daily incremental details, the recommended retention period is no more than 2 years.
DIM
For large dimension tables, permanent retention is recommended after extreme storage modeling. For small dimension tables, keep the lifecycle consistent with the MaxCompute table.
The standard for distinguishing between large and small dimension tables: a single partition cannot exceed 1 TB.
Recommended standard:
For partitioned tables, write real-time task data into the current day's partition. Set an appropriate TTL based on the data warehouse layer. Do not write updated historical data into partitions that have already exceeded their TTL.
Table Group standards (optional)
Each database has a default Table Group and shard count. You can create new Table Groups or modify the shard count based on your business needs to achieve better performance. The recommended standards are as follows.
Do not create new Table Groups unless it is necessary.
For tables with a large data volume, you can create a separate Table Group with a larger shard count.
For many tables with small data volumes, you can create a Table Group with a smaller shard count.
Tables that need to be joined in queries must be placed in the same Table Group.
Field development standards
Field type standards
You must create field types in strict accordance with the following requirements.
Field/Field suffix
Field comment
Example
Abbreviations
user_id
Auto-incrementing member ID
user_id=232442843int8item_id
Item ID
item_id=63283278784383int8member_id
Registered member ID
member_id=b2b-dsajk2343821bTEXT
*amt*
Amount type
pay_ord_amt_1d_001=923.23NUMERIC
*fee*
Fee type
post_fee=923.23NUMERIC
*cnt*
Quantity type
pay_ord_byr_cnt_1d_001=923int4/int8is_*
Boolean type
is_pm=Y/is_pm=trueTEXT/BOOL
ds
Partition
ds=20210120YYYYMMDD
Basic data type reference
Hologres data types are compatible with PostgreSQL data types, but Hologres supports only a subset of them. For more information about field types and their mappings with MaxCompute, see Data type summary.
Currency unit and precision
The currency unit is USD. Unless otherwise specified in the model, do not round any amount-related data. This practice prevents inconsistencies in summary calculations that use different statistical methods.
SQL standards
Strict standards:
In the outermost layer of an SQL statement and within subqueries, do not use
select *for fields that do not require calculation. All operations must explicitly specify column names.In WHERE clauses, you must handle null fields and empty strings using the Coalesce function.
Recommended standards:
Frequently use a count distinct field as the distribution_key. For combinations of multiple count distinct operations, you must rewrite the statement manually.
select count(distinct userid) , count(distinct case when stat_date = '20201111' then userid end) from t group by cate_id; --Rewrite as follows select count(1), sum(c) from ( select userid , cate_id , cast(count(case when stat_date = '20201111' then 1 end) > 0) as c from t group by cate_id, userid ) t1 group by cate_id;For offline scheduling tasks, perform an analyze table operation on partitioned tables.
For long-term usage scenarios, use ATTACH/DETACH operations for batch operations on historical partitions to avoid drastic fluctuations in data metrics.