This topic describes the Hologres development standards. You can prevent operation errors by complying with these standards.

Data domains

  • Data layers

    A data warehouse is divided into the operational data store (ODS), common data model (CDM), and application data store (ADS) layers. The CDM layer consists of the data warehouse detail (DWD), data warehouse service (DWS), and dimension data (DIM) layers. In Hologres, layers are distinguished by schema.

    • ODS
    • CDM
      • DWD
      • DWS
      • DIM
    • ADS

    You can divide the data warehouse of your enterprise into layers at different levels. If your enterprise has different business units, you can divide the data warehouse by business unit and use the abbreviation of a business unit name as the prefix of the corresponding schema to distinguish layers.

    create schema ${bu}_ads;
    create schema ${bu}_ads_dev;
    create schema ${bu}_dwd;
    create schema ${bu}_dwd_dev;
    create schema ${bu}_dws;
    create schema ${bu}_dws_dev;
    create schema ${bu}_dim;
    create schema ${bu}_dim_dev;
    create schema ${bu}_ods;
    create schema ${bu}_ods_dev;
  • Abbreviations of data domains

    We recommend that you specify an abbreviation for each data domain and enforce the use of abbreviations within your enterprise to ensure consistency. The following table describes some abbreviations of data domains that you can use.

    Data domain Sample abbreviation
    Transaction trd
    Product itm
    Log log
    Member and store mbr
    Purchase, stock, and sale management dst
    Sale and customer service crm
    Credit and risk control rsk
    Tool and service tls
    Logistics and express delivery lgt

Naming conventions

  • Jobs

    To distinguish between internal and synchronization jobs, you must comply with the following naming conventions:

    • Internal asynchronous SQL jobs: holo_{target_table_name}. This distinguishes internal SQL jobs from jobs that involve external tables.
    • Data import jobs: {source}2holo_{target_table_name}.
    • Data export jobs: holo2{target}_{target_table_name}.
  • Tables
    Layer Format Example
    DWD ${bu}_dwd.Data domain_Business process_[Custom root]_Suffix taobao_dwd._trd_ord_flow
    DWS ${bu}_dws.Data domain_Abbreviated statistic granularity_Business process_[{Custom root}]_Statistical period taobao_dws.trd_all_dtr, taobao_dws.log_slr_pv_dtr
    DIM ${bu}_dim.{Dimension}[_{Custom root}] taobao_cdm.dim_itm
    ADS ${bu}_ads.Business domain_Dimension_[{Custom root}]_{Update frequency}
    Note Valid values of the Update frequency variable:
    • d: daily update
    • r: real-time update
    • h: quasi-real-time update
    taobao_ads.trd_cate_d
  • Table groups

    If you require multiple table groups, you can name the table groups in the following format: ${bu}_{Layer}_{Business}_tg.

  • Views

    Views can be named in the following formats:

    • Formats
      • DWS: ${bu}_dws.Data domain_Abbreviated statistic granularity_Business process_[{Custom root}]_Statistical period_v
      • ADS: ${bu}_ads.Business domain_Dimension_[{Custom root}]_{Update frequency}_v
    • Example
      taobao_dws.trd_byr_itm_ord_cm_v
  • Foreign tables

    We recommend that you use the name of the source MaxCompute table suffixed with ext as the name of a foreign table. Example:

    taobao_dim.camp_ext
  • Temporary tables

    We recommend that you use the name of the original table prefixed with tmp and suffixed with a sequence number as the name of a temporary table. Example:

    taobao_dim.tmp_camp_01
  • Abbreviations
    Statistical period Abbreviation
    Last one day 1d
    Last N days nd
    Total days td
    Calendar week cw
    Calendar month cm
    Accrual to the current day dtr
    Accrual to the current hour dhr

Table configuration standards

  • Internal tables

    When you create an internal table, you must comply with the naming conventions for the table and its fields. In addition, you must set the lifecycle of the table based on your business requirements and add comments for the table and its fields. Take note of the following standards and recommendations when you configure an internal table:

    • Standards that you must comply with to ensure successful creation of the internal table
      • Add concise and accurate comments for the table and its fields to facilitate data development and analysis.
      • Specify the time_to_live_in_seconds property in the statement used to create the table. This property is used to specify the lifecycle of the table.
      • Specify the distribute_key property in the statement used to create the table. This property is used to specify the distribution key.

        The selected distribution key must have sufficient cardinality and is most frequently used for the JOIN and GROUP BY operations. For example, in a table that stores product data, you may select the user_id and item_id fields as the distribution key. However, if the user_id field is most frequently used to join data, you must select only the user_id field as the distribution key.

      • Create tables involved in join queries within the same table group.
      • Use the same name and data type for the same entity ID in all fact tables and dimension tables. For example, if you specify user_id as the user ID in a transaction table, you must also use user_id rather than uid as the user ID in a dimension table. Consistency in data type eliminates the need to convert data types.
      • Use ds for the partition fields of all physical tables.
    • Recommendations
      • Specify one of the bitmap_columns, segment_key, and cluster_key properties in the statement used to create the table.
      • Do not specify the dictionary_encoding_columns property if you are unsure about the cardinality of fields in the table. This property is used to build dictionary mappings. You can execute the following statement to skip dictionary mappings:
        call set_table_property('table_name', 'dictionary_encoding_columns','')
      • Set the orientation property to column to set the table as a column-oriented table. The row-oriented storage type is also supported by Hologres.
        Note Do not set the table as a row-oriented table unless all the primary key columns of the table can be specified in each query on the table by using the equal to or IN operator. By default, if the orientation property is not specified, the column-oriented storage type is used.
      • Set the bitmap_columns property. This property is used to specify the fields for which Hologres builds bitmap indexes. Bitmap indexes can be used to filter stored data.
        • We recommend that you convert filter conditions to bitmap indexes. By default, bitmap indexes are built for all fields of the TEXT type.
        • We recommend that you do not specify a field that has a large number of values for this property, such as the user_id field. Instead, you can specify fields such as activity_id.
      • Set the event_time_column property to the fields that are used for real-time data writing, such as the field that stores event timestamps.
      • Set the clustering_key property to the field for which Hologres builds a clustered index. Hologres sorts data based on clustered indexes. You can accelerate RANGE and FILTER queries on the index field by using the clustered index. Only a single clustered index is supported for each table. This property is available to filter data by range. For example, this property can be used for the distribution of gross merchandise volume (GMV).
  • Foreign tables

    Hologres allows you to use foreign tables to accelerate queries on data in source MaxCompute tables. This simplifies the data synchronization process. To improve computing performance, we recommend that you do not associate foreign tables with internal tables. Take note of the following standards and recommendations when you configure a foreign table for better foreign table management:

    • Standard that you must comply with: Follow the naming conventions and use the name of a source MaxCompute table suffixed with ext as the name of a foreign table.
    • Recommendations
      • Retain the DDL statement that is used to create the foreign table and manage different versions of the table.
      • Do not associate the foreign table with an internal table. Instead, synchronize data from the foreign table to the internal table.
  • Views
    • Standard that you must comply with: Follow the naming conventions.
    • Recommendations
      • Schedule tasks to configure dependencies in the entire business process.
      • Create individual views for tables that store data at different granularities to reduce the amount of data in computing.

        For example, you can create the cw (calendar week), cm (calendar month), nd (N days), and 1d (one day) views for four tables that store data at different granularities. If tables store data for different terminals, you can create the pc, wap, and app views. If tables store data for different data collection methods, you can create the ut and non-ut views.

  • Lifecycle of internal tables
    Layer Recommended lifecycle
    DWD For a table that stores day-level incremental data, we recommend that you set the lifecycle to a period of up to two years.
    DWS For a table that stores day-level incremental data, we recommend that you set the lifecycle to a period of up to two years.
    DIM For a large dimension table that is stored in ultimate storage mode, we recommend that you set the lifecycle to permanent. For a small dimension table, we recommend that you set the lifecycle to that of the source MaxCompute table.

    The difference between large and small dimension tables lies in the size of a single partition. If the size of a partition in a table exceeds 1 TB, the table is considered a large dimension table.

    Recommendations

    For a partitioned table, we recommend that you write real-time data to the partition generated on the same day and configure an appropriate time-to-live (TTL) period based on the layer. Do not write updated data to a partition whose TTL period exceeds the specified TTL period.

  • Table groups (optional)
    Each database has default table groups and a default shard count. You can create table groups or change the shard count based on your business requirements for better performance. Recommendations:
    • Do not create table groups unless necessary.
    • Create a table group with a large shard count for tables that contain large amounts of data.
    • Create a table group with a small shard count for a large number of tables that contain small amounts of data.
    • Put tables involved in join queries in the same table group.

Field setting standards

  • Field types

    The following table describes the field type standards that you must comply with.

    Field name or suffix Description Example Type or format
    user_id An auto-increment user ID field. user_id=232442843 int8
    item_id A field about the product. item_id=63283278784383 int8
    member_id A member ID field. member_id=b2b-dsajk2343821b TEXT
    *amt* A field about the amount. pay_ord_amt_1d_001=923.23 NUMERIC
    *fee* A field about the cost. post_fee=923.23 NUMERIC
    *cnt* A field about the quantity. pay_ord_byr_cnt_1d_001=923 int4/int8
    is_* A Boolean field. is_pm=Y/is_pm=true TEXT/BOOL
    ds A partition field. ds=20210120 YYYYMMDD
  • Data types

    The data types of Hologres are compatible with those of PostgreSQL. Hologres supports some of the data types of PostgreSQL. For more information about Hologres data types and the mappings between MaxCompute and Hologres data types, see Data types.

  • Currency and precision

    In regions within the Chinese mainland, use CNY as the currency unit. In other regions, use USD as the currency unit. Unless specifically stated, do not round data related to the amount of money. This can prevent inconsistent results of subsequent calculation.

SQL syntax

  • Standards that you must comply with:
    • Use the SELECT * statement only for fields that need to be calculated in the outmost query and subquery. Specify field names for all operations.
    • Use the COALESCE expression for empty fields and strings in WHERE clauses.
  • Recommendations
    • Use the count (distinct) method to set the distribution key. If you use the count (distinct) method multiple times, rewrite the statement. Example:
      select count(distinct userid)
           , count(distinct case when stat_date = '20201111' then userid end) 
      from t group by cate_id;
      
      Rewrite the statement to:
      select count(1), sum(c) from 
      (
        select userid
             , cate_id
             , cast(count(case when stat_date = '20201111' then 1 end) > 0) as c 
        from t 
        group by cate_id, userid
      ) t1 
      group by cate_id;
    • Execute the ANALYZE TABLE statement to analyze partitioned tables for offline scheduling tasks.
    • If your tables are required for an extended period of time, use the detach and attach methods to manage historical partitions to avoid radical changes in metrics.