All Products
Search
Document Center

Hologres:Hologres development standards

Last Updated:Jun 16, 2023

This topic describes the Hologres development standards. You can prevent operation errors by complying with these standards.

Data domains

  • Data layers

    A data warehouse is divided into the operational data store (ODS), common data model (CDM), and application data store (ADS) layers. The CDM layer consists of the data warehouse detail (DWD), data warehouse service (DWS), and dimension data (DIM) layers. In Hologres, layers are distinguished by schema.

    • ODS (Operational Data Store)

    • CDM (Common Data Model)

      • DWD (Data Warehouse Detail)

      • DWS (Data Warehouse Summary)

      • DIM (Dimension)

    • ADS (Application Data Service)

    You can divide the data warehouse of your enterprise into layers at different levels. If your enterprise has different business units, you can divide the data warehouse by business unit and use the abbreviation of a business unit name as the prefix of the corresponding schema to distinguish layers.

    create schema ${bu}_ads;
    create schema ${bu}_ads_dev;
    create schema ${bu}_dwd;
    create schema ${bu}_dwd_dev;
    create schema ${bu}_dws;
    create schema ${bu}_dws_dev;
    create schema ${bu}_dim;
    create schema ${bu}_dim_dev;
    create schema ${bu}_ods;
    create schema ${bu}_ods_dev;
  • Abbreviations of data domains

    We recommend that you specify an abbreviation for each data domain and enforce the use of abbreviations within your enterprise to ensure consistency. The following table describes some abbreviations of data domains that you can use.

    Data domain

    Sample abbreviation

    Transaction

    trd

    Product

    itm

    Log

    log

    Member and store

    mbr

    Purchase, stock, and sale management

    dst

    Sale and customer service

    crm

    Credit and risk control

    rsk

    Tool and service

    tls

    Logistics and express delivery

    lgt

Naming conventions

  • Jobs

    To distinguish between internal and synchronization jobs, you must comply with the following naming conventions:

    • Internal asynchronous SQL jobs: holo_{target_table_name}. This distinguishes internal SQL jobs from jobs that involve external tables.

    • Data import jobs: {source}2holo_{target_table_name}.

    • Data export jobs: holo2{target}_{target_table_name}.

  • Tables

    Layer

    Format

    Example

    DWD

    ${bu}_dwd.Data domain_Business process_[Custom root]_Suffix

    taobao_dwd.trd_ord_flow

    DWS

    ${bu}_dws.Data domain_Abbreviated statistic granularity_Business process_[{Custom root}]_Statistical period

    taobao_dws.trd_all_dtr, taobao_dws.log_slr_pv_dtr

    DIM

    ${bu}_dim.{Dimension}[_{Custom root}]

    taobao_cdm.dim_itm

    ADS

    ${bu}_ads.Business domain_Dimension_[{Custom root}]_{Update frequency}

    Note

    Valid values of the Update frequency variable:

    • d: daily update

    • r: real-time update

    • h: quasi-real-time update

    taobao_ads.trd_cate_d

  • Table groups

    If you require multiple table groups, you can name the table groups in the following format: ${bu}_{Layer}_{Business}_tg.

  • Views

    Views can be named in the following formats:

    • Formats

      • DWS: ${bu}_dws.Data domain_Abbreviated statistic granularity_Business process_[{Custom root}]_Statistical period_v

      • ADS: ${bu}_ads.Business domain_Dimension_[{Custom root}]_{Update frequency}_v

    • Example

      taobao_dws.trd_byr_itm_ord_cm_v
  • Foreign tables

    We recommend that you use the name of the source MaxCompute table suffixed with ext as the name of a foreign table. Example:

    taobao_dim.camp_ext
  • Temporary tables

    We recommend that you use the name of the original table prefixed with tmp and suffixed with a sequence number as the name of a temporary table. Example:

    taobao_dim.tmp_camp_01
  • Abbreviations

    Statistical period

    Abbreviation

    Last one day

    1d

    Last N days

    nd

    Total days

    td

    Calendar week

    cw

    Calendar month

    cm

    Accrual to the current day

    dtr

    Accrual to the current hour

    dhr

Table configuration standards

  • Internal tables

    When you create an internal table, you must comply with the naming conventions for the table and its fields. In addition, you must set the lifecycle of the table based on your business requirements and add comments for the table and its fields. Take note of the following standards and recommendations when you configure an internal table:

    • Standards that you must comply with to ensure successful creation of the internal table

      • Add concise and accurate comments for the table and its fields to facilitate data development and analysis.

      • Specify the time_to_live_in_seconds property in the statement used to create the table. This property is used to specify the lifecycle of the table.

      • Specify the distribute_key property in the statement used to create the table. This property is used to specify the distribution key.

        The selected distribution key must have sufficient cardinality and is most frequently used for the JOIN and GROUP BY operations. For example, in a table that stores product data, you may select the user_id and item_id fields as the distribution key. However, if the user_id field is most frequently used to join data, you must select only the user_id field as the distribution key.

      • Create tables involved in join queries within the same table group.

      • Use the same name and data type for the same entity ID in all fact tables and dimension tables. For example, if you specify user_id as the user ID in a transaction table, you must also use user_id rather than uid as the user ID in a dimension table. Consistency in data type eliminates the need to convert data types.

      • Use ds for the partition fields of all physical tables.

    • Recommendations

      • Specify one of the bitmap_columns, segment_key, and cluster_key properties in the statement used to create the table.

      • Do not specify the dictionary_encoding_columns property if you are unsure about the cardinality of fields in the table. This property is used to build dictionary mappings. You can execute the following statement to skip dictionary mappings:

        call set_table_property('table_name', 'dictionary_encoding_columns','')
      • Set the orientation property to column to set the table as a column-oriented table. The row-oriented storage type is also supported by Hologres.

        Note

        Do not set the table as a row-oriented table unless all the primary key columns of the table can be specified in each query on the table by using the equal to or IN operator. By default, if the orientation property is not specified, the column-oriented storage type is used.

      • Set the bitmap_columns property. This property is used to specify the fields for which Hologres builds bitmap indexes. Bitmap indexes can be used to filter stored data.

        • We recommend that you convert filter conditions to bitmap indexes. By default, bitmap indexes are built for all fields of the TEXT type.

        • We recommend that you do not specify a field that has a large number of values for this property, such as the user_id field. Instead, you can specify fields such as activity_id.

      • Set the event_time_column property to the fields that are used for real-time data writing, such as the field that stores event timestamps.

      • Set the clustering_key property to the field for which Hologres builds a clustered index. Hologres sorts data based on clustered indexes. You can accelerate RANGE and FILTER queries on the index field by using the clustered index. Only a single clustered index is supported for each table. This property is available to filter data by range. For example, this property can be used for the distribution of gross merchandise volume (GMV).

  • Foreign tables

    Hologres allows you to use foreign tables to accelerate queries on data in source MaxCompute tables. This simplifies the data synchronization process. To improve computing performance, we recommend that you do not associate foreign tables with internal tables. Take note of the following standards and recommendations when you configure a foreign table for better foreign table management:

    • Standard that you must comply with: Follow the naming conventions and use the name of a source MaxCompute table suffixed with ext as the name of a foreign table.

    • Recommendations

      • Retain the DDL statement that is used to create the foreign table and manage different versions of the table.

      • Do not associate the foreign table with an internal table. Instead, synchronize data from the foreign table to the internal table.

  • Views

    • Standard that you must comply with: Follow the naming conventions.

    • Recommendations

      • Schedule tasks to configure dependencies in the entire business process.

      • Create individual views for tables that store data at different granularities to reduce the amount of data in computing.

        For example, you can create the cw (calendar week), cm (calendar month), nd (N days), and 1d (one day) views for four tables that store data at different granularities. If tables store data for different terminals, you can create the pc, wap, and app views. If tables store data for different data collection methods, you can create the ut and non-ut views.

  • Lifecycle of internal tables

    Layer

    Recommended lifecycle

    DWD

    For a table that stores day-level incremental data, we recommend that you set the lifecycle to a period of up to two years.

    DWS

    For a table that stores day-level incremental data, we recommend that you set the lifecycle to a period of up to two years.

    DIM

    For a large dimension table that is stored in ultimate storage mode, we recommend that you set the lifecycle to permanent. For a small dimension table, we recommend that you set the lifecycle to that of the source MaxCompute table.

    The difference between large and small dimension tables lies in the size of a single partition. If the size of a partition in a table exceeds 1 TB, the table is considered a large dimension table.

    Recommendations

    For a partitioned table, we recommend that you write real-time data to the partition generated on the same day and configure an appropriate time-to-live (TTL) period based on the layer. Do not write updated data to a partition whose TTL period exceeds the specified TTL period.

  • Table groups (optional)

    Each database has default table groups and a default shard count. You can create table groups or change the shard count based on your business requirements for better performance. Recommendations:

    • Do not create table groups unless necessary.

    • Create a table group with a large shard count for tables that contain large amounts of data.

    • Create a table group with a small shard count for a large number of tables that contain small amounts of data.

    • Put tables involved in join queries in the same table group.

Field setting standards

  • Field types

    The following table describes the field type standards that you must comply with.

    Field name or suffix

    Description

    Example

    Type or format

    user_id

    An auto-increment user ID field.

    user_id=232442843

    int8

    item_id

    A field about the product.

    item_id=63283278784383

    int8

    member_id

    A member ID field.

    member_id=b2b-dsajk2343821b

    TEXT

    *amt*

    A field about the amount.

    pay_ord_amt_1d_001=923.23

    NUMERIC

    *fee*

    A field about the cost.

    post_fee=923.23

    NUMERIC

    *cnt*

    A field about the quantity.

    pay_ord_byr_cnt_1d_001=923

    int4/int8

    is_*

    A Boolean field.

    is_pm=Y/is_pm=true

    TEXT/BOOL

    ds

    A partition field.

    ds=20210120

    YYYYMMDD

  • Data types

    The data types of Hologres are compatible with those of PostgreSQL. Hologres supports some of the data types of PostgreSQL. For more information about Hologres data types and the mappings between MaxCompute and Hologres data types, see Data types.

  • Currency and precision

    In regions within the Chinese mainland, use CNY as the currency unit. In other regions, use USD as the currency unit. Unless specifically stated, do not round data related to the amount of money. This can prevent inconsistent results of subsequent calculation.

SQL syntax

  • Standards that you must comply with:

    • Use the SELECT * statement only for fields that need to be calculated in the outmost query and subquery. Specify field names for all operations.

    • Use the COALESCE expression for empty fields and strings in WHERE clauses.

  • Recommendations

    • Use the count (distinct) method to set the distribution key. If you use the count (distinct) method multiple times, rewrite the statement. Example:

      select count(distinct userid)
           , count(distinct case when stat_date = '20201111' then userid end) 
      from t group by cate_id;
      
      Rewrite the statement to:
      select count(1), sum(c) from 
      (
        select userid
             , cate_id
             , cast(count(case when stat_date = '20201111' then 1 end) > 0) as c 
        from t 
        group by cate_id, userid
      ) t1 
      group by cate_id;
    • Execute the ANALYZE TABLE statement to analyze partitioned tables for offline scheduling tasks.

    • If your tables are required for an extended period of time, use the detach and attach methods to manage historical partitions to avoid radical changes in metrics.