The operational data store (ODS) layer stores the raw data that is collected from your business systems. Such data is the source of upper-layer data. Data in a business system is application-oriented data that has been accumulated over a long period of time. It is very detailed and is frequently accessed.

Note Before creating tables in a MaxCompute data warehouse, you need to know the data types supported by MaxCompute.

Design ODS tables

In this tutorial, data at the ODS layer includes the order details, user details, and product details in the transaction data system. Such data is raw data. Logically, the raw data is stored in two-dimensional tables. Strictly speaking, the ODS layer is excluded from the data modeling of a data warehouse. However, you still need to properly plan the ODS layer and synchronize data. This tutorial uses the following six ODS tables as an example:
  • s_auction: records the information about products for auction.
  • s_sale: records the information about products for sale.
  • s_users_extra: records the detailed information about users.
  • s_biz_order_delta: records the information about new product orders.
  • s_dynamics_order_delta: records the information about new logistics orders.
  • s_pay_order_delta: records the information about new payment orders.
Note
  • We recommend that the naming conventions for tables and fields at the ODS layer be consistent with those in business systems. You can use tags to distinguish incremental tables from full tables. For example, the _delta tag is used to identify an incremental table.
  • Pay special attention to name conflicts when naming a table. For example, tables in different business systems may use the same name. To distinguish between two different tables with the same name, you can use the source database name of each table as the suffix or prefix of the table name. If the name of a field in a table is the same as a keyword, you can add _col1 to the field name as the suffix.

Follow ODS layer design specifications

You need to follow the naming conventions for tables and data synchronization nodes, and the specifications for data storage, lifecycle management, and data quality at the ODS layer.

Create tables

You can use the following statements to create tables at the ODS layer. For more information about the sample data, see Appendix.
CREATE TABLE IF NOT EXISTS s_auction
(
    id                             STRING COMMENT 'Product ID',
    title                          STRING COMMENT 'Product name',
    gmt_modified                   STRING COMMENT 'Last modification date of the product',
    price                          DOUBLE COMMENT 'Product price in RMB',
    starts                         STRING COMMENT 'Time when the product is put on the shelf',
    minimum_bid                    DOUBLE COMMENT 'Product opening bid in RMB',
    duration                       STRING COMMENT 'Validity period of the product for sale, in days',
    incrementnum                   DOUBLE COMMENT 'Increment of the auction price',
    city                           STRING COMMENT 'City of the product',
    prov                           STRING COMMENT 'Province of the product',
    ends                           STRING COMMENT 'Time when the product is taken off the shelf',
    quantity                       BIGINT COMMENT 'Quantity',
    stuff_status                   BIGINT COMMENT 'Product status: 0: new; 1: idle; 2: second-hand',
    auction_status                 BIGINT COMMENT 'Product sales status: 0: normal; 1: deleted; 2: taken off the shelf; 3: never put on the shelf',
    cate_id                         BIGINT COMMENT 'Product category ID',
    cate_name                        STRING COMMENT 'Product category name',
    commodity_id                     BIGINT COMMENT 'Commodity ID',
    commodity_name                    STRING COMMENT 'Commodity name',
    umid                              STRING COMMENT 'Buyer UMID'
)
COMMENT 'ODS table of products for auction'
PARTITIONED BY (ds         STRING COMMENT 'Format: YYYYMMDD')
LIFECYCLE 400;

CREATE TABLE IF NOT EXISTS s_sale
(
    id                             STRING COMMENT 'Product ID',
    title                          STRING COMMENT 'Product name',
    gmt_modified                   STRING COMMENT 'Last modification date of the product',
    starts                         STRING COMMENT 'Time when the product is put on the shelf',
    price                          DOUBLE COMMENT 'Product price in RMB',
    city                           STRING COMMENT 'City of the product',
    prov                           STRING COMMENT 'Province of the product',
    quantity                       BIGINT COMMENT 'Quantity',
    stuff_status                   BIGINT COMMENT 'Product status: 0: new; 1: idle; 2: second-hand',
    auction_status                 BIGINT COMMENT 'Product sales status: 0: normal; 1: deleted; 2: taken off the shelf; 3: never put on the shelf',
    cate_id                      BIGINT COMMENT 'Product category ID',
    cate_name                    STRING COMMENT 'Product category name',
    commodity_id                 BIGINT COMMENT 'Commodity ID',
    commodity_name                STRING COMMENT 'Commodity name',
    umid                          STRING COMMENT 'Buyer UMID'
)
COMMENT 'ODS table of products for sale'
PARTITIONED BY (ds      STRING COMMENT 'Format: YYYYMMDD')
LIFECYCLE 400;


CREATE TABLE IF NOT EXISTS s_users_extra
(
    id                STRING COMMENT 'User ID',
    logincount        BIGINT COMMENT 'Number of logons',
    buyer_goodnum     BIGINT COMMENT 'Number of favorable comments received as a buyer',
    seller_goodnum    BIGINT COMMENT 'Number of favorable comments received as a seller',
    level_type        BIGINT COMMENT '1: level-1 shop; 2: level-2 shop; 3: level-3 shop',
    promoted_num      BIGINT COMMENT '1: class-A service; 2: class-B service; 3: class-C service',
    gmt_create        STRING COMMENT 'Creation time',
    order_id          BIGINT COMMENT 'Order ID',
    buyer_id          BIGINT COMMENT 'Buyer ID',
    buyer_nick        STRING COMMENT 'Buyer nickname',
    buyer_star_id     BIGINT COMMENT 'Buyer star ID',
    seller_id         BIGINT COMMENT 'Seller ID',
    seller_nick       STRING COMMENT 'Seller nickname',
    seller_star_id    BIGINT COMMENT 'Seller star ID',
    shop_id           BIGINT COMMENT 'Shop ID',
    shop_name         STRING COMMENT 'Shop name'
)
COMMENT 'User information extended table'
PARTITIONED BY (ds       STRING COMMENT 'yyyymmdd')
LIFECYCLE 400;

CREATE TABLE IF NOT EXISTS s_biz_order_delta
(
    biz_order_id         STRING COMMENT 'Order ID',
    pay_order_id         STRING COMMENT 'Payment order ID',
    logistics_order_id   STRING COMMENT 'Logistics order ID',
    buyer_nick           STRING COMMENT 'Buyer nickname',
    buyer_id             STRING COMMENT 'Buyer ID',
    seller_nick          STRING COMMENT 'Seller nickname',
    seller_id            STRING COMMENT 'Seller ID',
    auction_id           STRING COMMENT 'Product ID',
    auction_title        STRING COMMENT 'Product title',
    auction_price        DOUBLE COMMENT 'Product price',
    buy_amount           BIGINT COMMENT 'Quantity of purchased products',
    buy_fee              BIGINT COMMENT 'Purchase order amount',
    pay_status           BIGINT COMMENT 'Payment status: 1: unpaid; 2: paid; 3: refunded',
    logistics_id         BIGINT COMMENT 'Logistics order ID',
    mord_cod_status      BIGINT COMMENT 'Logistics status: 0: initial state; 1: order received; 2: order receiving timed out; 3: package picked up; 4: package pickup failed; 5: package signed off; 6: package sign-off failed; 7: logistics order canceled by the user',
    status               BIGINT COMMENT 'Order status: 0: normal; 1: invisible',
    sub_biz_type         BIGINT COMMENT 'Business type: 1: auction; 2: purchase',
    end_time             STRING COMMENT 'Transaction end time',
    shop_id              BIGINT COMMENT 'Shop ID'
)
COMMENT 'Daily incremental table of successful orders'
PARTITIONED BY (ds       STRING COMMENT 'yyyymmdd')
LIFECYCLE 7200;


CREATE TABLE IF NOT EXISTS s_logistics_order_delta
(
    logistics_order_id STRING COMMENT 'Logistics order ID',
    post_fee           DOUBLE COMMENT 'Logistics fee',
    address            STRING COMMENT 'Shipping address',
    full_name          STRING COMMENT 'Recipient full name',
    mobile_phone       STRING COMMENT 'Mobile number',
    prov               STRING COMMENT 'Province',
    prov_code          STRING COMMENT 'Province ID',
    city               STRING COMMENT 'City',
    city_code          STRING COMMENT 'City ID',
    logistics_status   BIGINT COMMENT 'Logistics status:
1: not shipped
2: shipped
3: received
4: returned
5: picking',
    consign_time       STRING COMMENT 'Shipping time',
    gmt_create         STRING COMMENT 'Order creation time',
    shipping           BIGINT COMMENT 'Shipping method:
1: surface mail
2: express delivery
3: EMS',
    seller_id          STRING COMMENT 'Seller ID',
    buyer_id           STRING COMMENT 'Buyer ID'
)
COMMENT 'Daily incremental table of logistics orders'
PARTITIONED BY (ds                 STRING COMMENT 'Date')
LIFECYCLE 7200;


CREATE TABLE IF NOT EXISTS s_pay_order_delta
(
    pay_order_id     STRING COMMENT 'Payment order ID',
    total_fee        DOUBLE COMMENT 'Total amount payable (Quantity × Unit price)',
    seller_id STRING COMMENT 'Seller ID',
    buyer_id  STRING COMMENT 'Buyer ID',
    pay_status       BIGINT COMMENT 'Payment status:
1: waiting for payment
2: waiting for shipping
3: transaction succeeded',
    pay_time         STRING COMMENT 'Payment time',
    gmt_create       STRING COMMENT 'Order creation time',
    refund_fee       DOUBLE COMMENT 'Refund amount (freight included)',
    confirm_paid_fee DOUBLE COMMENT 'Total revenue of received and confirmed products'
)
COMMENT 'Daily incremental table of payment orders'
PARTITIONED BY (ds        STRING COMMENT 'Date')
LIFECYCLE 7200;

Store data at the ODS layer

To analyze historical data, you can add the time dimension as a partition field to an ODS table. In practice, you can select incremental, full, or zipper storage to store data in ODS tables.
  • Incremental storage
    In incremental storage mode, data is stored in units of days and uses the data timestamp as a partition field. Each partition stores daily incremental business data. For example:
    • On January 1, 2016, User A visited E-commerce shop B of Company A. Log entry t1 was generated in the e-commerce logs of Company A. On January 2, User A visited E-commerce shop C of Company A. Log entry t2 was generated in the e-commerce logs of Company A. In incremental storage mode, t1 is stored in the 20160101 partition, and t2 is stored in the 20160102 partition.
    • On January 1, 2016, User A purchased Product B on the e-commerce website of Company A. Log entry t1 was generated in the transaction logs of Company A. On January 2, User A returned Product B. Log entry t1 was updated in the transaction logs of Company A. In incremental storage mode, the initial log entry t1 is stored in the 20160101 partition, and the updated log entry t1 is stored in the 20160102 partition.
    Note Incremental storage is suitable for ODS tables that store highly transactional data such as transactions and logs. These tables store a large amount of data and require high storage costs if full storage is adopted. In addition, the descendant applications of these tables seldom access historical full data. The requirement for accessing full data can be met through the aggregation of data in the data warehouse. For example, the ODS tables of logs are not updated, so you can obtain full data by aggregating all incremental partitions.
  • Full storage

    In full storage mode, data is stored in units of days and uses the data timestamp as a partition field. Each partition stores full business data as of the data timestamp. For example:

    On January 1, 2016, Seller A posted Products B and C on the e-commerce website of Company A. Data entries t1 and t2 were generated in the front-end product table. On January 2, Seller A took Product B off the shelf and posted Product D. In the front-end product table, data entry t1 was updated and data entry t3 was generated. In full storage mode, data entries t1 and t2 are stored in the 20160101 partition, and the updated data entry t1 and data entries t2 and t3 are stored in the 20160102 partition.
    Note Full storage is suitable for slowly changing dimensions (SCDs) that involve a small amount of data, such as the product category.
  • Zipper storage

    In zipper storage mode, two timestamp fields start_dt and end_dt are added to record all data changes in units of days. Generally, the two timestamp fields are also used as partition fields.

    The following table shows how to store data in zipper storage mode.
    Product start_dt end_dt Seller Status
    B 20160101 20160102 A Put on the shelf
    C 20160101 30001231 A Put on the shelf
    B 20160102 30001231 A Taken off the shelf

    In this way, descendant applications can obtain historical data by specifying timestamp fields. For example, if you want to access the data of January 1, you only need to specify the timestamp fields start_dt and end_dt as follows: start_dt <= 20160101 and end_dt > 20160101.

Process SCDs

We recommend that you use a natural key, but not a surrogate key, as the primary key of a MaxCompute dimension table. The reasons are as follows:
  1. MaxCompute is a distributed compute engine. The workload is heavy for MaxCompute to generate a globally unique surrogate key. Especially in the case of a large amount of data, the generation of a surrogate key is more complicated and unnecessary.
  2. Using a surrogate key increases the complexity of the extract-transform-load (ETL) processing and adds the cost of developing and maintaining ETL jobs.
Without using a surrogate key, you can process SCDs in one of the following ways:
  1. Snapshot

    Data is usually computed once a day. Based on this frequency, a full snapshot is taken for SCDs every day.

    For example, a full snapshot of product data is generated every day for the product dimension. You can obtain the product information of the day or the latest product information from the fact table of any day. To achieve this, you can specify a date and use the natural key of the product dimension table to associate the snapshot data of the fact table with that of the dimension table. The snapshot mode has the following advantages:
    • This mode is simple and effective to handle SCDs. It features low development and maintenance costs.
    • This mode is easy to use and understand. You can easily obtain the snapshot data of the day by specifying a date. The fact table snapshot of any day can be associated with the dimension table snapshot of that day by using the natural key of a dimension table.

    However, the disadvantage of this mode is that it greatly wastes the storage space. For example, if the changes of a dimension account for a low proportion to the total data volume or the dimension remains unchanged in extreme cases, the snapshot data of the dimension still occupies a large storage space. This mode improves the ETL efficiency and simplifies the ETL logic at the expense of the storage space. We recommend that you use this mode properly. You must establish a data lifecycle mechanism to clear unnecessary historical data.

  2. Zipper storage (Ultimate storage)

    Zipper storage is described in the previous section of this topic. This mode is not easy to understand. In addition, this mode specifies the effective date and generates a large number of partitions. Due to the upper limit on the number of partitions in MaxCompute, this mode is not helpful for you to maintain the data warehouse in a long term.

    The underlying data is stored in ultimate storage mode. The upper layer uses views or hooks and the syntax tree of analytical statements to convert queries about table data that is not stored in ultimate storage mode to queries about table data that is stored in ultimate storage mode. For descendant applications, the data stored in ultimate storage mode is the same as that stored in full storage mode. In addition, to reduce the number of partitions generated for historical zipper tables, ultimate storage changes daily partitions to monthly partitions. That is, data is stored in units of months in zipper storage mode.

    Ultimate storage can save a large storage space, but it is difficult to use. The difficulty lies in maintaining the full table and filtering frequently changed dimension attributes.

Generally, we recommend that you use the snapshot mode to process SCDs. If a large amount of data is involved, we recommend that you use the zipper storage (ultimate storage) mode.

Synchronize data

Data at the ODS layer must be synchronized from various data source systems to MaxCompute for further development. We recommend that you use the data integration feature of DataWorks to synchronize data. When using data integration, obey the following rules:
  • To maintain the consistency of the table schema, you can synchronize data from the source tables of a system to MaxCompute only once.
  • After data is synchronized by using data integration, the synchronized full data is directly integrated into the partition of the current day in the full table.
  • We recommend that you store data at the ODS layer in partition tables by statistical date and time to reduce data storage costs and impose policy control.
  • Data integration can adapt fields in the source system to MaxCompute as follows:
    • If a field in the source system does not exist in the destination table in MaxCompute, data integration can automatically add the field to the destination table.
    • If a field of the destination table does not exist in the source system, data integration fills it with NULL.