The dimension data (DIM) layer defines conformed dimensions for an enterprise based on the concepts of dimensional modeling.

The DIM layer is composed of dimension tables. Dimensions are logically defined to measure and observe the business performance. Dimension tables are physically created on big data platforms based on dimensions and their attributes. They are designed as wide tables. To build the DIM layer, you must first define dimensions.

Define dimensions

When dividing data domains and building a bus matrix, you need to define dimensions based on the analysis of business processes. This tutorial takes the e-commerce marketing business unit of Company A as an example. In the transaction data domain, this tutorial focuses on the business process of receipt confirmation, where a transaction is successful.

In the business process of receipt confirmation, dimensions can be defined from two perspectives: product and place of receipt. This tutorial assumes that each product is purchased and received in the same place. From the product perspective, the following dimensions can be defined:
  • Product ID
  • Product name
  • Product price
  • Product status: 0: new; 1: idle; 2: second-hand
  • Product category ID
  • Product category name
  • Commodity ID
  • Commodity name
  • Buyer ID
  • Product sales status: 0: normal; 1: deleted; 2: taken off the shelf; 3: never put on the shelf
  • City of the product
  • Province of the product
From the place of receipt perspective, the following dimensions can be defined:
  • Buyer ID
  • City code
  • City name
  • Province code
  • Province name

As the core of dimensional modeling, dimensions must be unique in an enterprise-level data warehouse. For example, each dimension defined from the product perspective must be unique for Company A. The province code dimension delivers the same information in any business process.

Design dimension tables

After defining dimensions, you can supplement data to generate dimension tables. Bear the following suggestions in mind when designing dimension tables:
  • Limit the size of each dimension table to contain no more than 10 million data entries.
  • Use MAPJOIN to join dimension tables to other tables.
  • Avoid frequently updating dimension tables.
When designing dimension tables, you need to consider the following aspects:
  • The stability of data in dimension tables. For example, the e-commerce member data of Company A usually does not expire, but the data may be updated at any time. In this case, you need to create a partition to store full member data. If some data entries are never updated, you may need to create a historical table and a daily table. The daily table is used to store currently valid data entries and keep the table data at a proper size. In the historical table, data is inserted into the corresponding partition based on the expiration time. Each partition stores the data entries that expire at the corresponding time.
  • Whether vertical splitting is required. You need to split fields and create multiple dimension tables if a dimension table contains a large number of unused attributes, or data queries slow down because the table contains too many attribute fields.
  • Whether horizontal splitting is required. If some data entries are distinct from others, you can split a dimension table into multiple tables or design multi-level partitions.
  • The time limit on data queries. Generally, the execution time of data queries about core dimension tables is strictly limited.
Perform the following steps to design dimension tables:
  1. Initially define dimensions and ensure the conformity of dimensions.
  2. Specify the primary dimension table that constitutes the central fact table of the star schema used in this tutorial. The primary dimension table, which is an operational data store (ODS) table, directly synchronizes data from the relevant business system. For example, the s_auction table records the information about products and synchronizes data from the front-end product center system. This table is the primary dimension table.
  3. Specify relevant dimension tables. The data warehouse aggregates the data of source business systems. Dimension tables in different business systems or within the same business system are associated. Based on the analysis of the business logic, specify which tables are associated with the primary dimension table, and select some of the tables to generate dimension attributes. For example, the product dimension is associated with dimensions such as the category, seller, and shop.
  4. Specify dimension attributes in two phases. In the first phase, select dimension attributes or generate new dimension attributes from the primary dimension table. In the second phase, select dimension attributes or generate new dimension attributes from the relevant dimension tables. For example, select dimension attributes or generate new dimension attributes for the product dimension from the primary dimension table s_auction and the relevant dimension tables regarding the category, seller, and shop dimensions.
    • Generate as many dimension attributes as possible.
    • Provide as many meaningful descriptions as possible.
    • Differentiate numeric attributes from facts.
    • Find out as many common dimension attributes as possible.

Follow DIM layer design specifications

The naming convention for dimension tables at the DIM layer is dim_{Abbreviated business unit/pub}_{Abbreviated dimension}[_{Abbreviated custom table tag}]. In the naming convention, pub indicates that the dimension is irrelevant to any specific business unit and can be shared by each business unit, for example, the time dimension. For example:
  • Table of common regional information: dim_pub_area
  • Table of full products in the e-commerce marketing business unit of Company A: dim_asale_itm

Create tables

In this tutorial, you can use the following statements to create dimension tables at the DIM layer:
CREATE TABLE IF NOT EXISTS dim_asale_itm
(
    item_id                            BIGINT COMMENT 'Product ID',
    item_title                      STRING COMMENT 'Product name',
    item_price                     DOUBLE COMMENT 'Product price in RMB',
    item_stuff_status              BIGINT COMMENT 'Product status: 0: new; 1: idle; 2: second-hand',
    cate_id                          BIGINT COMMENT 'Product category ID',
    cate_name                        STRING COMMENT 'Product category name',
    commodity_id                      BIGINT COMMENT 'Commodity ID',
    commodity_name                  STRING COMMENT 'Commodity name',
    umid                           STRING COMMENT 'Buyer ID',
    item_status                    BIGINT COMMENT 'Product sales status: 0: normal; 1: deleted; 2: taken off the shelf; 3: never put on the shelf',
    city                           STRING COMMENT 'City of the product',
    prov                           STRING COMMENT 'Province of the product',
)
COMMENT 'Table of full products'
PARTITIONED BY (ds        STRING COMMENT 'Date: yyyymmdd');

CREATE TABLE IF NOT EXISTS dim_pub_area
(
    buyer_id       STRING COMMENT 'Buyer ID'
    city_code      STRING COMMENT 'City code',
    city_name      STRING COMMENT 'City name',
    prov_code      STRING COMMENT 'Province code',
    prov_name      STRING COMMENT 'Province name',
)
COMMENT 'Table of common regional information'
PARTITIONED BY (ds             STRING COMMENT 'Date: yyyymmdd')
LIFECYCLE 3600;