The dimension data (DIM) layer defines conformed dimensions for an enterprise based on the concepts of dimensional modeling.
The DIM layer is composed of dimension tables. Dimensions are logically defined to measure and observe the business performance. Dimension tables are physically created on big data platforms based on dimensions and their attributes. They are designed as wide tables. To build the DIM layer, you must first define dimensions.
Define dimensions
When dividing data domains and building a bus matrix, you need to define dimensions based on the analysis of business processes. This tutorial takes the e-commerce marketing business unit of Company A as an example. In the transaction data domain, this tutorial focuses on the business process of receipt confirmation, where a transaction is successful.
- Product ID
- Product name
- Product price
- Product status
0: new; 1: idle; 2: second-hand
- Product category ID
- Product category name
- Commodity ID
- Commodity name
- Product sales status:
0: normal; 1: deleted; 2: taken off the shelf; 3: never put on the shelf
- City of the product
- Province of the product
- City code
- City name
- Province code
- Province name
As the core of dimensional modeling, dimensions must be unique in an enterprise-level data warehouse. For example, each dimension defined from the product perspective must be unique for Company A. The province code dimension delivers the same information in any business process.
Design dimension tables
- Limit the size of each dimension table to contain no more than 10 million data entries.
- Use MAPJOIN to join dimension tables to other tables.
- Avoid frequently updating dimension tables.
- The stability of data in dimension tables.
For example, the e-commerce member data of Company A usually does not expire, but the data may be updated at any time. In this case, you need to create a partition to store full member data. If some data entries are never updated, you may need to create a historical table and a daily table. The daily table is used to store currently valid data entries and keep the table data at a proper size. In the historical table, data is inserted into the corresponding partition based on the expiration time. Each partition stores the data entries that expire at the corresponding time.
- Whether vertical splitting is required.
You need to split fields and create multiple dimension tables if a dimension table contains a large number of unused attributes, or data queries slow down because the table contains too many attribute fields.
- Whether horizontal splitting is required.
If some data entries are distinct from others, you can split a dimension table into multiple tables or design multi-level partitions.
- The time limit on data queries.
Generally, the execution time of data queries about core dimension tables is strictly limited.
- Initially define dimensions and ensure the conformity of dimensions.
- Specify the primary dimension table that constitutes the central fact table of the star schema used in this tutorial. The primary dimension table, which is an operational data store (ODS) table, directly synchronizes data from the relevant business system. For example, the s_auction table records the information about products and synchronizes data from the front-end product center system. This table is the primary dimension table.
- Specify relevant dimension tables. The data warehouse aggregates the data of source business systems. Dimension tables in different business systems or within the same business system are associated. Based on the analysis of the business logic, specify which tables are associated with the primary dimension table, and select some of the tables to generate dimension attributes. For example, the product dimension is associated with dimensions such as the category, seller, and shop.
- Specify dimension attributes in two phases. In the first phase, select dimension attributes or generate new dimension attributes from the primary dimension table. In the second phase, select dimension attributes or generate new dimension attributes from the relevant dimension tables. For example, select dimension attributes or generate new dimension attributes for the product dimension from the primary dimension table s_auction and the relevant dimension tables regarding the category, seller, and shop dimensions. From the dimension attributes, the following dimensions can be defined:
- Generate as many dimension attributes as possible.
- Provide as many meaningful descriptions as possible.
- Differentiate numeric attributes from facts.
- Find out as many common dimension attributes as possible.
Follow DIM layer design specifications
- Table of common regional information: dim_pub_area
- Table of full products in the e-commerce marketing business unit of Company A: dim_asale_itm
Create tables
CREATE TABLE IF NOT EXISTS dim_asale_itm
(
item_id BIGINT COMMENT 'Product ID',
item_title STRING COMMENT 'Product name',
item_price DOUBLE COMMENT 'Product price in RMB',
item_stuff_status BIGINT COMMENT 'Product status: 0: new; 1: idle; 2: second-hand',
cate_id BIGINT COMMENT 'Product category ID',
cate_name STRING COMMENT 'Product category name',
commodity_id BIGINT COMMENT 'Commodity ID',
commodity_name STRING COMMENT 'Commodity name',
item_status BIGINT COMMENT 'Product sales status: 0: normal; 1: deleted; 2: taken off the shelf; 3: never put on the shelf',
city STRING COMMENT 'City of the product',
prov STRING COMMENT 'Province of the product',
)
COMMENT 'Table of full products'
PARTITIONED BY (ds STRING COMMENT 'Date: yyyymmdd');
CREATE TABLE IF NOT EXISTS dim_pub_area
(
city_code STRING COMMENT 'City code',
city_name STRING COMMENT 'City name',
prov_code STRING COMMENT 'Province code',
prov_name STRING COMMENT 'Province name',
)
COMMENT 'Table of common regional information'
PARTITIONED BY (ds STRING COMMENT 'Date: yyyymmdd')
LIFECYCLE 3600;