In the new retail industry, granularity to store order transaction data tables in the data warehouse is the most common issue. There are three common granularities: (1) store order data by Stock Keeping Unit (SKU); (2) store order data by Standard Product Unit (SPU); (3) store order data by transaction order. The third mode stores order data by transaction order, which is more suitable for transaction detail data tables than for data warehouse storage. Therefore, this article mainly describes how to merge an SKU table and an SPU table into one table.
Why do the data tables store data using two different granularities? Generally, it is caused by two determinants. The first determinant is related to the characteristics of the data analysis business for the retail industry. The second is related to data volume.
1) Characteristics of the Data Analysis Business for the Retail Industry
2) Data Volume
From the perspectives of industry characteristics and data volume, the preceding sections analyze why a company often has SKU and SPU sales data tables. In addition, there are many situations where SPU is used for inventory, SKU for sales, SPU for unified distribution, and SKU for delivery.
However, in the field of data warehouse analysis, it is often necessary to merge data tables using different granularities to obtain integrated analysis results. For example, make an integrated analysis of the current year's commodity sales data and historical cumulative commodity sales data, or make a correlation analysis of the current year's commodity sales data and the current year's inventory data. In these cases, integrated analysis or correlation analysis is often performed on the SKU table and SPU table.
This article introduces a simple data preprocessing method that merge an SKU table and an SPU table into one table, as well as provides some ideas for easy reference.
Consider two data tables, sku_table and spu_table, storing commodity sales data by SKU and SPU respectively.
--Statements to create tables create table if not exists test_project.sku_table ( shopno string comment 'shop number', prodno string comment 'product number', sizeno string comment 'size number', rtam_lastyy Double comment 'the current year’s sales of a commodity using SKU granularity' ) comment 'SKU table' PARTITIONED BY (ds string comment 'partition date') lifecycle 200; create table if not exists test_project.spu_table ( shopno string comment 'shop number', prodno string comment 'product number', rtam_lishileiji Double comment 'historical cumulative sales of a commodity using SPU granularity' ) comment ' SPU table' PARTITIONED BY (ds string comment 'partition date') lifecycle 200;
Note: In the preceding data table, prodno is the SPU. The combination of prodno and sizeno becomes SKU.
For sku_table, this table uses shopno + prodno + sizeno as the primary key to indicate one data record. One piece of commodity sales data is recorded with "store + SPU + specification/color" of one sales record. This is common in case of new retail industry’s data warehouse.
For spu_table, this table uses shopno + prodno as the composite primary key to indicate one piece of data. One piece of commodity sales data is recorded with "store + SPU" of one sales record, which is common in the data warehouse of the new retail industry.
select * from test_project.sku_table where ds = '20200417';
select * from test_project.spu_table where ds = '20200417';
First, create a data table to store the merged data as shown below.
create table if not exists test_project.sku_spu_merge_table ( shopno string comment 'shop number', prodno string comment 'product number', sizeno string comment 'size number', rtam_lastyy Double comment 'the current year’s sales of a commodity using SKU granularity', rtam_lishileiji Double comment 'historical cumulative sales of a commodity using SPU granularity' ) comment 'merge SKU table (the current year’s sales) and SPU table (historical cumulative sales) into one table' PARTITIONED BY (ds string comment 'partition date') lifecycle 200;
Now, import the two tables to the test_project.sku_spu_merge_table table as shown below.
insert overwrite table test_project.sku_spu_merge_table partition (ds = '20200417') select COALESCE (v1.shopno, v2.shopno) as shopno, COALESCE (v1.prodno, v2.prodno) as prodno, v1.sizeno, v1.rtam_lastyy, v2.rtam_lishileiji from ( select shopno, prodno, sizeno, rtam_lastyy, row_number() over ( partition by shopno,prodno order by sizeno) size_order from test_project.sku_table where ds = '20200417' ) v1 full outer join ( select shopno, prodno, rtam_lishileiji, 1 join_order from test_project.spu_table where ds = '20200417' ) v2 on (v1.shopno = v2.shopno) and (v1.prodno = v2.prodno) and (v1.size_order = v2.join_order);
Let’s describe the SQL statement for more clarity.
1) First, use the row_number() over (partition by shopno,prodno order by sizeno) function to group sku_table by shopno,prodno (that is, store + SPU), then sort by sizeno, and assign the sorting result to size_order.
2) Next, add the last column join_order to sku_table and assign the fixed value 1.
3) Finally, perform a full outer join to sku_table and spu_table, with (v1.shopno = v2.shopno) and (v1.prodno = v2.prodno) and (v1.size_order = v2.join_order) as the join condition. In particular, the condition v1.size_order = v2.join_order is to associate the first row (that is, size_order=1) of sku_table after it is grouped by shopno,prodno with spu_table (that is, join_order=1).
The following figure shows the final result.
Finally, sku_table and spu_table are merged into one table.
The advantage of the preceding method is that when integrated data analysis is performed, such as summation and averaging, the system does not encounter calculation errors due to repeated calculation of SPU data.
The data integration method for SKU and SPU data presented in this article is only a reference. You may build on the ideas mentioned in this article using other methods. For example, SPU sales data is calculated according to the number of SKUs under the SPU and then assigned to the merged data table. Another idea is to perform correlation analysis for the data table by using SPU granularity and the order list to obtain the sales data of each SKU under the SPU and then merge both the data tables using SKU granularity. Just decide which scheme to use based on the actual business situation and project requirements.
The views expressed herein are for reference only and don't necessarily represent the official views of Alibaba Cloud.
Alibaba Clouder - January 21, 2020
andy.zh - November 7, 2019
AliCloud-TechLab - August 25, 2021
ITDSN - May 11, 2020
Alibaba Clouder - February 14, 2019
digoal - June 26, 2019
Alibaba Cloud provides big data consulting services to help enterprises leverage advanced data technology.Learn More
Alibaba Cloud experts provide retailers with a lightweight and customized big data consulting service to help you assess your big data maturity and plan your big data journey.Learn More
Secure and easy solutions for moving you workloads to the cloudLearn More
This all-in-one omnichannel data solution helps brand merchants formulate brand strategies, monitor brand operation, and increase customer base.Learn More
More Posts by Alibaba Clouder