Previously, we talked about some of the best practices of designing MaxCompute tables according to your specific application scenario. In this article, we will be showing you how to specify your tables design on Alibaba Cloud MaxCompute.

Typically when designing tables for big data applications, you should consider three things: storage, compute, and maintenance. For enterprises, it is sensible to consider a design that maximizes storage space and computing power, while keeping maintenance and overall costs at a minimum.
Affected operations: table creation/data entry/table update/table deletion/table management.
Import data scenarios (determine whether to perform real-time data collection or offline batch data writing):
Note:
The ODS layer of the source table: all the data synchronized daily from the business system is reserved, and the lifecycle definition is kept permanently. In case downstream data is compromised, it can be recovered from ODS. If ODS synchronizes with a full table daily, storage can be compressed by zipping the full table.
Data warehouse (basic) layer: keep at least one complete full amount of data (no need to be as redundant as ODS). For performance reasons, consider splitting tables or partitioning them.
Data mart layer: 1-3 years as necessary. The data from the data mart is easier to generate, so you don't need to keep the historical data for a very long time.
Customer attributes and product attributes change daily, and the historical changes of these attributes are recorded to facilitate tracing the values to a certain point in time.
The fields of the dimension table are redundant in the fact table, that is, the various dimension attribute values "when an event occurs" are bound to the event. It is more convenient for users to use data in the data application layer without associating multiple tables.
Record changes to dimension tables in the form of zipper tables or daily snapshots. With flexible data structure, convenient extension, simple management, and better data consistency, it is more convenient for data processors. It is used in the data base layer.
Channel type:
Note:

As shown above, the table has a total of m first-level partitions. Each first-level partition stores second-level partitions by time, and each second-level partition stores all columns. Considerations for designing partitions:
All table names and field names must follow a unified naming convention.
General rules for the design and use of partitions:
The role of the partition field is to facilitate data management and to divide the scope of data scans.
When creating a table, you can set normal fields and partition fields. In most cases, normal fields can be interpreted as the data of data files, while partition fields can be interpreted as directories of file systems. The storage footprint of a table is the footprint of a common field.
Partition columns do not store data directly, but they are just like directories in a file system that facilitate data management. At the same time, if a specific partition is specified during computation, the computation process only needs to query the corresponding partitions, thus reducing the amount of computation input. The number of partition columns in the partitioned table cannot exceed 6, that is, the number of directory layers for storing data in the underlying layer cannot exceed 6 levels. Set the appropriate lifecycle for the partitioned table. The periodic management of some data can be implemented through fine-grained partitioning.
Note:
Sort by priority:
For example:
Select ... from table where id=123 and .... ;
Data in a single partition is recommended to be submitted once for 64 M data. For the multi-level partition, ensure that the data of the second-level partition under each of the most fine-grained level partitions is in accordance with this rule.
The number of single table partitions (including lower-level partitions) cannot exceed 60,000.
Partition clipping can be used in computation, which is the advantage of partitioning. It is recommended that the amount of data in a single partition should not be too large. For example, the number of data entries in a single partition can be 10,000, but 50,000 partitions have been built.
Effort should be made to avoid partition data skew, where the amount of data entries in different partitions of a single table varies by more than 1 million. When designing partitions, the number of partitions should be reasonably planned, because fine-grained partitions affect the execution performance of SQL when scanning across partitions.
In cases where the amount of data in a single partition is large, MaxCompute will do sharding while performing tasks without affecting the advantages of partition clipping. A large number of files in a single partition affects the number of MaxComputeInstance, resulting in resource waste and SQL performance degradation.
Multi-level partitioning is adopted, first by date and then by transaction type. The table is split. Each transaction type is separated into a separate table, and each table is partitioned by date. No partitioning is performed on dimension tables.
To learn more about Alibaba Cloud MaxCompute, visit https://www.alibabacloud.com/product/maxcompute
137 posts | 21 followers
FollowAlibaba Cloud MaxCompute - February 14, 2019
amap_tech - March 16, 2021
Alibaba Cloud Community - November 1, 2022
Alibaba Cloud Community - March 7, 2022
Alibaba Cloud Indonesia - March 17, 2023
Alibaba Cloud MaxCompute - May 30, 2019
137 posts | 21 followers
Follow
Big Data Consulting for Data Technology Solution
Alibaba Cloud provides big data consulting services to help enterprises leverage advanced data technology.
Learn More
MaxCompute
Conduct large-scale data warehousing with MaxCompute
Learn More
Big Data Consulting Services for Retail Solution
Alibaba Cloud experts provide retailers with a lightweight and customized big data consulting service to help you assess your big data maturity and plan your big data journey.
Learn More
ApsaraDB for HBase
ApsaraDB for HBase is a NoSQL database engine that is highly optimized and 100% compatible with the community edition of HBase.
Learn MoreMore Posts by Alibaba Cloud MaxCompute
Raja_KT January 27, 2019 at 3:08 pm
Perchance, you can provide do's and dont's on Star Schema, Snowflake, snapshot tables Slowly Changing Dimensions,Rapidly Changing Dimensions,Junk DimensionsInferred Dimensions,Conformed Dimensions,Degenerate Dimensions,Role Playing Dimensions,Shrunken Dimensions,Static Dimensions etc. if need be