In this article, we will discuss several methods to specify your tables on Alibaba Cloud MaxCompute.
Previously, we talked about some of the best practices of designing MaxCompute tables according to your specific application scenario. In this article, we will be showing you how to specify your tables design on Alibaba Cloud MaxCompute.
MaxCompute Table Restrictions
Table Design Specification - Main Goals of Table Design
Typically when designing tables for big data applications, you should consider three things: storage, compute, and maintenance. For enterprises, it is sensible to consider a design that maximizes storage space and computing power, while keeping maintenance and overall costs at a minimum.
Reduce storage costs. Appropriate table design can reduce redundant storage in hierarchical data designs, thus reducing the amount of data in intermediate tables. At the same time, it can correctly manage the lifecycle of the table data, and can directly reduce the amount of data stored to reduce the storage cost.
Reduce computing costs. Standardize the table design to optimize the reading of the data according to these specifications in the subsequent query and computation of the table data, reducing redundant reading, writing and computing in the computation process and reducing costs while improving the computation performance.
Reduce the maintenance complexity. Standardized table hierarchical design can directly reflect the characteristics of the business. For example, by optimizing the data collection method in the data channel and standardizing the table design, problem with small and medium-sized files in the distributed system as well as the number of table and partition maintenance can be reduced, thus reducing the complexity.
Impact of Table Design
Affected operations: table creation/data entry/table update/table deletion/table management.
Import data scenarios (determine whether to perform real-time data collection or offline batch data writing):
- Query and compute immediately after import.
- Import multiple times, and query and compute regularly.
- Generate intermediate tables after import for computation.
- Reasonable table design and data integration cycle management can reduce the cost of data during storage. MaxCompute takes precedence as a batch data integration library and computes by business logic, such as by partition.
- Query and compute immediately after the import, and the amount of data imported each time needs to be considered to reduce the streaming import of small amounts of data.
- Unreasonable data import and storage (small files) affect the overall storage performance, computing performance and operational stability.
Steps of Table Design
- Determine the project to which the table belongs. Plan the table type and determine which data level it belongs to according to the business process.
- Define table descriptions, including permission definitions and Owner definitions.
- Define a partitioned table or a non-partitioned table based on the amount of data and the characteristics of data integration.
- Define a field or a partition field
- Create a table/transform a table
- Identify the relevant factors of the imported data scenario (including batch data writing/streaming data writing/striped data insertion).
- Define the lifecycle of table and partition data.
- After the table is created, the table schema can be modified according to business changes, such as setting the lifecycle and RangeClustering.
- During the design phase, special attention needs to be paid to the scenario of distinguishing data (batch data writing/streaming data writing/periodic striped data insertion).
- The non-partitioned tables and partitioned tables need to be used reasonably. It is recommended to use partitioned tables for log tables, fact tables, original collection tables and so on, and partition them by time.
- Make sure to take note of the restrictions of various tables and partitions.
Table Data Storage Specification
Standardize Data Lifecycle by Data Hierarchy
The ODS layer of the source table: all the data synchronized daily from the business system is reserved, and the lifecycle definition is kept permanently. In case downstream data is compromised, it can be recovered from ODS. If ODS synchronizes with a full table daily, storage can be compressed by zipping the full table.
Data warehouse (basic) layer: keep at least one complete full amount of data (no need to be as redundant as ODS). For performance reasons, consider splitting tables or partitioning them.
Data mart layer: 1-3 years as necessary. The data from the data mart is easier to generate, so you don't need to keep the historical data for a very long time.
Standardize Data Preservation by Data Change and History
Customer attributes and product attributes change daily, and the historical changes of these attributes are recorded to facilitate tracing the values to a certain point in time.
The fields of the dimension table are redundant in the fact table, that is, the various dimension attribute values "when an event occurs" are bound to the event. It is more convenient for users to use data in the data application layer without associating multiple tables.
Record changes to dimension tables in the form of zipper tables or daily snapshots. With flexible data structure, convenient extension, simple management, and better data consistency, it is more convenient for data processors. It is used in the data base layer.
Data Import Channel and Table Design
- Datahub, which plans the relationship between written partitions and written traffic and commits every 64 M.
- Data integration or DataX, which plans the frequency of writing table partitions, commits every 64 M and avoids committing empty directories. DTS, which plans the relationship between the existing partition and the incremental partition of the written table and sets the frequency of commit.
- Console (Run SQL or Tunnel upload), which avoids inserting or uploading files having high-frequency and little data.
- The "insert into" in the SDK Run Sql. Note that when uploading files to a table or a partition, you need to organize small files after inserting them into the partition to avoid inserting them to a partitioned or non-partitioned table more than one time, and need to merge after insert.
- MaxCompute can only import data through the Tunnel SDK or by executing INSERT INTO in SQL to avoid streaming insertion.
- Each of the channels above has its own logic for streaming data writing, batch data writing, and periodic scheduling writing.
- When writing a table or a partition, the data channel should control the amount of data written at one time to a reasonable value, such as 64 M or more.
Relationship between Partition Design and Logical Storage
As shown above, the table has a total of m first-level partitions. Each first-level partition stores second-level partitions by time, and each second-level partition stores all columns. Considerations for designing partitions:
- The number of partitions has an upper limit.
- The amount of data in each partition must not be too small.
- Query and compute according to the partitioning criteria.
- Avoid writing data to each partition more than one time.
Basic Rules for Table and Partition Design
All table names and field names must follow a unified naming convention.
- The business type of the table needs to be distinguishable.
- Distinguish whether the table is a "fact table", "dimension table", "log table" or "limits storage table" (function to be published).
- The entity information of the table needs to be distinguishable.
- Fields with the same business meaning in different tables should be defined using a uniform data type.
- Avoid unnecessary type conversions.
General rules for the design and use of partitions:
- Only creating partitions are supported and creating partition columns are not supported.
- The number of partitions supported by a single table is 60,000.
- To add partitions to a table that has multi-level partitions, specify all partition values.
- The name of a partition column cannot be modified. Only the values in that column can be altered. To change the values in one or more partitions among multi-level partitions, you must specify values for partitions at each level.
Selection of Partition Field and Normal Field
The role of the partition field is to facilitate data management and to divide the scope of data scans.
When creating a table, you can set normal fields and partition fields. In most cases, normal fields can be interpreted as the data of data files, while partition fields can be interpreted as directories of file systems. The storage footprint of a table is the footprint of a common field.
Partition columns do not store data directly, but they are just like directories in a file system that facilitate data management. At the same time, if a specific partition is specified during computation, the computation process only needs to query the corresponding partitions, thus reducing the amount of computation input. The number of partition columns in the partitioned table cannot exceed 6, that is, the number of directory layers for storing data in the underlying layer cannot exceed 6 levels. Set the appropriate lifecycle for the partitioned table. The periodic management of some data can be implemented through fine-grained partitioning.
- The corresponding fields can be set as partition fields from the data management scope and the commonly used data scanning scope.
- For fields that are irregular or have more than 10,000 types and are not often used as query criteria, they are set as normal fields.
Definition Basis of Partition Field
Sort by priority:
- The time should be fully taken into account when selecting partition columns, trying to avoid updating existing partitions.
- If multiple fact tables (excluding dimension tables) are joined, the "Where" range is used as the query criteria and its result columns are used as partition columns.
- Choose the column referenced in "Group By" or "Distinct" as the partition column.
- Choose as the partitioning column a column which contains evenly distributed values rather than skewed data.
- If a column is referenced within the "=" and "in" conditions of a commonly used SQL statement, choose the column as the partitioning column.
Select ... from table where id=123 and .... ;
Definition Basis of the Number of Partitions
Time partition: partition can be done by day or by month. If partition is done by hour, the average number of the second-level partitions should not be more than 8.
Geographic partition: partition can be done by province, city and county and the multi-level partition should be considered. For example, 23 provinces, 5 autonomous regions, 4 municipalities and 2 special administrative regions; 50 regions (States, leagues); 661 cities, including: 4 municipalities, 283 prefecture-level cities and 374 county-level cities; 1636 counties (autonomous counties, banners, autonomous banners, special zones and forest areas ). After partitioning according to the most fine-grained county level, no more fine-grained partitioning should be performed by hour.
Data in a single partition is recommended to be submitted once for 64 M data. For the multi-level partition, ensure that the data of the second-level partition under each of the most fine-grained level partitions is in accordance with this rule.
The number of single table partitions (including lower-level partitions) cannot exceed 60,000.
Recommendations on the Number of Partitions and the Amount of Data
Partition clipping can be used in computation, which is the advantage of partitioning. It is recommended that the amount of data in a single partition should not be too large. For example, the number of data entries in a single partition can be 10,000, but 50,000 partitions have been built.
Effort should be made to avoid partition data skew, where the amount of data entries in different partitions of a single table varies by more than 1 million. When designing partitions, the number of partitions should be reasonably planned, because fine-grained partitions affect the execution performance of SQL when scanning across partitions.
In cases where the amount of data in a single partition is large, MaxCompute will do sharding while performing tasks without affecting the advantages of partition clipping. A large number of files in a single partition affects the number of MaxComputeInstance, resulting in resource waste and SQL performance degradation.
Multi-level partitioning is adopted, first by date and then by transaction type. The table is split. Each transaction type is separated into a separate table, and each table is partitioned by date. No partitioning is performed on dimension tables.
To learn more about Alibaba Cloud MaxCompute, visit https://www.alibabacloud.com/product/maxcompute