A distributed file system stores data by blocks. The size of a block is 64 MB by default and a file smaller than a block is called a small file. Having too many small files on MaxCompute can be problematic, which we will discuss in further detail in this article.
To determine whether there are too many small files, run the following command to check the number of files:
desc extended <table_name>;
We know that having too many small files in a table can be suboptimal for our MaxCompute table, but how do we determine the threshold for the number of files? Let's look at the criteria for determining whether there are too many small files in a table:
Note: Although MaxCompute automatically merges small files for optimization purposes, you still need to use the appropriate table partition design and data upload method to avoid small files generated due to causes 1, 2, and 3.
It takes less time for MaxCompute to process a single large file than multiple small files. Too many small files affect the overall performance of MaxCompute, increase the pressure on the file system, and decrease the space utilization. The number of small files that MaxCompute can process for a Fuxi instance is limited to 120. An excessively large number of files affects the number of Fuxi instances and the overall performance of MaxCompute.
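To make the effect of the 120-file limit concrete, here is a minimal arithmetic sketch. It assumes the file-count limit is the only factor that decides how many Fuxi instances are needed. That is a simplification, since data size also plays a role in practice, and the function name and sample file counts are purely illustrative.

```python
import math

FILES_PER_INSTANCE = 120  # per-instance limit quoted in the text above


def min_fuxi_instances(file_count: int, files_per_instance: int = FILES_PER_INSTANCE) -> int:
    """Lower bound on instances when each one reads at most `files_per_instance` files."""
    if file_count < 0:
        raise ValueError("file_count must be non-negative")
    return math.ceil(file_count / files_per_instance)


# Roughly the same volume of data stored two ways (illustrative numbers):
as_large_files = min_fuxi_instances(100)      # 100 files of about 64 MB each
as_small_files = min_fuxi_instances(100_000)  # 100,000 files of about 64 KB each
print(as_large_files, as_small_files)         # prints: 1 834
```

Under this simplified model, the same data split into small files needs hundreds of times more instances just to satisfy the file-count limit.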
If small files are unavoidable, you can consider merging them. To do this, you can run the following command:
set odps.merge.max.filenumber.per.job=50000;
The default value is 50000. If the number of partitions is greater than 50000, adjust the maximum number of small files to 1000000. If the number of small files exceeds 1000000, merge the small files in several passes by running:
ALTER TABLE <table_name> [PARTITION (<partition_spec>)] MERGE SMALLFILES;
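When many partitions need merging, the statements can be generated rather than typed by hand. The following is a small sketch that only builds the SQL text from the two commands shown above; it does not connect to MaxCompute. The helper name `build_merge_script` and the partition spec format (`ds='20181220'`) are assumptions made for illustration, and depending on the client you use, the `set` flag may have to be submitted together with each statement.

```python
from typing import Iterable, List


def build_merge_script(table: str,
                       partition_specs: Iterable[str] = (),
                       max_files_per_job: int = 50000) -> List[str]:
    """Return the merge commands for a table, one per partition spec.

    With no partition specs, a single table-level merge statement is returned.
    """
    specs = list(partition_specs)
    lines = [f"set odps.merge.max.filenumber.per.job={max_files_per_job};"]
    if not specs:
        lines.append(f"ALTER TABLE {table} MERGE SMALLFILES;")
    for spec in specs:
        lines.append(f"ALTER TABLE {table} PARTITION ({spec}) MERGE SMALLFILES;")
    return lines


if __name__ == "__main__":
    for line in build_merge_script("tableA", ["ds='20181219'", "ds='20181220'"]):
        print(line)
```

Merging one partition per statement keeps each job well below the per-job file limit, which matches the advice to merge in several passes when the file count is very large.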
If your table is already a partitioned table, check whether the partition fields can be converged. Too many partitions also affect the computing performance of MaxCompute. We recommend that you partition the table by date.
Example:
insert overwrite table tableA partition (ds='20181220')
select * from tableA where ds='20181220';
If your table is a non-partitioned table, periodically run the command for merging small files. However, we recommend that you redesign the table as a partitioned table:
Example:
create table sale_detail_patition like sale_detail;
alter table sale_detail_patition add partition(sale_date='20181220', region='china');
insert overwrite table sale_detail_patition partition (sale_date='20181220', region='china')
select * from sale_detail;
Note: If you use "insert overwrite" to rewrite the full data in order to merge small files, do not run "insert overwrite" and "insert into" on the same table or partition at the same time; otherwise, data may be lost.
Design the table partition properly. Whenever possible, design partition fields that can be converged or managed. An excessive number of partitions also affects the computing performance of MaxCompute. We recommend that you partition the table by date and set the lifecycle properly to facilitate the recycling of historical data and control of your storage costs.
To learn more about optimal table design, read MaxCompute Table Design Specification and Best Practices for MaxCompute Table Design.
Avoid generating small files when you use data integration tools to write data to MaxCompute.
Tunnel -> MaxCompute
Avoid frequent commit operations when using Tunnel to upload data. Whenever possible, ensure that the size of data submitted each time is greater than 64 MB. For more information, see https://www.alibabacloud.com/help/doc-detail/27833.htm
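One way to follow this advice is to buffer records on the client side and commit only once a batch has reached the block size. The sketch below shows just the batching logic in plain Python. It deliberately does not call the Tunnel SDK, and `batch_records` is a hypothetical helper, not part of any MaxCompute library.

```python
from typing import Iterable, Iterator, List

MIN_COMMIT_BYTES = 64 * 1024 * 1024  # 64 MB, the block size mentioned above


def batch_records(records: Iterable[bytes],
                  min_bytes: int = MIN_COMMIT_BYTES) -> Iterator[List[bytes]]:
    """Group records into batches of at least `min_bytes` each.

    Only the final batch may be smaller, because the input simply runs out.
    """
    batch: List[bytes] = []
    size = 0
    for record in records:
        batch.append(record)
        size += len(record)
        if size >= min_bytes:
            yield batch
            batch, size = [], 0
    if batch:
        yield batch  # trailing partial batch


if __name__ == "__main__":
    # Tiny threshold so the demo runs instantly; use the default in practice.
    sample = [b"x" * 10] * 25
    for i, batch in enumerate(batch_records(sample, min_bytes=100)):
        # Each batch would correspond to one upload followed by one commit.
        print(i, len(batch), sum(len(r) for r in batch))
```

Each yielded batch would then be written and committed once, so the number of commits tracks the data volume rather than the number of records.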
DataHub -> MaxCompute
If small files are generated when you use DataHub, we recommend that you apply for only as many shards as you need and merge shards based on the topic throughput to reduce the number of shards. You can observe how the data traffic changes based on the topic throughput and, where appropriate, increase the interval between data write operations.
The policy for applying for the number of DataHub shards is as follows (too many DataHub shards will result in an excessively large number of small files):
Recommendation: If the traffic is 5 MB/s, apply for five shards. To reserve a buffer of 20% to cope with traffic peaks, you can apply for six shards.
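The recommendation above amounts to a short calculation. The sketch below assumes about 1 MB/s of throughput per shard, a figure inferred from the "5 MB/s, five shards" example rather than stated in this article, so check it against the DataHub limits for your topic. The function and parameter names are illustrative.

```python
import math


def recommended_shards(traffic_mb_per_s: float,
                       buffer_percent: int = 20,
                       mb_per_s_per_shard: float = 1.0) -> int:
    """Shard count for the given traffic plus a peak buffer, rounded up.

    `mb_per_s_per_shard` is an assumed per-shard throughput (see lead-in).
    """
    if traffic_mb_per_s < 0 or mb_per_s_per_shard <= 0:
        raise ValueError("traffic must be >= 0 and per-shard throughput > 0")
    needed = traffic_mb_per_s * (100 + buffer_percent) / 100 / mb_per_s_per_shard
    return max(1, math.ceil(needed))  # a topic needs at least one shard


print(recommended_shards(5, buffer_percent=0))  # prints: 5
print(recommended_shards(5))                    # prints: 6
```

Sizing from measured throughput rather than over-provisioning keeps the shard count low, which limits the number of small files written to MaxCompute.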
DataX -> MaxCompute
DataX also encapsulates the Tunnel SDK to write data to MaxCompute. Therefore, we recommend that you set blockSizeInMB to a value greater than 64 MB when configuring ODPSWriter.
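As a rough illustration, the writer section of a DataX job could be built and checked like this. Only `blockSizeInMB` comes from this article. The surrounding layout (`name`, `parameter`, `table`) follows the usual DataX job JSON shape and should be treated as an assumption, as should the example value of 128; the real job needs further settings, such as credentials and endpoints, which are omitted here.

```python
import json


def odps_writer_fragment(table: str, block_size_mb: int = 128) -> dict:
    """Build a partial ODPSWriter section, rejecting block sizes of 64 MB or less."""
    if block_size_mb <= 64:
        raise ValueError("blockSizeInMB should be greater than 64 to avoid small files")
    return {
        "name": "odpswriter",  # assumed plugin name in the job JSON
        "parameter": {
            "table": table,                  # hypothetical target table
            "blockSizeInMB": block_size_mb,  # setting recommended in the text
        },
    }


if __name__ == "__main__":
    print(json.dumps(odps_writer_fragment("tableA"), indent=2))
```

Validating the value when the job file is generated catches a too-small block size before any data is written.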