Alibaba Cloud JindoFS easily handles a pressure test of 1 billion+ files

Introduction

The Apache Hadoop Distributed File System (HDFS) is a widely used big data storage solution. Its core metadata service, the NameNode, stores all metadata in memory, so the amount of metadata it can hold is bounded by memory size; a single instance can support roughly 400 million files. JindoFS block mode is a storage optimization system developed by Alibaba Cloud on top of OSS mass storage, providing efficient data read/write acceleration and metadata optimization. Its design avoids the NameNode memory limit: unlike HDFS, the JindoFS metadata service uses RocksDB as the underlying metadata store. RocksDB resides on a large, high-speed local disk, which removes the memory capacity bottleneck, while an in-memory cache holds the metadata of the hottest 10%~40% of files to maintain stable, excellent read and write performance. With the Raft protocol, the JindoFS metadata service runs as three instances (one active, two standby) to achieve high availability.

How does JindoFS actually perform? We ran a pressure test at the scale of 1 billion files to verify whether JindoFS maintains stable performance at that scale, and we also compared it with HDFS on several key metadata operations.

JindoFS 1 billion files test

A single HDFS NameNode instance can support roughly 400 million files, mainly because of the memory limit. In addition, as the number of files grows, the volume of block reports from DataNodes that the NameNode must process also grows, causing large performance jitter. All file metadata is persisted in a large FsImage file to be loaded at the next startup, and a large FsImage file can make the NameNode take more than 10 minutes to start.

JindoFS solves this series of problems. It stores metadata in RocksDB, so compared with the NameNode it can hold far more files and is not limited by memory. Moreover, Worker nodes do not need to report block information, so there is no block-report jitter. The JindoFS metadata service starts within 1 second, and failover between the primary and standby nodes completes within milliseconds. In this test, we therefore grew the metadata from 100 million files to 1 billion files to verify whether performance remains stable.

Testing environment

Dataset (4 sets in total)

To test the performance of the JindoFS metadata service at different metadata scales, we prepared four data sets: initial state (0 files), 100 million files, 500 million files, and 1 billion files. We used a desensitized FsImage file from a real user HDFS cluster and restored it into the JindoFS metadata service, keeping file sizes 1:1 and creating the corresponding block information in the JindoFS metadata. The final data sets are as follows.

Metadata disk space occupation

In addition, the directory depth is mainly between 5 and 7 levels. The file size distribution and directory depth distribution of the data set are close to those of a production environment.

NNBench test

NNBench, short for NameNode Benchmark, is an official HDFS tool for testing NameNode performance. Because it uses the standard FileSystem interface, we can also use it to test the performance of the JindoFS server. NNBench was executed with the following parameters:

Start 200 map tasks; each task writes (or reads) 5,000 files, for a total of 1 million files. (Limited by the size of the test cluster, the number of maps actually running at the same time was 128.)
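For reference, the write round can be launched with a command roughly like the one below. The test jar path and the baseDir are environment-specific placeholders, and the read round swaps -operation create_write for open_read.

hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar nnbench \
  -operation create_write \
  -maps 200 \
  -numberOfFiles 5000 \
  -bytesToWrite 0 \
  -baseDir jfs://test/nnbench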

Test results

The NNBench results give a good picture of how the performance of the metadata service changes as the metadata grows. From the results, we can conclude that:

1. When the number of files reached 1 billion, write TPS was slightly affected, dropping to 88% of the original.

2. When the number of files reached 500 million, read TPS was slightly affected, dropping to 94% of the original. When the number of files reached 1 billion, read TPS remained stable, basically the same as at 500 million files.

TPC-DS test

We used the official TPC-DS data set, 5 TB of data in ORC format, with Spark as the execution engine.
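As a rough illustration only (the warehouse location jfs://test/tpcds_5t and the table shown are placeholders, not the exact setup used in this test, and the JindoFS client jars are assumed to be on the Spark classpath), an ORC table stored on JindoFS can be registered and queried from spark-sql like this:

spark-sql -e "
  CREATE TABLE IF NOT EXISTS store_sales USING ORC LOCATION 'jfs://test/tpcds_5t/store_sales';
  SELECT COUNT(*) FROM store_sales;
"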

We observed that, within the margin of error, TPC-DS performance was essentially unaffected as the metadata grew from 0 to 1 billion files.

ls -R/count test

The NNBench tool above mainly tests single-point write and single-point query performance of the metadata service under high concurrency. However, recursive file listing (ls -R) and file size statistics (du/count) are also frequent user operations, and the execution time of these commands reflects how efficiently the metadata service handles traversal operations.

We used two samples for testing:

1. Run ls -R on a table (half a year of data, 154 partitions, 2.7 million files) and measure the execution time, using the following command

time hadoop fs -ls -R jfs://test/warehouse/xxx.db/tbl_xxx_daily_xxx > /dev/null

2. Run count on a database (500,000 directories, 18 million files) and measure the execution time, using the following command
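(The exact command is not preserved in the original text; following the pattern of the ls -R example above, it would take roughly the following form, with the database path as a placeholder.)

time hadoop fs -count jfs://test/warehouse/xxx.db > /dev/null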

The test results show that for traversal operations (ls -R/count) over the same number of files and directories, the performance of the metadata service remains stable and does not change as the total amount of metadata grows.

With 1 billion files, the metadata occupies nearly 100 GB on disk, while the JindoFS metadata service only caches part of the hot file metadata. Does the page cache of the metadata files affect performance? We tested this as well.

Hot start: restart the metadata service directly; the page cache is still warm at this point.

Cold start: we cleared the cache with echo 3 > /proc/sys/vm/drop_caches and then restarted the metadata service.
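For reference, the cold-start preparation amounts to roughly the following sequence; the restart step depends on how the metadata service is deployed and is shown only as a placeholder comment.

sync                                # flush dirty pages first so the cache can actually be dropped
echo 3 > /proc/sys/vm/drop_caches   # drop page cache, dentries and inodes
# restart the JindoFS metadata service (deployment-specific command)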

The test results are as follows (using the 1 billion file data set).

We observed that, with a cold start, these operations took about 0.2 seconds longer, so they were only slightly affected.

Horizontal comparison test with HDFS

The tests above show that JindoFS maintains stable performance with 1 billion files. In addition, we compared JindoFS with HDFS. Since storing 1 billion files in HDFS requires a machine with extremely high specifications, this round of tests mainly covers the 100 million file scenario. We compared the performance of the two on common operations such as list, du, and count.

Sample Description

We sampled four directories, a, b, c, and d:

Directory a: a Hive warehouse directory containing 317,000 directories and 12.5 million files;

Directory b: a database directory containing 12,000 directories and 320,000 files;

Directory c: a table directory containing 91 directories and 77,000 files;

Directory d: a Spark results directory containing 42,000 directories and 71,000 files.

Test results (shorter time means better performance)

Single-level list operation

List the directory one level deep. Sampling method: time hadoop dfs -ls [DIR] > /dev/null

Recursive list operation

List the directory recursively, level by level. Sampling method: time hadoop dfs -ls -R [DIR] > /dev/null

Du operation

Calculate the storage space occupied by the directory. Sampling method: time hadoop dfs -du [DIR] > /dev/null

Count operation

Count the number and total size of files and directories under the directory. Sampling method: time hadoop dfs -count [DIR] > /dev/null

Result analysis

The results above clearly show that JindoFS is significantly faster than HDFS for common operations such as list, du, and count. The reason is that the HDFS NameNode uses a global read/write lock over its in-memory namespace, so query operations, and recursive directory queries in particular, must take the read lock and then walk the directory tree in a single thread, which is slow; holding the lock for a long time also delays other RPC requests.

JindoFS avoids these problems by design: it accelerates recursive directory operations with multiple concurrent threads, so traversals of the directory tree are faster, and it uses a different directory tree storage structure with fine-grained locks to reduce interference between concurrent requests.

Summary

JindoFS block mode can easily store 1 billion+ files and provides high-performance handling of read and write requests. Compared with the HDFS NameNode, it uses less memory, performs better, and is easier to operate and maintain. With JindoFS as the storage engine, the underlying data can be stored on object storage (such as OSS), and JindoFS's local cache acceleration forms a stable, reliable, high-performance big data storage solution on the cloud, providing strong support for upper-layer compute and analytics engines.

In addition, the JindoFS SDK can be used on its own as a replacement for the Hadoop community OSS client. Compared with the community implementation, the JindoFS SDK includes many performance optimizations for reading and writing OSS. You can visit the github repo to download and use it.
