High-performance data lake storage solution for big data on the cloud
The separation of computing and storage has become a trend in cloud computing. Before this separation, the fused computing-storage architecture (left side of the figure below) was commonly used, but it has several problems. First, when the cluster expands, computing and storage capacity may no longer match: users sometimes only need to expand computing power or only storage capacity, but the fused architecture cannot scale the two independently. Second, shrinking the cluster may require manual intervention, after which data must be re-synchronized across multiple nodes; when multiple replicas need to be synchronized, data loss can occur. The computing-storage separation architecture (right side of the figure below) solves these problems well, so that users only need to care about the computing power of the entire cluster.
EMR's existing computing-storage separation solution is based on OssFS, a Hadoop-compatible file system built on top of OSS. It offers massive storage capacity, so users do not need to worry about storage falling short of business needs; and because OssFS accesses data on OSS, it retains OSS advantages such as low cost and high reliability. However, OssFS also has drawbacks. For example, job output may be moved into a final directory after a task finishes, and during this process file move and rename operations are slow; this is mainly because OssFS simulates file-system semantics on top of OSS, and object storage does not support atomic rename or move operations. In addition, OSS is a public cloud service with limited bandwidth, and frequently accessed data consumes too much of that bandwidth. In contrast, JindoFS overcomes these problems while retaining the advantages of OssFS.
Introduction to EMR JindoFS
1) Architecture Introduction
The overall architecture of EMR JindoFS is shown in the figure below. It consists mainly of two components: the Namespace service and the Storage service. The Namespace service is responsible for JindoFS metadata management and for managing the Storage service, while the Storage service manages user data (both local data and remote OSS data). EMR JindoFS is a cloud-native file system that provides the performance of local storage together with the large capacity of OSS.
• Namespace service
The Namespace service is the central service of JindoFS. It manages user metadata, including the metadata of JindoFS's own file system, the metadata of split blocks, and the metadata of the Storage service. The Namespace service supports multiple namespaces, so users can divide namespaces by business; different namespaces store different business data without interfering with each other. The Namespace service does not use a single global lock; instead it implements concurrency control at the directory level, allowing concurrent creation and concurrent deletion. In addition, the Namespace service supports different back-end stores: at this stage mainly RocksDB, with Alibaba Cloud OTS support expected in the next version. The advantage of using OTS as the metadata back-end is that an EMR cluster can be destroyed at any time, and when a new cluster is created the JindoFS data can still be accessed, which gives users greater flexibility.
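The directory-level concurrency control mentioned above can be pictured with a small sketch. This is an illustrative Python model, not JindoFS's actual implementation: a lock table keyed by parent directory, so operations in different directories never contend on one global lock.

```python
import threading
from collections import defaultdict

class DirLockTable:
    """Illustrative per-directory lock table: operations in different
    directories proceed concurrently instead of serializing on one
    global namespace lock."""

    def __init__(self):
        self._guard = threading.Lock()            # protects the table itself
        self._locks = defaultdict(threading.Lock)  # one lock per directory

    def lock_for(self, path):
        # Lock at the parent-directory level,
        # e.g. /ns1/tbl_a/part-0 -> lock for /ns1/tbl_a
        parent = path.rsplit("/", 1)[0] or "/"
        with self._guard:
            return self._locks[parent]

table = DirLockTable()
fs = {}  # toy stand-in for the file-system metadata store

def create_file(path):
    # Only creations inside the same directory serialize on one lock.
    with table.lock_for(path):
        fs[path] = b""

create_file("/ns1/tbl_a/part-0")
create_file("/ns1/tbl_b/part-0")  # takes a different lock than tbl_a
```

Two threads creating files under different directories would acquire different locks and run fully in parallel.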
• Storage service
The Storage service manages the data on the cluster's local disks, locally cached data, and OSS data. It supports different storage back-ends and storage media: at this stage the back-ends are the local file system and OSS, and local storage can use media such as HDD, SSD, or DCPM to provide cache acceleration. In addition, the Storage service is optimized for scenarios with many small files, avoiding the pressure that large numbers of small files would otherwise put on the local file system and the resulting drop in overall performance.
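One common way to relieve small-file pressure, sketched below, is to pack many small files into a single larger blob plus an index of offsets. This is only an illustration of the general technique; the Storage service's actual on-disk format is internal to JindoFS.

```python
import io

def pack_small_files(files):
    """Pack many small files into one blob plus an index of
    name -> (offset, length). Illustrative only."""
    blob, index, offset = io.BytesIO(), {}, 0
    for name, data in files.items():
        blob.write(data)
        index[name] = (offset, len(data))
        offset += len(data)
    return blob.getvalue(), index

def read_packed(blob, index, name):
    """Serve one logical small file out of the packed blob."""
    off, length = index[name]
    return blob[off:off + length]

# 3 tiny files become one local object instead of 3 inodes.
files = {"part-000%d" % i: ("record-%d" % i).encode() for i in range(3)}
blob, index = pack_small_files(files)
```

The local file system then sees one file per pack rather than one per small file, which is the effect the optimization is after.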
In terms of the ecosystem, JindoFS currently supports all major big data components, including Hadoop, Hive, Spark, Flink, Impala, Presto, and HBase. To access EMR JindoFS, users only need to switch the file path scheme to jfs. In addition, for machine learning, the next version of JindoFS will ship a Python SDK so that machine learning users can access data on JindoFS efficiently. EMR JindoFS is also deeply integrated and optimized with EMR Spark, for example through Relational Cache, Spark-based materialized views, and Cube optimization.
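Switching the path scheme is the only client-side change, so a small helper can rewrite existing table locations. The sketch below is hypothetical: the namespace name is an assumption (it would be whatever namespace you configured to back the bucket), and only the jfs:// scheme pattern comes from the text.

```python
from urllib.parse import urlparse

def to_jfs_path(path, namespace="jindo-ns"):
    """Rewrite an oss:// location to the jfs:// scheme used by JindoFS.
    The namespace name here is a hypothetical placeholder; it is assumed
    to be configured to map onto the original bucket."""
    u = urlparse(path)
    if u.scheme == "jfs":
        return path  # already a JindoFS path, leave it alone
    return "jfs://%s%s" % (namespace, u.path)

# e.g. spark.read.parquet(to_jfs_path("oss://my-bucket/warehouse/events"))
```

Everything else (Hive DDL, Spark jobs, Presto catalogs) keeps working, since only the scheme in the location string changes.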
EMR JindoFS Usage Modes
There are two main usage modes of EMR JindoFS: Block mode and Cache mode.
• Block mode
In Block mode, JindoFS files are split into blocks stored on local disks and on OSS; through OSS, users can only see block data, not the original files. This mode depends heavily on the local Namespace service, which manages the metadata and reconstructs files from local metadata and block data. Compared with Cache mode, Block mode delivers the best performance. It is suitable for scenarios where users have performance requirements for both data and metadata, but it requires users to migrate their data into JindoFS.
Block mode supports different storage strategies to suit different application scenarios. The default is the WARM strategy:
a) WARM: the default strategy. Data has one backup on OSS and one backup locally. The local backup accelerates subsequent reads, while the OSS backup provides high reliability;
b) COLD: data has only one backup on OSS and no local backup, suitable for cold data storage;
c) HOT: data has one backup on OSS and multiple local backups, providing further acceleration for the hottest data;
d) TEMP: data has only one local backup, providing high-performance reads and writes for temporary data at the cost of high data reliability.
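The four strategies above can be summarized as a placement table. The replica counts below are illustrative (taken from the descriptions above, e.g. "multiple" local backups shown as 2); the actual counts would be configuration-dependent.

```python
# Illustrative mapping of Block-mode storage strategies to backup placement.
# Counts are for illustration only; "HOT" local copies shown as 2 for "multiple".
POLICIES = {
    "WARM": {"oss_copies": 1, "local_copies": 1},  # default: OSS + one local
    "COLD": {"oss_copies": 1, "local_copies": 0},  # OSS only, cold data
    "HOT":  {"oss_copies": 1, "local_copies": 2},  # OSS + multiple local
    "TEMP": {"oss_copies": 0, "local_copies": 1},  # local only, temporary data
}

def is_highly_reliable(policy):
    # A strategy survives local node loss only if a copy lives on OSS.
    return POLICIES[policy]["oss_copies"] > 0

def read_accelerated(policy):
    # Reads are accelerated whenever at least one local backup exists.
    return POLICIES[policy]["local_copies"] > 0
```

This also makes the TEMP trade-off explicit: it is the only strategy with no OSS copy, hence the sacrifice in reliability.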
Compared with HDFS, the Block mode of JindoFS has the following advantages:
a) Utilizing the cheap and unlimited capacity of OSS, it has the advantages of cost and capacity;
b) Hot and cold data are separated automatically and transparently to computation: when data migrates between tiers, its logical location stays the same, so there is no need to modify the location information in table metadata;
c) Maintenance is simple: no decommissioning is needed, a failed node can simply be removed, and no data is lost because a copy exists on OSS;
d) Fast system upgrade/restart/recovery, with no block report required;
e) Native support for small files, avoiding the excessive pressure that large numbers of small files put on the file system.
• Cache mode
Unlike Block mode, Cache mode stores JindoFS files as objects on OSS. This mode is compatible with existing OSS usage: users can still access the original directory structure and files directly through OSS. At the same time, it caches data and metadata to accelerate users' reads and writes. Users of this mode do not need to migrate their data and can seamlessly work with existing data on OSS, at the cost of some performance compared with Block mode. For metadata synchronization, users can choose among different synchronization strategies according to their needs.
Compared with OssFS, the Cache mode of JindoFS provides the following advantages:
a) Thanks to local backups, read and write throughput is comparable to HDFS;
b) It supports all HDFS interfaces and therefore more scenarios, such as Delta Lake and HBase on JindoFS;
c) With JindoFS acting as a metadata cache, operations such as List and Status are faster than with OssFS;
d) With JindoFS acting as a data cache, users' data reads and writes are accelerated.
The external client lets users access JindoFS from outside the EMR cluster. At this stage, the client only supports Block mode. The client's permissions are bound to OSS permissions: users need the corresponding OSS permissions to access JindoFS data through the external client.
EMR JindoFS + DCPM performance
Next, let's look at how Intel's Optane DC persistent memory (DCPM) can accelerate EMR JindoFS. The figure below shows typical usage scenarios for this data-center-class persistent memory. Starting from the bottom storage layer, in RDMA/replication scenarios the new storage medium delivers higher I/O performance. At the infrastructure layer, high-capacity persistent memory is a better fit when dense applications create many instances. At the database layer, in-memory databases such as SAP HANA and Redis can use the large capacity of persistent memory to support more instances. Similarly, in AI and data analytics, the low-latency device can accelerate real-time analysis (for example with SAS) and machine learning workloads (for example with Databricks). Persistent memory is also applicable in the HPC and COMMS fields.
The configuration of the performance test environment using DCPM to accelerate EMR JindoFS is shown in the figure below. Spark here is EMR Spark, which is based on open-source Spark 2.4.3 with substantial modifications and many new features, including Relational Cache and JindoFS. Persistent memory is used in Storage over App Direct (SoAD) mode, which exposes it to users as a fast storage device. The benchmarks cover three levels: a micro-benchmark, TPC-DS queries, and the Star Schema Benchmark (SSB).
The performance test results of EMR JindoFS with DCPM show that DCPM brings a significant improvement for small-file reads, and the gain grows with larger files and more reader processes. Using decision-support queries as the benchmark, on 2TB of data DCPM delivers a 1.53x improvement across the execution of all 99 queries; DCPM also delivers an overall 2.7x improvement for SSB with Spark Relational Cache.
The figure below shows the micro-benchmark results. The test reads 100 small files at different file sizes (512K, 1M, 2M, 4M, and 8M) and different degrees of parallelism (1-10). As the figure shows, DCPM significantly improves small-file read performance, and the larger the file and the higher the parallelism, the more pronounced the improvement.
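The shape of that micro-benchmark (100 files per size, timed sequential reads) can be sketched as below. This runs against a local temporary directory as a stand-in for a JindoFS-backed path, so it reproduces the methodology, not the published numbers.

```python
import os
import tempfile
import time

def small_file_read_bench(file_size, n_files=100):
    """Create n_files files of file_size bytes, then time reading them all.
    Local-filesystem stand-in for the JindoFS small-file read test."""
    d = tempfile.mkdtemp()
    payload = os.urandom(file_size)
    paths = []
    for i in range(n_files):
        p = os.path.join(d, "f%03d" % i)
        with open(p, "wb") as f:
            f.write(payload)
        paths.append(p)

    start = time.perf_counter()
    total = 0
    for p in paths:
        with open(p, "rb") as f:
            total += len(f.read())
    elapsed = time.perf_counter() - start
    return total, elapsed

# The 512K case from the figure; 1M/2M/4M/8M follow the same pattern.
total, elapsed = small_file_read_bench(512 * 1024)
```

Adding a thread pool over the read loop would reproduce the parallelism dimension (1-10 readers) of the original test.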
The figure below shows the TPC-DS results. The data volume is 2TB, and all 99 TPC-DS queries are tested. Based on normalized time, DCPM brings an overall 1.53x performance improvement. The root cause of the gain can be seen in the two sub-figures on the right: peak memory bandwidth is 6.23GB/s for reads and 2.64GB/s for writes, and Spark workloads are read-heavy, which explains the improvement.
The following figure shows the SSB results with Spark Relational Cache + JindoFS, where SSB (Star Schema Benchmark) is a TPC-H-based benchmark for star-schema database performance. Relational Cache is an important feature of EMR Spark: it accelerates data analysis by pre-organizing and pre-computing data, providing functionality similar to materialized views in traditional data warehouses. In the SSB test, each query is executed individually on 1TB of data, and the system cache is cleared between queries. Based on normalized time, DCPM brings an overall improvement of about 2.7x, with per-query improvements ranging from 1.9x to 3.4x. In summary, the new DCPM hardware not only addresses the I/O bottleneck but also brings end-to-end performance gains to JindoFS.
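The overall figure quoted above comes from normalized time, i.e. total baseline time divided by total accelerated time. The arithmetic can be sketched as follows; the per-query times below are hypothetical placeholders chosen for illustration, not the measured SSB numbers.

```python
def overall_speedup(baseline_times, accelerated_times):
    """Overall speedup on normalized time:
    sum of baseline query times / sum of accelerated query times."""
    return sum(baseline_times) / sum(accelerated_times)

# Hypothetical per-query times in seconds (NOT the published measurements);
# the report cites per-query speedups of 1.9x-3.4x and ~2.7x overall.
baseline = [120.0, 90.0, 60.0]
with_dcpm = [45.0, 30.0, 25.0]
```

Note that the overall ratio weights long-running queries more heavily than a simple average of per-query speedups would.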
Knowledge Base Team