Efficient Data Lake Formation Based on JindoFS and OSS

This article explains the process of data lake formation based on Alibaba Cloud OSS and JindoFS big data cache acceleration service.

Why Should We Form a Data Lake

Apache HDFS was the preferred solution for building data warehouses with massive storage capacity in the early days of the big data era. With the development of cloud computing, big data, AI, and other technologies, cloud vendors have been steadily improving their object storage services to better fit Apache Hadoop/Spark and the various AI ecosystems. Because object storage offers massive capacity, high security, low cost, high reliability, and easy integration, IoT devices and websites increasingly keep their raw files of all kinds in object storage. Using object storage to strengthen and extend big data and AI workloads has become an industry consensus. Following this trend, the Apache Hadoop community has launched its own native object store, "Ozone." Moving from HDFS to object storage, and from data warehouses to a data lake, puts data in unified storage where it can be analyzed and processed more efficiently.

For cloud customers, technology selection in the early stage matters a great deal for data lake formation. As the amount of data keeps growing, the cost of later architecture upgrades and cloud data migration grows with it. Enterprises that built large-scale storage systems on the cloud with HDFS have run into many problems. As the native storage system of Hadoop, HDFS has set the storage standard of the big data ecosystem over 10 years of development. Although HDFS has been optimized continuously, the NameNode single-point bottleneck (SPOB) and the JVM bottleneck still limit cluster expansion. Continuous optimization and cluster splitting are required when data volume grows from 1 PB to over 100 PB. Even though HDFS can support EB-level data, it incurs high O&M costs to handle problems such as slow startup, signaling storms, node scaling, node migration, and data balancing.

Data lake formation based on Alibaba Cloud OSS is the best choice among cloud-native big data storage solutions. As the object storage service on Alibaba Cloud, OSS offers high performance, unlimited capacity, high security, high availability, and low cost. JindoFS adapts OSS to the big data ecosystem and adds cache acceleration on top of it. It also provides a dedicated file metadata service to meet the varied analysis and computing requirements of cloud customers. The combination of JindoFS and OSS on Alibaba Cloud has therefore become the ideal choice for customers who migrate their data lake architecture to the cloud.

JindoFS Introduction

Jindo is an on-cloud distributed computing and storage engine that Alibaba Cloud customized on the basis of Apache Spark and Hadoop. Jindo was originally the internal code name used by the Alibaba Cloud open-source big data team; in Chinese, it sounds like "somersault cloud." Jindo carries many optimizations and extensions on top of the open-source projects, and it is deeply integrated with many basic Alibaba Cloud services.

JindoFS is a big data cache acceleration service developed by Alibaba Cloud for on-cloud storage. It is designed to be elastic, efficient, stable, and cost-effective. JindoFS is fully compatible with the Hadoop file system interface, which makes data lake acceleration more flexible and efficient. JindoFS also supports all computing services and engines in EMR, including Spark, Flink, Hive, MapReduce, Presto, and Impala. It has two usage modes, Block and Cache. The following sections describe the JindoFS architecture and the scenarios that each mode is suited to.

JindoFS Architecture

JindoFS consists of two service components: namespace service and storage service.

  • Namespace service is mainly responsible for metadata management and storage service management.
  • Storage service is in charge of managing local data on nodes and cached data on OSS.

As shown in the following JindoFS architecture diagram, the namespace service is deployed on independent nodes. For production environments, we recommend deploying three namespace nodes in a Raft group for high service availability. The storage service is deployed on the computing nodes of the cluster and manages their spare storage resources, such as local disks, SSDs, and memory. It provides the distributed cache capability of JindoFS.


JindoFS Namespace Service

The JindoFS namespace service stores metadata internally in a key-value (K-V) structure. Compared with traditional in-memory storage, this approach offers more efficient operations, easier management, and easier recovery.

  • Efficient metadata operations: The JindoFS namespace service performs better than the memory-based HDFS NameNode because it uses both memory and disk to store and manage metadata. Developed in C++, JindoFS has no garbage collection (GC) problems and responds faster. Its internal design is also improved. For example, it uses finer-grained locks on file metadata and a more efficient mechanism for managing data block replicas.
  • Startup in seconds: Storage practitioners know that once HDFS holds more than 100 million metadata entries, NameNode initialization takes a long time. The NameNode must load the FsImage, merge the edit log, and then wait for all DataNodes to report their blocks. The whole process can take an hour or more. Because the NameNode runs in Active/Standby mode, HDFS can be unavailable for more than an hour if the active node fails while the standby node is restarting, or if both nodes fail. The JindoFS namespace service achieves high availability through Raft and supports 2N + 1 deployments, which tolerate the simultaneous failure of up to N nodes. The internal metadata storage is also designed and optimized so that the service can respond to requests as soon as it starts. Because the namespace service can write metadata to OTS in near real time, it is also easy to replace metadata nodes or even migrate the entire cluster.
  • Low resource consumption: The HDFS NameNode keeps file metadata in memory. This performs well up to a certain scale, but it caps the amount of HDFS metadata at the memory capacity of the node. By calculation, storing 100 million HDFS files requires about 60 GB of Java heap on the NameNode, so a 256-GB server can manage at most about 400 million metadata entries and needs constant JVM GC tuning. JindoFS, by contrast, stores metadata in RocksDB and can easily hold 1 billion metadata entries. Its memory requirements and resource overhead are also small, at less than 10% of those of the NameNode.
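Two of the figures in the list above can be checked with back-of-envelope arithmetic. The following Python sketch is only an illustration: the function names are made up for this example and are not part of HDFS or JindoFS. It assumes a majority-quorum Raft group and the estimate of about 60 GB of heap per 100 million files quoted above.

```python
# Back-of-envelope checks for figures quoted in the list above.
# The helper names are illustrative, not part of HDFS or JindoFS.


def raft_fault_tolerance(nodes: int) -> int:
    """Number of nodes that may fail while a majority quorum survives."""
    return (nodes - 1) // 2


def namenode_file_capacity(heap_gb: float, gb_per_100m_files: float = 60.0) -> float:
    """Files (in millions) that a heap of `heap_gb` could describe, using
    the estimate of about 60 GB of heap per 100 million files."""
    return heap_gb / gb_per_100m_files * 100.0


if __name__ == "__main__":
    for n in (3, 5, 7):
        print(f"{n} namespace nodes tolerate {raft_fault_tolerance(n)} failure(s)")
    # Upper bound: assumes the whole 256 GB could be given to the Java heap.
    print(f"256 GB heap: about {namenode_file_capacity(256):.0f} million files")
```

With three namespace nodes (N = 1), one node can fail. The heap estimate gives roughly 427 million files if all 256 GB were available as heap, which is consistent with the practical ceiling of about 400 million mentioned above.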

JindoFS Storage Service

JindoFS storage service mainly provides high-performance cache acceleration locally, thus greatly simplifying O&M.

  • Elastic O&M: HDFS uses DataNodes to manage storage on the storage nodes, and all data blocks are kept on the disks of those nodes. Each DataNode reports its storage status to the NameNode through periodic checks and heartbeats. The NameNode aggregates these reports and dynamically ensures that every file data block has the configured number of replicas (usually three). Large clusters with more than 1,000 nodes routinely need operations such as scaling out, migrating nodes, decommissioning nodes, and rebalancing data. These operations can complete only after the NameNode has finished scheduling the affected replicas, and computing replicas for a massive number of data blocks adds to the NameNode load. Decommissioning a single storage node typically takes several hours. JindoFS, by contrast, uses the storage service to manage storage on nodes. Because JindoFS keeps a copy of the data on OSS, local replicas serve mainly as a cache for acceleration. For operations such as node migration and decommissioning, JindoFS needs no complex replica computation and can decommission a node quickly by simply marking it.
  • High-performance storage: Developed in C++, JindoFS storage service has natural advantages in connecting to the latest high-performance storage hardware. The storage back end of the storage service can connect to SSD, local disks, and OSS to meet the mass and high-performance storage access requirements of big data frameworks like Hadoop/Spark. It can also connect to high-performance devices like memory and AEP to meet the low-latency and high-throughput storage requirements of AI and machine learning.
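The difference in decommissioning effort described above can be illustrated with a toy model. This is not HDFS or JindoFS code; the function names and the in-memory placement maps are assumptions made for the sketch. It only contrasts the case where local replicas are the sole copies with the case where a durable copy already exists in object storage.

```python
# Toy model, not JindoFS or HDFS code: compares the work needed to
# decommission a node when local replicas are the only copies (HDFS-style)
# with the case where object storage already holds a durable copy.

from typing import Dict, Set

TARGET_REPLICAS = 3  # typical HDFS setting mentioned in the text


def hdfs_style_decommission(placement: Dict[str, Set[str]], node: str) -> int:
    """Return how many block copies must be re-created on other nodes before
    `node` can leave without any block dropping below TARGET_REPLICAS."""
    to_copy = 0
    for nodes in placement.values():
        if node in nodes and len(nodes) - 1 < TARGET_REPLICAS:
            to_copy += 1
    return to_copy


def cache_style_decommission(cache: Dict[str, Set[str]], node: str) -> int:
    """Local copies are only a cache, so the node is simply marked as gone.
    Nothing has to be copied; the return value is the copy count (always 0)."""
    for nodes in cache.values():
        nodes.discard(node)
    return 0


if __name__ == "__main__":
    placement = {
        "b1": {"n1", "n2", "n3"},
        "b2": {"n1", "n2", "n4"},
        "b3": {"n2", "n3", "n4"},
    }
    print("HDFS-style copies needed:", hdfs_style_decommission(placement, "n1"))
    print("Cache-style copies needed:", cache_style_decommission(placement, "n1"))
```

In the HDFS-style case, the amount of copying grows with the number of blocks held by the departing node, which is why decommissioning can take hours on a large cluster. In the cache-style case, the cost is independent of how much data the node held.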

JindoFS Application Scenarios

The metadata of JindoFS is stored in the namespace service (deployed for high availability) on the Master nodes, which delivers performance and user experience on par with HDFS. The storage service on the Core nodes keeps one replica of each data block on OSS, so local data blocks can be scaled quickly and elastically in line with node resources. Multiple clusters can also be interconnected.


To support the different usage scenarios of a data lake, a single JindoFS deployment provides two ways of using OSS: Block mode and Cache mode.

  • Cache: In Cache mode, JindoFS lets you access data that is already stored in OSS. As the name suggests, JindoFS uses its storage capability to build a distributed cache service in the local cluster, so that remote data is cached locally and effectively "localized." You keep accessing files through their original paths, such as oss://bucket1/file1. In Cache mode, all files are stored in OSS, which enables cluster-level elasticity.
  • Block: Block mode is suited to high-performance data processing. Metadata is stored in the namespace service (deployed for high availability), which delivers performance and user experience on par with HDFS. The storage service keeps one replica of each data block in OSS, so local data blocks can be scaled quickly and elastically in line with node resources. With these features, JindoFS can serve as the core storage for high-performance data warehouses, and multiple computing clusters can access data in the primary JindoFS cluster.
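The idea behind Cache mode can be sketched as a read-through cache in front of a remote object store. The class below is a minimal, single-process illustration under stated assumptions: a plain dictionary stands in for OSS, and `ReadThroughCache` and its methods are hypothetical names, not the JindoFS API.

```python
# Minimal read-through cache sketch. A dict stands in for the remote
# object store; the class and method names are hypothetical.

from typing import Dict


class ReadThroughCache:
    def __init__(self, remote: Dict[str, bytes]) -> None:
        self.remote = remote              # durable copy (stands in for OSS)
        self.local: Dict[str, bytes] = {}  # cluster-local cache
        self.hits = 0
        self.misses = 0

    def read(self, path: str) -> bytes:
        """Serve from the local cache if possible, otherwise fetch from
        the remote store and keep a local copy for later reads."""
        if path in self.local:
            self.hits += 1
            return self.local[path]
        self.misses += 1
        data = self.remote[path]  # raises KeyError if the object is absent
        self.local[path] = data
        return data

    def drop_local(self) -> None:
        """Simulate releasing the cluster: only cached copies are lost."""
        self.local.clear()


if __name__ == "__main__":
    cache = ReadThroughCache({"oss://bucket1/file1": b"hello"})
    cache.read("oss://bucket1/file1")  # miss, fetched from the remote store
    cache.read("oss://bucket1/file1")  # hit, served locally
    cache.drop_local()
    cache.read("oss://bucket1/file1")  # still readable through the same path
    print(f"hits={cache.hits}, misses={cache.misses}")
```

The caller uses the same oss:// path whether the data is cached or not, and dropping the local copies loses no data. That is the property that makes cluster-level elasticity possible in this mode.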

Advantages of JindoFS Solution

Compared with other solutions, a data lake solution based on JindoFS and OSS has advantages both in performance and cost.

  • Performance: JindoFS has been compared against alternatives in common scenarios and benchmarks, using tools and engines such as DFSIO, NNBench, TPC-DS, Spark, and Presto. In these tests, Block mode outperformed HDFS across the board, and Cache mode outperformed the Hadoop community's OSS SDK implementation across the board. A detailed test report will be released later.
  • Cost: Cost is a major consideration for enterprises during cloud migration. The cost advantages of JindoFS show up in both O&M and storage. O&M costs cover the routine maintenance of clusters as well as node decommissioning and migration. As mentioned earlier, once an HDFS cluster grows to a certain scale (for example, more than 10 PB of stored data), professional optimization and business split planning are required to avoid the HDFS metadata bottleneck. In addition, as cluster data keeps growing, some nodes and disks will fail, which requires node decommissioning and data balancing. This makes a large cluster difficult to operate and maintain. JindoFS addresses these problems with a storage mode built on OSS and OTS. OSS holds a backup of the original files and data blocks, which makes the cluster more tolerant of node and disk failures. Developed in C++ and refined through engineering, the namespace service has an edge over the JVM-based NameNode in both capacity and performance.

The following part covers the storage cost advantage. Storage cost is the cost incurred for keeping data stored. OSS is billed on a pay-as-you-go basis, which is more cost-effective than an HDFS cluster built on local disks. Pricing for the two approaches:

Establish Big Data Storage Based on HDFS and Local Disks

Refer to this: https://www.alibabacloud.com/product/ecs

Data Lake Formation Based on JindoFS Acceleration

Refer to this: https://www.alibabacloud.com/product/oss/pricing

Cache acceleration for OSS data requires additional disk space on the computing nodes, which increases cost. The extra cost generally depends on the amount of hot data (the data to be cached) and has little to do with the total amount of data stored. In return, it brings higher computing efficiency and savings in computing resources. The overall effect should be evaluated against the actual scenario.
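The relationship between cache cost and data volume can be written down as a small cost model. This is a sketch under simplifying assumptions: the function name and its parameters are hypothetical, costs are linear in volume, and the unit prices in the example are placeholders in arbitrary units rather than real Alibaba Cloud prices.

```python
# Simplified cost model in arbitrary units. The function name, parameters,
# and unit prices are placeholders, NOT real Alibaba Cloud prices.

from typing import Tuple


def lake_monthly_cost(total_tb: float, hot_tb: float,
                      oss_price_per_tb: float,
                      cache_disk_price_per_tb: float) -> Tuple[float, float]:
    """Return (object_storage_cost, cache_disk_cost).

    Object storage is billed on the full data set; the cache disks are
    sized by the hot data only.
    """
    storage_cost = total_tb * oss_price_per_tb
    cache_cost = hot_tb * cache_disk_price_per_tb
    return storage_cost, cache_cost


if __name__ == "__main__":
    OSS_PRICE, DISK_PRICE = 1.0, 4.0  # placeholder unit prices
    for total in (100, 1000):
        storage, cache = lake_monthly_cost(total, 10, OSS_PRICE, DISK_PRICE)
        print(f"total={total} TB, hot=10 TB -> storage={storage}, cache={cache}")
```

When the total data volume grows tenfold while the hot set stays at 10 TB, only the object storage term grows. The cache term is unchanged, which matches the observation above.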

JindoFS Ecosystem

A data lake is open and must connect to a variety of computing engines. JindoFS currently supports components including Spark, Flink, Hive, MapReduce, Presto, and Impala. To make better use of a data lake, JindoFS provides JindoTable to optimize structured data and accelerate queries, JindoDistCp to support offline data migration from HDFS to OSS, and JindoFuse to accelerate machine learning training on the data lake.


Alibaba EMR
