Overview and usage of JindoFS

JindoFS is a Hadoop-compatible file system (HCFS) built for open source big data ecosystems based on Alibaba Cloud Object Storage Service (OSS). JindoFS provides three storage modes to store data in OSS: client-only mode (SDK), cache mode, and block storage mode. JindoFS in client-only mode or cache mode optimizes the access to OSS from computing engines of Hadoop and Spark ecosystems. JindoFS in block storage mode provides a tremendous storage capacity by using OSS as the storage backend and supports efficient metadata queries.

Client-only mode (SDK)

In this mode, JindoFS provides features similar to OSS FileSystem and S3A FileSystem in the Hadoop community. JindoFS optimizes access to Alibaba Cloud OSS and data operations for computing frameworks such as Hive and Spark. This mode does not change the way files or objects are organized in OSS. Files are still stored in OSS. JindoFS provides only client connection, extension, adaptation, and optimized access for the Hadoop ecosystem. To use JindoFS in this mode, you only need to upload the JAR package of JindoFS SDK to the classpath directory. This mode is easy to use and requires no deployment of distributed services.

Cache mode

This mode is compatible with the client-only mode (SDK) and accelerates access to data in OSS by using the distributed data caching capability of Jindo. This helps meet the requirements of large-scale data analysis and the throughput requirements of training jobs. Building on the client-only mode (SDK), the cache mode supports metadata caching and distributed data caching and maintains data compatibility and synchronization with OSS. Data can be cached in memory, on SSDs, and on basic disks to suit different computing scenarios.

Block storage mode

In this mode, JindoFS provides features similar to Apache Hadoop HDFS. JindoFS can cache data to accelerate data access. It can also organize and store data and manage file metadata. In this mode, JindoFS serves as an independent storage system, but files are stored as blocks in OSS.

Comparison between the cache mode and block storage mode

In both modes, JindoFS stores data in OSS and, based on the usage of local storage space, determines whether to cache the data in the local cluster to accelerate data access.

The essential difference between the two modes lies in the file storage methods in OSS. JindoFS in block storage mode manages directories and file metadata and stores files as blocks in OSS. JindoFS in cache mode stores files as objects in OSS.
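
The difference can be pictured with a toy model. The sketch below is only an illustration: the function names, the block IDs, and the tiny block size are hypothetical, and the actual on-OSS layout used by JindoFS is not described here. It shows one file kept as a single object under its own path, compared with the same file split into blocks that can be reassembled only through separately managed metadata.

```python
def store_as_object(path: str, data: bytes) -> dict:
    """Cache-mode style (illustrative): one file becomes one object.

    The object key mirrors the file path, so the data stays readable
    by anything that can read the bucket directly.
    """
    return {path.lstrip("/"): data}


def store_as_blocks(path: str, data: bytes, block_size: int = 4):
    """Block-storage-mode style (illustrative): one file becomes many blocks.

    Returns (metadata, blocks). The metadata maps the file path to an
    ordered list of block IDs; the blocks alone do not reveal the file.
    The block size and ID format are hypothetical.
    """
    blocks = {}
    block_ids = []
    for index, offset in enumerate(range(0, len(data), block_size)):
        block_id = f"blk-{index}"
        blocks[block_id] = data[offset:offset + block_size]
        block_ids.append(block_id)
    metadata = {path: block_ids}
    return metadata, blocks


def read_from_blocks(path: str, metadata: dict, blocks: dict) -> bytes:
    """Reassemble a file by following the metadata entry for its path."""
    return b"".join(blocks[block_id] for block_id in metadata[path])


if __name__ == "__main__":
    payload = b"hello world"
    print(store_as_object("/logs/a.txt", payload))
    meta, blks = store_as_blocks("/logs/a.txt", payload)
    print(meta, blks)
    print(read_from_blocks("/logs/a.txt", meta, blks))
```

In the object layout, the path alone is enough to locate the data. In the block layout, the metadata is required to turn blocks back into a file, which is why block storage mode has to manage directories and file metadata itself.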

Comparison among the three modes

The following table describes the three modes in multiple dimensions.


Dimension	Client-only mode (SDK)	Cache mode	Block storage mode
Storage cost	Stores full data in OSS. Supports the Archive storage class.	Stores full data in OSS. Caches hot data, which accounts for 20% of the total amount of data. Supports the Archive storage class.	Stores full data in OSS. Caches warm data and hot data, which account for 60% of the total amount of data. Supports the Archive storage class. Supports transparent compression.
Scalability	High	Relatively high	Medium
Throughput	Depends on the bandwidth occupied by OSS.	Depends on the bandwidth occupied by OSS and the bandwidth consumed for caching hot data.	Depends on the bandwidth occupied by OSS and the bandwidth consumed for caching warm data and hot data.
Metadata	Simulates HDFS to manage metadata and does not support directory-based storage and file semantics. Supports exabytes of data.	Simulates HDFS to manage metadata and does not support directory-based storage and file semantics. JindoFS can cache file data. Supports exabytes of data.	Provides the highest performance. The compatibility of JindoFS in this mode is close to that of HDFS. Supports more than 1 billion files.
Maintenance workload	Low	Medium. Requires the O&M of the cache system.	Relatively high. Requires the O&M of the Namespace Service and Storage Service.
Security	Supports AccessKey pair-based authentication. Supports RAM authentication. Supports OSS access logs. Supports encryption of OSS data.	Supports AccessKey pair-based authentication. Supports RAM authentication. Supports OSS access logs. Supports encryption of OSS data.	Supports AccessKey pair-based authentication. Allows you to run UNIX commands or use Ranger to manage permissions of JindoFS in this mode. Supports audit logs generated by AuditLog. Supports data encryption.
Usage	Only allows you to specify an OSS directory in the oss://<oss_bucket>/<oss_dir>/ format to access files. Cross-service access to the OSS directory is supported.	Allows you to specify an OSS directory in the oss://<oss_bucket>/<oss_dir>/ format to access files. Cross-service access to the OSS directory is supported. The caching feature can be enabled. This is the default method. Alternatively, allows you to specify a JindoFS directory in the jfs://<your_namespace>/<path_of_file> format for one of the deployed namespaces to access data. Cross-service access to the JindoFS directory is not supported. The caching feature can be enabled. Note: For more information about how to use JindoFS in cache mode, see the documentation of JindoFS in cache mode.	Only allows you to specify a JindoFS directory in the jfs://<your_namespace>/<path_of_file> format for one of the deployed namespaces to access data. Cross-service access to the JindoFS directory is not supported. The caching feature can be enabled. Note: For more information about how to use JindoFS in block storage mode, see the documentation of JindoFS in block storage mode.
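
The Usage row can be summarized as a small lookup on the URI scheme. The following sketch is not part of JindoFS: `describe_jindofs_uri` is a hypothetical helper, and the bucket and namespace names in the example are placeholders. It only encodes what the table states about the `oss://` and `jfs://` formats.

```python
from urllib.parse import urlparse

# Facts taken from the Usage row of the table; the dictionary layout is
# an assumption made for this example.
SCHEME_RULES = {
    "oss": {
        "modes": ["client-only mode (SDK)", "cache mode"],
        "cross_service_access": True,
    },
    "jfs": {
        "modes": ["cache mode", "block storage mode"],
        "cross_service_access": False,
    },
}


def describe_jindofs_uri(uri: str) -> dict:
    """Split an oss:// or jfs:// URI and report which modes accept it."""
    parsed = urlparse(uri)
    if parsed.scheme not in SCHEME_RULES:
        raise ValueError(f"unsupported scheme: {parsed.scheme!r}")
    rules = SCHEME_RULES[parsed.scheme]
    return {
        "scheme": parsed.scheme,
        # For oss:// this is the bucket, for jfs:// the namespace.
        "bucket_or_namespace": parsed.netloc,
        "path": parsed.path,
        "modes": rules["modes"],
        "cross_service_access": rules["cross_service_access"],
    }


if __name__ == "__main__":
    print(describe_jindofs_uri("oss://example-bucket/warehouse/"))
    print(describe_jindofs_uri("jfs://example_namespace/warehouse/part-0"))
```

Cache mode is the only mode that appears under both schemes, which matches the table: it accepts `oss://` paths by default and `jfs://` paths for a deployed namespace.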

FAQ

Q: What mode is recommended for typical data lake scenarios?
A: The client-only mode (SDK) and the cache mode are fully compatible with object storage semantics of OSS and provide complete compute-storage separation and flexible scalability. We recommend that you use the client-only mode (SDK) or the cache mode for big data analysis and AI training acceleration in typical data lake scenarios.
Q: Why does JindoFS in block storage mode provide higher performance than HDFS?
A:
- JindoFS in block storage mode can process more than 1 billion files, whereas HDFS can process at most 400 million files. In addition, the performance of JindoFS in block storage mode is more stable during peak hours of cluster business.
- JindoFS in block storage mode is not constrained by Java on-heap memory limits or memory usage and can process data at a larger scale than HDFS, which is constrained by Java on-heap memory.
- JindoFS in block storage mode requires only lightweight O&M. You do not need to worry about damaged disks or anomalous nodes. A backup of the data is kept in OSS, so nodes can be connected or disconnected.
- JindoFS in block storage mode can transparently compress and archive cold data. It uses various means to optimize costs and connects to OSS to support exabytes of data.
- JindoFS in block storage mode supports some important features of HDFS, such as HDFS AuditLog, integration with Ranger, and data encryption.
Q: What are the special advantages of JindoFS in block storage mode?
A:
- JindoFS in block storage mode can manage file metadata and organize file data. Therefore, it can fully meet the requirements of various big data engines on storage interfaces. These interfaces include but are not limited to the interface to implement the atomicity and transaction processing of rename operations, the interface to implement high-performance local data writing, the interface to implement transparent compression, and the truncate, append, flush, sync, and snapshot interfaces. These high-level storage interfaces are required for complete POSIX support and are used to connect more big data engines, such as Flink, HBase, Kafka, and Kudu, to OSS. JindoFS in client-only mode (SDK) or cache mode can also use some of these interfaces to access OSS, but these two modes offer fewer capabilities and advantages in this respect.
- The block storage mode is more cost-effective than the other two modes. This is because in block storage mode, warm data and hot data, which account for 60% of the total amount of data, are cached in local clusters. Therefore, you can read a large amount of data from your local cluster instead of OSS.
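
As a rough illustration of the cache ratios quoted above (20% for cache mode, 60% for block storage mode), the following sketch estimates how much data would sit in the local cluster for a given dataset size. The function name, the 500 TB figure, and the 0% value for client-only mode are assumptions made for this example. The comparison table says only that client-only mode stores full data in OSS.

```python
# Share of the total data cached locally, in percent.
# 20 and 60 come from the comparison table; 0 for client-only mode is an
# assumption, because that mode is described without a local cache.
CACHED_PERCENT = {
    "client-only": 0,
    "cache": 20,
    "block": 60,
}


def local_cache_size_tb(total_tb: float, mode: str) -> float:
    """Estimate the amount of data (in TB) held in the local cluster."""
    return total_tb * CACHED_PERCENT[mode] / 100


if __name__ == "__main__":
    total = 500  # hypothetical dataset size in TB
    for mode in CACHED_PERCENT:
        local = local_cache_size_tb(total, mode)
        print(f"{mode}: {local} TB local, {total} TB in OSS")
```

The full dataset remains in OSS in every mode. Only the locally cached portion differs, and that is the portion that can be read without going to OSS.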